Case study

When My Own Agent Said I Had No Weaknesses

A debugging case on the live RAG demo: the agent confidently said I had no weaknesses, even though I had written real failure arcs into every project doc. The retrieval was lying by omission.

AI WorkflowsPlatform Performance / Reliability

The challenge

A visitor asked the live demo "What are Arup's weaknesses?" and it answered that the corpus held no information about any, even though I had written real failure arcs into every project doc. On a portfolio whose whole pitch is honesty, a confident "no weaknesses found" is the worst answer it can give. It was lying by omission, and it was my own retrieval doing it.

Architecture approach

·Reproduced it against the test backend, not from the screenshot alone: the agentic path returned a fluent denial that cited five win-docs, so retrieval ran but pulled the wrong sections.
·Found the root cause: broad "about me" questions go through a deterministic fan-out that pre-retrieves evidence before the model answers. Setback questions were routed in correctly, but there was one query list and every query was phrased around wins (zero-downtime, latency, concurrency). Because each doc is chunked by heading, those queries kept landing on the Results sections and never the 'What went wrong' ones.
·Built a separate failure-targeted fan-out: one query per project that names the project and its failure language, so dense retrieval actually surfaces the 'what went wrong' chunk it was skipping.
·Caught a second bug while validating: the groundedness self-check only saw the six displayed citations, but the fan-out gathers eight, so a correct multi-project answer that drew on a seventh source got falsely flagged 'unverified.' Pointed the check at the evidence the model actually used.
·Wrote golden retrieval cases that lock each project's failure section as reachable. The suite had zero failure cases before this; it only ever tested wins, which is exactly why the regression shipped unseen.

Tech stack

BedrockPineconeNova ProEval harnessTypeScript

Results

On every run the weakness probe now answers with real, cited failures from five or six projects: the single-row migration stall I had to purge and re-run, the reconnect storm I caught in load testing, the agent's own boasting-doc misfire. It no longer claims there are none.
Golden retrieval eval at 23/23, now including four failure-recall cases that did not exist before.
Running that golden set surfaced two more regressions hiding in recent work: session-scoped uploads had silently broken five upload eval cases (uploads are now only reachable with a matching session, and the eval queried session-less), and the honesty rewrite had removed a '90%' stat a case still asserted. Both fixed; an obsolete guard test that encoded a removed 'reserved upload slot' mechanism was re-pointed at the current 'uploads compete, never pollute' contract.
Agent behavioral eval stabilized at 8/8 across repeated runs after I stopped gating it on a nondeterministic LLM judge.

What I'd do differently at production scale

·The denial shipped because my evals only tested wins. I had no 'ask about a failure' case, so nothing went red. At scale I'd treat the eval suite as part of the feature: every new content category ships with its own golden cases, and I'd write the negative and adversarial cases first.
·The self-check is an LLM-as-judge that still flips run-to-run at temperature zero. Gating CI on it produces flaky red for no signal. I'd make it advisory, log it and track it as a directional metric over many samples, and gate on deterministic checks like citation overlap or an NLI model instead.
·Dense-only retrieval will never naturally rank a section titled 'what went wrong' for the word 'weaknesses.' The fan-out is a deterministic patch over that gap. The durable fix is hybrid sparse-plus-dense retrieval, and indexing a short per-doc summary that states each failure in plain terms.
·The fan-out query list is hand-maintained per project, so it won't scale as the corpus grows. I'd generate those failure queries from the documents' own headings rather than hardcoding one per project.
·Two features broke evals that were simply never re-run. Those evals belong in CI on every change to the corpus or the retrieval library, so a feature can't merge while quietly turning a guard test red.