Case study
When My Own Agent Said I Had No Weaknesses
A debugging case on the live RAG demo: the agent confidently said I had no weaknesses, even though I had written real failure arcs into every project doc. The retrieval was lying by omission.
AI WorkflowsPlatform Performance / Reliability
The challenge
A visitor asked the live demo "What are Arup's weaknesses?" and it answered that the corpus held no information about any, even though I had written real failure arcs into every project doc. On a portfolio whose whole pitch is honesty, a confident "no weaknesses found" is the worst answer it can give. It was lying by omission, and it was my own retrieval doing it.
Architecture approach
- ·Reproduced it against the test backend, not from the screenshot alone: the agentic path returned a fluent denial that cited five win-docs, so retrieval ran but pulled the wrong sections.
- ·Found the root cause: broad "about me" questions go through a deterministic fan-out that pre-retrieves evidence before the model answers. Setback questions were routed in correctly, but there was one query list and every query was phrased around wins (zero-downtime, latency, concurrency). Because each doc is chunked by heading, those queries kept landing on the Results sections and never the 'What went wrong' ones.
- ·Built a separate failure-targeted fan-out: one query per project that names the project and its failure language, so dense retrieval actually surfaces the 'what went wrong' chunk it was skipping.
- ·Caught a second bug while validating: the groundedness self-check only saw the six displayed citations, but the fan-out gathers eight, so a correct multi-project answer that drew on a seventh source got falsely flagged 'unverified.' Pointed the check at the evidence the model actually used.
- ·Wrote golden retrieval cases that lock each project's failure section as reachable. The suite had zero failure cases before this; it only ever tested wins, which is exactly why the regression shipped unseen.
Tech stack
BedrockPineconeNova ProEval harnessTypeScript
Results
- On every run the weakness probe now answers with real, cited failures from five or six projects: the single-row migration stall I had to purge and re-run, the reconnect storm I caught in load testing, the agent's own boasting-doc misfire. It no longer claims there are none.
- Golden retrieval eval at 23/23, now including four failure-recall cases that did not exist before.
- Running that golden set surfaced two more regressions hiding in recent work: session-scoped uploads had silently broken five upload eval cases (uploads are now only reachable with a matching session, and the eval queried session-less), and the honesty rewrite had removed a '90%' stat a case still asserted. Both fixed; an obsolete guard test that encoded a removed 'reserved upload slot' mechanism was re-pointed at the current 'uploads compete, never pollute' contract.
- Agent behavioral eval stabilized at 8/8 across repeated runs after I stopped gating it on a nondeterministic LLM judge.
What I'd do differently at production scale
- ·The denial shipped because my evals only tested wins. I had no 'ask about a failure' case, so nothing went red. At scale I'd treat the eval suite as part of the feature: every new content category ships with its own golden cases, and I'd write the negative and adversarial cases first.
- ·The self-check is an LLM-as-judge that still flips run-to-run at temperature zero. Gating CI on it produces flaky red for no signal. I'd make it advisory, log it and track it as a directional metric over many samples, and gate on deterministic checks like citation overlap or an NLI model instead.
- ·Dense-only retrieval will never naturally rank a section titled 'what went wrong' for the word 'weaknesses.' The fan-out is a deterministic patch over that gap. The durable fix is hybrid sparse-plus-dense retrieval, and indexing a short per-doc summary that states each failure in plain terms.
- ·The fan-out query list is hand-maintained per project, so it won't scale as the corpus grows. I'd generate those failure queries from the documents' own headings rather than hardcoding one per project.
- ·Two features broke evals that were simply never re-run. Those evals belong in CI on every change to the corpus or the retrieval library, so a feature can't merge while quietly turning a guard test red.