Decisions

Architecture decisions

The reasoning behind the live Portfolio Ops system, recorded as decisions with their alternatives and their costs. You can check each one against the running system.

Every record names the alternatives I rejected and the costs I took on. Where a decision is visible in the live system, there is a link to go check it.

ADR-001Shipped

Pinecone serverless for vector search, not pgvector or OpenSearch

Context

The corpus is small: about ten documents and a few dozen chunks, plus short-lived visitor uploads. One person maintains it, and the demo has to stay cheap while idle. The retrieval layer should not turn into an operations job.

Decision

Pinecone serverless, with 1024-dim Titan v2 embeddings and cosine similarity. searchSimilar() reserves a few upload-priority slots so a visitor's freshly uploaded document stays reachable even when permanent corpus chunks score higher.

Alternatives considered

pgvector on RDS: another stateful service to run and scale, which is overkill at this volume.
OpenSearch or self-hosted vectors: heavier to operate and slower to stand up, with cost even at rest.
An in-memory index in the Lambda: nothing persists across cold starts, and uploads have no shared state.

Consequences

Almost no cost while idle, and no index to operate. That suits a solo-run demo.
Embeddings run in ap-south-2 while the index lives in us-east-1, so each search pays a cross-region hop. That latency shows up in Inspect.
If the corpus ever grew large, the right move is to colocate the regions and add sparse-dense vectors. Written down, not pre-built.

See per-query retrieval in /ops Inspect

ADR-002Shipped · evolving

Dense-only retrieval with a relevance floor; hybrid/BM25 is the earned next step

Context

Dense cosine handles semantic questions well; real matches in the golden suite score between 0.33 and 0.79. But a short, keyword-heavy query exposed the classic dense-only gap. When nothing relevant is indexed, the closest documents still come back around 0.10, which is noise, and the model, handed that thin context, can write a confident denial. On a demo whose whole pitch is transparent retrieval, that is the worst thing that can happen.

Decision

Keep dense-only as the base and add a calibrated relevance floor. When the top cosine score drops below about 0.20, the system says "no strong match in the index" instead of letting the model speak from noise. Hybrid retrieval (BM25 or sparse fused with dense) is written down as the next step. The failure earned it, rather than a roadmap guess.

Alternatives considered

Ship hybrid right away: more infrastructure and a reindex, so it was deferred until the floor proved insufficient.
Do nothing, and leave the model free to deny things from noise. Rejected, because it breaks the honesty invariant.

Consequences

Off-corpus questions now get an honest empty-context answer, with the top score shown in Inspect.
Keyword recall for rare tokens stays limited until hybrid lands. That gap is written down, not hidden.
The floor is a single configurable number, calibrated against the eval suite so it never trips a real match.

Top-score + below-floor signal in /ops Inspect

ADR-003Shipped

Amazon Nova Pro + Titan Embed v2 on Bedrock

Context

The system needs embeddings, generation, and tool-use on AWS-native infrastructure, with one IAM and billing surface and a predictable cost.

Decision

Titan Embed Text v2 (1024-dim) for embeddings and Nova Pro for generation, both on Bedrock. Generation goes through the Converse API, which gives native tool-use. That is what made the agentic loop possible without bolting on a separate framework.

Alternatives considered

External OpenAI or Anthropic APIs: cross-cloud egress, plus a second billing and key-management surface.
Self-hosted open models: inference operations and GPU cost a demo does not justify.

Consequences

One cloud, one IAM model, one bill, so least-privilege per Lambda stays simple.
Converse tool-use unlocked the agent directly (see ADR-004).
Model choice is now mostly about cost and latency rather than capability, and the readout in Inspect makes that explicit.

Token usage + timings in /ops Inspect

ADR-004Shipped

A bounded agentic loop with a self-check, not an open-ended ReAct agent

Context

The hero claim is agentic systems, so the demo should run a real agent. But the most common production failure for agents is the runaway loop: unbounded tool calls that burn latency and money. Multi-agent sprawl would add risk here without adding signal.

Decision

One read-only retrieve tool, a hard cap of three tool iterations, and an LLM-as-judge self-check that confirms the answer is grounded in retrieved context before it is finalized. Below-floor or ungrounded answers are flagged honestly, never invented. Every step is emitted as a visible trace.

Alternatives considered

An unbounded ReAct loop: open-ended cost and latency, the exact failure this guards against.
Multi-agent orchestration: more moving parts, no better answers at this scale.
No self-check: faster, but it lets ungrounded claims through.

Consequences

A predictable cost and latency ceiling per query, usually one or two tool calls.
Restraint instead of theatrics. The hard cap is the safety story, and it is what the agent-eval asserts on.
A genuinely hard question can stop a step early. That is an accepted trade for bounded, inspectable behavior.

Live trace + Agent Evals (7/7) in /ops

ADR-005Shipped

Session-scoped, TTL-expiring, quota-bounded visitor uploads

Context

Visitors can upload a document and query it live. Anything public that touches storage and a model has to be safe by default. It cannot leak across visitors, run up unbounded cost, or leave data lying around.

Decision

Uploads are scoped to a session and expire on a 24-hour TTL (an S3 lifecycle rule plus a Pinecone cleanup path). They sit behind per-day global upload and query quotas, with a corpus-busy lock during index mutations. The retrieve tool is read-only, so an instruction injected into an uploaded file has no action it can take.

Alternatives considered

Persistent uploads: storage, cost, and privacy obligations a demo should not carry.
No quotas: an open door to abuse and runaway spend.
A shared, unscoped index: cross-visitor data leakage.

Consequences

Safe to leave open to the public. The blast radius of a malicious upload is one short-lived session.
Ephemerality becomes a feature, since the teardown is part of the demo.
By design, it is not a durable multi-user document store.

Upload → query → delete in /ops Upload

ADR-006Shipped

The corpus panel is generated from the live index — credibility is the product

Context

The homepage panel that shows what's indexed started as a hardcoded list, and it drifted. It advertised documents that were never indexed while hiding the ones that were. A visitor could ask about an advertised document and watch the model correctly say it isn't there, a contradiction sitting two inches from the claim.

Decision

Generate the panel from a GET /corpus endpoint that lists what is actually in the vector store (a Pinecone prefix scan), with a real-corpus static fallback. The panel cannot advertise a document that isn't indexed.

Alternatives considered

A hardcoded or hand-curated list, which is the thing that drifted in the first place.

Consequences

The panel can never show a document that isn't there, and new documents appear once they are indexed.
One cheap, cached API call on load.
This is the honesty invariant in practice: the whole demo's value is that nothing on screen is faked.

The corpus panel on the homepage

ADR-007Shipped

Defense in depth against prompt injection (OWASP LLM01)

Context

The agent reads visitor-uploaded documents and its own corpus, then feeds that text to a model. Prompt injection is OWASP's top LLM risk: a document or a question that says "ignore your instructions and do X." Neither RAG nor fine-tuning removes it, so it needs layered defense rather than a single filter.

Decision

Four layers. The retrieve tool is read-only, so there is no privileged action to hijack. The system prompt treats the question and all retrieved text as data, never as instructions. An injection-pattern guard surfaces attempts in the trace so they are visible rather than silent. The grounding self-check and the relevance floor keep the answer tied to real sources, and input is stripped of control tokens before it reaches the model.

Alternatives considered

A single regex filter as the defense: brittle, and it gives false confidence. Detection here only adds visibility on top of the structural layers.
Trust the model to behave: injection is a known, repeatable failure, and hope is not a control.

Consequences

An injected instruction has no action it can take and does not change the agent's behavior; the attempt shows up in the trace as a guard step.
Pattern detection will miss novel phrasings, which is exactly why it is the visible layer and not the load-bearing one.
With session isolation and TTL (ADR-005), the blast radius of a malicious upload stays inside one short-lived session.

Type an injection into /ops Ask and watch the guard step