When Banco Núñez engaged us to build a legal research platform on top of sixty years of historical banking records, the core problem wasn't AI — it was access. The bank had spent decades accumulating correspondence, transaction logs, internal memos, and financial statements relevant to ongoing Helms-Burton litigation. The materials existed. Lawyers couldn't get to them fast enough when preparing depositions or building case strategy.
We built a Claude-powered agent that lets counsel ask plain-English questions and get cited answers across the full archive while preparing cases. This post covers what we built, why we picked the Anthropic Claude API directly, and three real problems we hit on the way to production.
What the system does
A lawyer types a question — in English or Spanish — into a chat interface. Claude responds with a sourced answer: the relevant excerpt, the document it came from, and the page reference. The lawyer can keep asking follow-up questions in the same session, and Claude maintains the context of the conversation so they don't have to keep re-establishing what case they're researching.
Behind the scenes, Claude doesn't just see one chunk of retrieved text and answer. It calls search tools, evaluates what came back, refines its search if needed, and synthesizes across multiple documents before responding.
The stack
- Claude Sonnet 4 / 4.5 via the direct Anthropic API
- Postgres with pgvector for semantic search across 1,500+ documents
- AWS Textract for OCR — including handwritten and aged paper records
- AWS S3 for encrypted document storage with role-based access control
- Postgres for conversation persistence across sessions
We chose the direct Anthropic API over Bedrock for two practical reasons: faster prompt iteration during development, and immediate access to the latest Claude model versions as they ship. For a build where prompt quality determines citation accuracy, that iteration speed mattered.
Why Claude
Three things made Claude the right reasoning layer for this:
Long context windows. Legal questions often need three or four full documents in context, not chunked snippets. Claude's context budget let us hand it complete documents when we needed to — which mattered for cross-document synthesis (connecting a 1972 transaction record to a 1981 letter to a 1994 financial statement).
Citation behavior under instruction. We found that Claude reliably grounds answers in retrieved source material when the system prompt is explicit about citation format (a condensed example appears just below). We invested heavily in that prompt design, but the model met us most of the way.
Bilingual handling. The archive is roughly half English, half Spanish. Lawyers query in either language. Older models would have needed routing logic or separate pipelines per language. Claude handles cross-language retrieval and answers natively.
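To make the citation point concrete, here's a condensed, hypothetical excerpt of the kind of system-prompt instruction we mean. The production prompt is longer and case-specific; the wording here is illustrative.

```python
# Condensed, hypothetical excerpt of the citation instructions in the system
# prompt. The production version is longer and case-specific.
SYSTEM_PROMPT = """\
You are a legal research assistant working over a historical banking archive.
Answer only from material returned by your search tools.
Cite every factual claim inline as [document_id, p. PAGE] immediately after
the sentence it supports, and quote the supporting excerpt verbatim.
If retrieved text looks garbled (OCR artifacts), flag the uncertainty instead
of citing it as fact. If the archive does not answer the question, say so.
Answer in the language the question was asked in, English or Spanish.
"""
```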
The agent loop
Claude has a small set of search tools available on every turn:
- search_documents(query) — semantic search via pgvector
- get_document_text(id) — full-text retrieval of a specific document
- list_related_docs(id) — surface documents similar to one already retrieved
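As a sketch of what the first tool looks like underneath: a minimal search_documents over pgvector, assuming a documents table with an embedding column and a hypothetical embed() helper that produces the query vector.

```python
# Minimal sketch of the search_documents tool. Assumes a `documents` table
# with a pgvector `embedding` column; embed() is a hypothetical helper that
# returns the query embedding as a list of floats.
import psycopg

def search_documents(query: str, k: int = 8) -> list[dict]:
    vec = "[" + ",".join(str(x) for x in embed(query)) + "]"  # pgvector text format
    sql = """
        SELECT id, title, page, excerpt
        FROM documents
        ORDER BY embedding <=> %s::vector  -- cosine distance
        LIMIT %s
    """
    with psycopg.connect("dbname=archive") as conn:
        rows = conn.execute(sql, (vec, k)).fetchall()
    return [dict(zip(("id", "title", "page", "excerpt"), row)) for row in rows]
```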
For a given question, Claude decides which tool to call, sees the results, and decides whether to call another. Simple lookups resolve quickly. Complex synthesis questions, like the 1972-to-1994 chain mentioned earlier, require multiple iterations as Claude gathers what it needs and decides whether it has enough to answer.
This is what we mean by "agentic RAG." Claude isn't just being stuffed with retrieved chunks and asked to answer. It's reasoning about what it needs, fetching it, and reasoning about whether what came back is enough.
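Here's a minimal sketch of that loop against the Messages API. The tool schemas and the run_tool() dispatcher are stand-ins for our production versions, and the model ID is illustrative.

```python
# Minimal sketch of the agent loop. TOOLS carries the JSON schemas for the
# three tools above; run_tool() is a hypothetical dispatcher that executes
# a tool call and returns a string result.
import anthropic

client = anthropic.Anthropic()

TOOLS = [
    {
        "name": "search_documents",
        "description": "Semantic search over the archive.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    # get_document_text and list_related_docs are declared the same way
]

def answer(question: str, messages: list) -> str:
    messages.append({"role": "user", "content": question})
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # illustrative model ID
            max_tokens=2048,
            system=SYSTEM_PROMPT,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # Claude decided it has enough to answer.
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute every requested tool call and hand the results back.
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```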
Three things that were harder than expected
1. OCR quality on aged and handwritten documents
A surprising amount of the archive consists of sixty-year-old typewriter pages, faded carbon copies, and handwritten margin notes. Textract handles modern documents beautifully but struggles with the older ones. Wrong characters, misread numbers, and dropped accents in Spanish text propagate forward: they end up in the embeddings, in retrieval results, and eventually in Claude's context.
We mitigated this in two layers. First, a post-OCR cleanup pass that flags low-confidence regions and runs them through a separate processing path. Second, we ask Claude in the system prompt to explicitly note OCR uncertainty when it sees garbled text in retrieved results, rather than confidently citing something that might be wrong. It's not perfect, but it's far better than treating OCR output as ground truth.
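The first layer looks roughly like the sketch below. The 90% threshold and the separate flagged queue are illustrative choices, not the tuned production values; Textract reports a per-block confidence score we key off.

```python
# Sketch of the post-OCR triage pass. Flagged low-confidence lines go down a
# separate processing path instead of being embedded as-is.
import boto3

textract = boto3.client("textract")

def triage_page(bucket: str, key: str, min_confidence: float = 90.0):
    resp = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    clean, flagged = [], []
    for block in resp["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        target = clean if block["Confidence"] >= min_confidence else flagged
        target.append(block["Text"])
    return clean, flagged
```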
2. Conversation context bloat
Conversation persistence sounds simple until users have a 30-turn session. By turn 15, the prompt is dragging along dozens of previously retrieved documents and earlier exchanges, and both latency and cost climb fast.
We solved this with a context management layer between the user and Claude: as a conversation grows, older turns get summarized into a compact context block, while the most recent turns and any document content from the current line of inquiry stay verbatim. We tested aggressively to find the right summarization boundaries — too eager and you lose context the lawyer needs; too lazy and the prompt explodes.
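In sketch form, the boundary logic is simple even though tuning it wasn't. KEEP_TURNS stands in for our tuned cutoff, and summarize() for the Claude call that produces the recap.

```python
# Sketch of the summarization boundary. KEEP_TURNS is an illustrative cutoff;
# summarize() is a hypothetical helper that asks Claude for a compact recap
# of the older turns.
KEEP_TURNS = 8  # most recent turns kept verbatim

def compact_history(messages: list) -> list:
    if len(messages) <= KEEP_TURNS:
        return messages
    older, recent = messages[:-KEEP_TURNS], messages[-KEEP_TURNS:]
    summary = summarize(older)
    return [
        {"role": "user", "content": f"Summary of the conversation so far:\n{summary}"},
        {"role": "assistant", "content": "Understood. I'll continue from that context."},
        *recent,
    ]
```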
The lesson: conversation persistence isn't a feature you turn on, it's a system you design. Anyone building production chat agents on Claude (or any LLM) hits this somewhere between turn 10 and turn 20.
3. Latency under tool-calling
Each tool call is a round trip. A complex query that needs several searches and a final synthesis is multiple API calls in series — which means total latency is the sum, not the max. A user asking a hard question can wait 8–12 seconds for an answer, which is a noticeable amount of time when you're staring at a chat interface.
We made progress on three fronts: parallelizing tool calls when Claude requests multiple at once (it sometimes does), pre-warming the embedding cache for documents the system predicts will be hit, and streaming the final synthesis back to the user so they see tokens arriving before the full answer is ready. None of this makes the underlying problem go away (agent loops are inherently higher-latency than single-shot RAG), but the perceived experience is materially better.
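The first of those three is the most code-shaped. A sketch, assuming a hypothetical async run_tool() dispatcher:

```python
# Sketch of parallel tool execution: when Claude returns several tool_use
# blocks in one turn, run them concurrently instead of serially.
import asyncio

async def execute_tool_calls(content_blocks) -> list[dict]:
    calls = [b for b in content_blocks if b.type == "tool_use"]
    outputs = await asyncio.gather(
        *(run_tool(block.name, block.input) for block in calls)
    )
    return [
        {"type": "tool_result", "tool_use_id": block.id, "content": out}
        for block, out in zip(calls, outputs)
    ]
```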
What we'd do differently
If we started over today, we'd invest in evaluation infrastructure on day one rather than month two. We eventually built a test set of representative legal questions with known correct answers, but only after we'd shipped to production. Having that harness from the beginning would have caught a class of citation regressions we instead discovered the slow way.
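For what it's worth, the harness doesn't need to be elaborate. Something shaped like the sketch below, run on every prompt change, would have caught those regressions. EVAL_SET and the must-cite check are assumptions about shape, not our actual test set; answer() is the agent entry point sketched earlier.

```python
# Sketch of a minimal citation-regression harness. EVAL_SET is illustrative.
EVAL_SET = [
    {"question": "What accounts did the 1972 correspondence reference?",
     "must_cite": ["doc_1042"]},
    # ... more cases with known correct citations
]

def run_evals() -> list:
    failures = []
    for case in EVAL_SET:
        reply = answer(case["question"], messages=[])
        missing = [d for d in case["must_cite"] if d not in reply]
        if missing:
            failures.append((case["question"], missing))
    return failures
```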
We'd also have tested earlier with the actual lawyers using the system, not just our internal team. Lawyers ask questions in a particular way that's quite different from how engineers think to test. The first few weeks of real usage taught us more about prompt design than the prior month of internal testing.
What's next
The Banco Núñez platform is now in active use across the firm's litigation team. We're running similar Claude deployments for two other clients — operational AI agents for SFM Services, and customer-segmentation reasoning for Nayo Mobile Grooming. Across all three, we're calling Claude through the direct Anthropic API.
If your organization has a large document archive that should be queryable in plain English — legal records, regulatory filings, internal knowledge bases, technical documentation — the same architecture applies.
Building something similar?
We design and build production AI agent systems on Claude for mid-market and enterprise clients. If you have a document archive, an operational workflow, or a customer-facing problem that would benefit from agentic AI, we're happy to talk through what's realistic and what it would take.
Schedule a Discovery Call →