When a RAG feature goes sideways, the base model alone is rarely the cause. The failure is usually in retrieval: wrong chunks, stale embeddings, or a prompt that no longer matches how your users phrase questions. Eval harnesses are how we surface those failures before customers find them.
Why evals are the product
Classic unit tests assert deterministic outputs. RAG systems are not deterministic in the same way, but they are still testable if you separate concerns: did we fetch the right evidence, did the model stay grounded, and did we meet latency budgets? We treat those as three different suites instead of one vague “quality score.”
That separation matters when you iterate quickly. If answer quality drops, you do not want a flaky LLM-as-judge score masking a retrieval bug, and you do not want a latency regression buried in an aggregate number. We keep retrieval metrics (recall@k and MRR on labeled queries) distinct from answer faithfulness checks and from latency measurements.
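As a rough illustration of the retrieval suite, here is a minimal sketch of recall@k and MRR over a labeled query set. The function names and the shape of the inputs (a mapping from query ID to ranked chunk IDs, and a mapping from query ID to the set of relevant chunk IDs) are assumptions made for this example, not a description of any particular harness.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(
    runs: Dict[str, List[str]],    # hypothetical: query ID -> ranked chunk IDs
    labels: Dict[str, Set[str]],   # hypothetical: query ID -> relevant chunk IDs
    k: int = 5,
) -> Dict[str, float]:
    """Average recall@k and MRR over every labeled query."""
    n = len(labels)
    if n == 0:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    recall = sum(recall_at_k(runs.get(q, []), rel, k) for q, rel in labels.items()) / n
    mrr = sum(reciprocal_rank(runs.get(q, []), rel) for q, rel in labels.items()) / n
    return {"recall_at_k": recall, "mrr": mrr}


if __name__ == "__main__":
    labels = {"q1": {"a", "b"}, "q2": {"c"}}
    runs = {"q1": ["x", "a", "b"], "q2": ["c"]}
    print(evaluate_retrieval(runs, labels, k=2))
```

A query with no retrieved results counts as a miss rather than being skipped, so a broken ingest path lowers the averages instead of silently shrinking the denominator.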
Golden sets that age well
Golden questions are only useful if they are maintained like code. We store them beside the service in version control with explicit owners, expected citations (document IDs and chunk ranges where possible), and tags for domain and difficulty.
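One way to picture a golden-set entry is as a small typed record with a validation pass that can run in CI. The sketch below is illustrative only: the field names (`owner`, `expected_citations`, `tags`, `expect_refusal`) and the `validate` helper are assumptions for this example, not a schema the article prescribes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ExpectedCitation:
    document_id: str
    # Inclusive (start, end) chunk indices; None when the range is not known.
    chunk_range: Optional[Tuple[int, int]] = None


@dataclass(frozen=True)
class GoldenQuestion:
    question_id: str
    question: str
    owner: str                                   # person or team responsible
    expected_citations: Tuple[ExpectedCitation, ...] = ()
    tags: Tuple[str, ...] = ()                   # e.g. domain and difficulty
    expect_refusal: bool = False                 # negative case: no citations expected


def validate(entries: List[GoldenQuestion]) -> List[str]:
    """Return human-readable problems; an empty list means the set is well formed."""
    problems: List[str] = []
    seen = set()
    for entry in entries:
        if entry.question_id in seen:
            problems.append(f"{entry.question_id}: duplicate id")
        seen.add(entry.question_id)
        if not entry.owner.strip():
            problems.append(f"{entry.question_id}: missing owner")
        if not entry.expect_refusal and not entry.expected_citations:
            problems.append(f"{entry.question_id}: no expected citations")
        for citation in entry.expected_citations:
            rng = citation.chunk_range
            if rng is not None and rng[0] > rng[1]:
                problems.append(f"{entry.question_id}: invalid chunk range {rng}")
    return problems
```

Keeping the check in code means a golden entry without an owner or citation fails review the same way a broken test would.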
- Minimum viable gold: start with 30–50 high-signal queries from real support logs or sales calls — not generic FAQs.
- Negative cases: include prompts that should refuse or escalate; grounding evals are as important as happy paths.
- Drift alarms: when the weekly eval pass rate drops by more than a few percentage points, block the release and diff the corpus ingest pipeline first.
If your eval set never embarrasses you, it is not representative.
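The drift alarm described in the list above can be sketched as a simple comparison of pass rates between two eval runs. The 3-point default threshold and the function names are assumptions chosen for illustration; the right threshold depends on the size and noise level of your golden set.

```python
from typing import List


def pass_rate(results: List[bool]) -> float:
    """Share of golden queries that passed, in the range 0.0 to 1.0."""
    return sum(results) / len(results) if results else 0.0


def should_block_release(
    previous: List[bool],
    current: List[bool],
    max_drop_points: float = 3.0,  # assumed threshold, in percentage points
) -> bool:
    """True when the pass rate fell by more than max_drop_points since the last run."""
    drop = (pass_rate(previous) - pass_rate(current)) * 100.0
    return drop > max_drop_points


if __name__ == "__main__":
    last_week = [True] * 10
    this_week = [True] * 9 + [False]
    if should_block_release(last_week, this_week):
        print("Pass rate dropped past the threshold: check the ingest pipeline first.")
```

With a small golden set, a single failing query can move the pass rate by several points, so the threshold should be set with the set size in mind.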
Gates in CI, not in Slack
We run lightweight retrieval checks on every pull request that touches ingest, chunking, metadata, or the query planner. Heavier LLM-judge suites run nightly or on demand because they are slower and cost real tokens.
CI gates publish a short markdown summary to the PR: top failing queries, whether failures cluster on a tenant or locale, and links to the trace IDs. The goal is to make fixes obvious without asking people to dig through notebooks.
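A minimal sketch of how such a PR summary might be rendered is shown below. The `QueryResult` fields, the `render_summary` function, and the trace URL prefix are all hypothetical; posting the resulting markdown to the pull request is left to whatever CI system you use.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class QueryResult:
    query: str
    passed: bool
    tenant: str
    locale: str
    trace_id: str


def render_summary(results: List[QueryResult], trace_url_prefix: str, top_n: int = 5) -> str:
    """Build a short markdown report: pass count, failing queries, and clustering."""
    failures = [r for r in results if not r.passed]
    passed = len(results) - len(failures)
    lines = [f"**Retrieval evals:** {passed}/{len(results)} passed", ""]
    if not failures:
        lines.append("No failing queries.")
        return "\n".join(lines)

    # Failures are listed in input order; sort upstream if you want worst-first.
    lines.append(f"Failing queries (first {min(top_n, len(failures))}):")
    for r in failures[:top_n]:
        lines.append(f"- `{r.query}` ([trace {r.trace_id}]({trace_url_prefix}{r.trace_id}))")

    # Show whether failures cluster on a particular tenant or locale.
    for label, key in (("tenant", lambda r: r.tenant), ("locale", lambda r: r.locale)):
        counts = Counter(key(r) for r in failures).most_common()
        breakdown = ", ".join(f"{name}: {count}" for name, count in counts)
        lines.append("")
        lines.append(f"Failures by {label}: {breakdown}")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        QueryResult("reset password", False, "acme", "en-US", "t1"),
        QueryResult("refund policy", True, "acme", "en-US", "t2"),
        QueryResult("sso setup", False, "acme", "de-DE", "t3"),
    ]
    print(render_summary(demo, "https://traces.example/"))
```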
Human spot-checks that scale
Automation cannot catch everything, especially subjective tone or clinical nuance in regulated domains. We sample a fixed percentage of production traffic (subject to consent and redaction policies) for human review, stratified by the confidence scores the pipeline reports, so that low-confidence answers are not drowned out by the easy majority.
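Stratified sampling of this kind can be sketched as follows. The band edges (0.5 and 0.8), the sampling rate, and the record shape (trace ID plus confidence score) are assumptions for illustration; consent filtering and redaction are assumed to have happened before these records reach the sampler.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def confidence_band(score: float, edges: Sequence[float] = (0.5, 0.8)) -> int:
    """Index of the confidence band a score falls into; 0 is the lowest band."""
    return sum(score >= edge for edge in edges)


def stratified_sample(
    records: List[Tuple[str, float]],  # hypothetical: (trace ID, confidence score)
    rate: float,
    seed: int = 0,
) -> List[str]:
    """Draw the same fraction of traces from every confidence band."""
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    rng = random.Random(seed)  # seeded so a review batch can be reproduced
    bands: Dict[int, List[str]] = defaultdict(list)
    for trace_id, score in records:
        bands[confidence_band(score)].append(trace_id)

    sampled: List[str] = []
    for band in sorted(bands):
        ids = bands[band]
        # Take at least one trace from every non-empty band.
        n = max(1, round(len(ids) * rate))
        sampled.extend(rng.sample(ids, n))
    return sampled
```

Sampling per band rather than over the whole population keeps rare low-confidence traffic in front of reviewers even when most answers score high.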
Reviewers use a one-page rubric so scores stay comparable month to month. Findings feed back into gold sets so the harness tightens over time instead of rotting.
Closing the loop
The pattern is simple: treat evaluation artifacts as production dependencies, version them alongside the code, and fail builds when behavior drifts. The hard part is cultural: teams have to agree that slowing a release beats shipping silent retrieval regressions. We have found that once engineers see a single eval catch a real outage, the investment pays for itself.
If you are standing up RAG and want a second pair of eyes on your eval plan, tell us what you are shipping — we are happy to help you prioritize the first metrics that actually protect users.