
RAG Evaluation RAGAS Agentic-RAG

As RAG moved into production in 2024, the question of how to evaluate it became just as important as how to build it. Retriever-generator alignment, context trustworthiness, Agentic RAG evaluation: here's where the field stands as of April 2026.

2024 → 2026: What Changed

| Area | 2024 | 2026 |
| --- | --- | --- |
| Evaluation axes | Retrieval + Generation | Retrieval + Generation + Context Trustworthiness |
| New metrics | nDCG, Precision@K, Hit Rate | + CUE, Citation Coverage, Tool Selection Accuracy |
| Evaluation targets | Standard RAG | + Agentic RAG, GraphRAG, Hybrid RAG |
| Evaluation method | LLM-as-Judge (early stage) | Dimension-isolated evaluation + cost optimization standardized |
| CI/CD integration | Mostly manual | Automated quality gates now the norm |
| Blind spots discovered | Unknown | Pragmatic Misleading, Accuracy Fallacy |

The core shift is clear: from “how well did it retrieve, and how well did it generate?” toward “how well are retrieval and generation aligned, and can the retrieved information actually be trusted?”


The 5 Core Metrics and 2026 Production Thresholds

Research published in JMLR shows that retrieval accuracy alone explains only 60% of RAG quality variance. The remaining 40% comes from how well the model utilizes the retrieved context.

| Metric | Measures | Recommended Threshold | Low Score Diagnosis |
| --- | --- | --- | --- |
| Faithfulness | Generation | 0.8+ (regulated industries: 0.9+) | Model is filling gaps with training knowledge |
| Answer Relevance | Generation | 0.75+ | Relevant but imprecise chunks being retrieved |
| Context Precision | Retrieval | 0.7+ | Re-ranking stage needed |
| Context Recall | Retrieval | 0.75+ | Chunk size too small or top-K too low |
| Hallucination Rate | Generation | <5% | Check recent document ingestion quality |

A Faithfulness score of 0.6 means roughly 40% of statements in the answer are not grounded in the retrieved content. That is hallucination in the strictest technical sense.
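To make the arithmetic concrete, here is a minimal sketch of faithfulness as the share of answer statements supported by the retrieved context. The `Claim` type, the `passes_gate` helper, and the example claims are hypothetical. The sketch assumes the claims were already extracted and labelled as supported or not (for example by a judge model), which is the hard part and is not shown.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    supported: bool  # True if the retrieved context backs this statement


def faithfulness(claims: list[Claim]) -> float:
    """Fraction of answer statements grounded in the retrieved context."""
    if not claims:
        raise ValueError("need at least one claim")
    return sum(c.supported for c in claims) / len(claims)


def passes_gate(score: float, regulated: bool = False) -> bool:
    # 0.8 and 0.9 are the recommended thresholds from the table above.
    return score >= (0.9 if regulated else 0.8)


# Hypothetical labelled claims: 3 of 5 are supported by the context.
claims = [
    Claim("Drug X is taken once daily.", True),
    Claim("Drug X interacts with warfarin.", True),
    Claim("Drug X is safe in pregnancy.", False),
    Claim("Drug X was approved in 2019.", True),
    Claim("Drug X needs no renal dose adjustment.", False),
]
score = faithfulness(claims)
print(score, passes_gate(score))  # 0.6 False
```

With two of five statements unsupported, the score is 0.6, which fails both the general and the regulated-industry threshold.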

Production Monitoring Thresholds

| Metric | Alert Threshold | Check |
| --- | --- | --- |
| Faithfulness (sampled) | < 0.75 | Recent document ingestion quality |
| Answer Relevance (sampled) | < 0.70 | Query distribution shift |
| Hallucination Rate | > 5% | Retrieval coverage for new query types |
| P95 Retrieval Latency | > 500ms | Index size, embedding model load |
| Context Utilization | < 40% | Chunk size, overlap settings (new in 2026) |
| User Negative Feedback Rate | > 10% | Audit all of the above |
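The alert table translates fairly directly into a rule set. The sketch below is one possible encoding, not a reference implementation: the metric keys, the `Rule` type, and the `alerts` function are made-up names, and percentages are assumed to be expressed as fractions (5% as 0.05).

```python
import operator
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    breached: Callable[[float, float], bool]  # comparison that signals an alert
    threshold: float
    check: str  # what to investigate when the rule fires


# Thresholds copied from the table above; key names are illustrative.
RULES = {
    "faithfulness": Rule(operator.lt, 0.75, "recent document ingestion quality"),
    "answer_relevance": Rule(operator.lt, 0.70, "query distribution shift"),
    "hallucination_rate": Rule(operator.gt, 0.05, "retrieval coverage for new query types"),
    "p95_retrieval_latency_ms": Rule(operator.gt, 500, "index size, embedding model load"),
    "context_utilization": Rule(operator.lt, 0.40, "chunk size, overlap settings"),
    "negative_feedback_rate": Rule(operator.gt, 0.10, "audit all of the above"),
}


def alerts(snapshot: dict[str, float]) -> list[str]:
    """Return one message per metric in the snapshot that crosses its alert threshold."""
    messages = []
    for name, value in snapshot.items():
        rule = RULES.get(name)
        if rule is not None and rule.breached(value, rule.threshold):
            messages.append(f"{name}={value}: check {rule.check}")
    return messages


if __name__ == "__main__":
    sample = {"faithfulness": 0.72, "hallucination_rate": 0.03, "p95_retrieval_latency_ms": 640}
    for message in alerts(sample):
        print(message)
```

Keeping the comparison direction in the rule avoids the common mistake of treating every metric as "higher is better".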

Key New Research in 2026

RAG-E (Jan 2026) — Measuring Retriever-Generator Alignment

Empirical analysis across TREC CAsT and FoodSafeSum revealed a striking misalignment:

  • In 47.4–66.7% of queries, the generator ignored the retriever’s top-ranked documents
  • In 48.1–65.9% of queries, it relied on lower-ranked documents instead

The implication is direct: evaluating the retriever and generator in isolation is not enough. Retriever-generator alignment must be treated as its own evaluation dimension.
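A rough proxy for this alignment dimension can be computed from per-query traces. The sketch below is not the RAG-E methodology. It is a simplified illustration that assumes you already know which retrieved documents each answer drew on (for example from citations), and `QueryTrace` and `alignment_report` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class QueryTrace:
    ranked_doc_ids: list[str]  # retriever output, best first
    cited_doc_ids: set[str]    # documents the answer actually drew on


def alignment_report(traces: list[QueryTrace]) -> dict[str, float]:
    """Share of queries where the top hit was ignored or a lower-ranked hit was used."""
    usable = [t for t in traces if t.ranked_doc_ids]  # skip queries with no retrieval
    if not usable:
        raise ValueError("need at least one trace with retrieved documents")
    ignored_top = 0
    used_lower = 0
    for trace in usable:
        top, lower = trace.ranked_doc_ids[0], set(trace.ranked_doc_ids[1:])
        if top not in trace.cited_doc_ids:
            ignored_top += 1
        if trace.cited_doc_ids & lower:
            used_lower += 1
    return {
        "top_ignored_rate": ignored_top / len(usable),
        "lower_ranked_used_rate": used_lower / len(usable),
    }


if __name__ == "__main__":
    traces = [
        QueryTrace(["a", "b", "c"], {"a"}),
        QueryTrace(["a", "b", "c"], {"b", "c"}),
        QueryTrace(["d", "e"], {"e"}),
        QueryTrace(["f", "g"], {"f", "g"}),
    ]
    print(alignment_report(traces))
```

Tracking these two rates alongside retrieval and generation scores makes misalignment visible even when both components look healthy on their own.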

RAG-X (Mar 2026) — Diagnostic Framework for Medical QA

This paper introduced Context Utilization Efficiency (CUE) as a new metric.

“Evaluating top-performing RAG pipelines, we found that 22% of retrieved evidence was duplicated — wasting the model’s limited context window.”

The key concept is the ‘Accuracy Fallacy’: a scenario where a system appears highly accurate but is in fact not grounded. CUE measures the proportion of retrieved context that actually contributed to the answer, exposing this blind spot.
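As an illustration of the idea, the sketch below computes two simple ratios: the share of retrieved chunks that duplicate an earlier chunk, and the share that contributed to the answer. This is an assumed, simplified reading of CUE rather than the paper's exact definition. Duplicates are detected by exact match after whitespace and case normalization, and which chunks "contributed" is taken as given in `used_indices`.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def duplicate_rate(chunks: list[str]) -> float:
    """Share of retrieved chunks that repeat an earlier chunk."""
    if not chunks:
        return 0.0
    seen: set[str] = set()
    duplicates = 0
    for chunk in chunks:
        key = normalize(chunk)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(chunks)


def context_utilization(chunks: list[str], used_indices: set[int]) -> float:
    """Share of retrieved chunks that contributed to the answer."""
    if not chunks:
        return 0.0
    valid = {i for i in used_indices if 0 <= i < len(chunks)}
    return len(valid) / len(chunks)


if __name__ == "__main__":
    chunks = [
        "Aspirin thins blood.",
        "aspirin  thins blood.",  # duplicate after normalization
        "Ibuprofen is an NSAID.",
        "Store below 25C.",
        "Take with food.",
    ]
    print(duplicate_rate(chunks))               # 0.2
    print(context_utilization(chunks, {0, 2}))  # 0.4
```

A real pipeline would likely need near-duplicate detection (for example embedding similarity) and an attribution step to decide which chunks were used.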

SoK: Agentic RAG (Mar 2026)

“Traditional metrics evaluate the ‘engine’ — the LLM’s final text output. Agentic evaluation must assess the ‘car’ — the entire system’s behavior across planning, tool use, and environmental interaction.”

BLEU and ROUGE focus on lexical overlap. They cannot capture the interactive, iterative behavior of agentic systems.
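A small experiment shows why lexical overlap falls short. The function below is a simplified ROUGE-1-style unigram F1 (no stemming, no stopword handling), written for illustration only. The example sentences are invented.

```python
import re
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 over lowercase word tokens (simplified)."""
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "The agent should call the refund tool before closing the ticket."
wrong_order = "The agent should call the refund tool after closing the ticket."
paraphrase = "First issue the refund, then close the ticket."

if __name__ == "__main__":
    print(round(unigram_f1(wrong_order, reference), 3))
    print(round(unigram_f1(paraphrase, reference), 3))
```

The answer that reverses the order of operations scores far higher than the correct paraphrase, because only one word differs. For an agent, that one word is the difference between the right and the wrong action sequence.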


The 6 Fundamental RAG Failure Modes

Analysis of dozens of real-world RAG implementations found that systems don’t fail in countless random ways — they fail in exactly 6 fundamental ways.

  1. Bad retrieval — irrelevant documents are fetched
  2. Poor ranking — relevant documents exist but rank too low to be used
  3. Context overload — too much context causes the model to miss what matters
  4. Stale knowledge — the index isn’t up to date
  5. Evaluation blind spots — the metrics themselves have limits
  6. Retriever-generator misalignment — good retrieval, but generation ignores it

Standard RAG’s Blind Spots

The “Pragmatically Misleading” Problem

A 2025 study found that Microsoft Copilot provided medically incorrect or potentially harmful advice in 26% of questions about the 50 most commonly prescribed medications.

More troubling: a separate study found that RAG systems can remain “pragmatically misleading” even when citing accurate sources without hallucination — by decontextualizing facts, omitting critical sources, or reinforcing patient misunderstandings.

Standard RAG metrics (faithfulness, relevance) would score these outputs as passing. A domain expert would not.

Context Trustworthiness — The 5th Evaluation Dimension

The standard four metrics (faithfulness, relevance, precision, recall) assume a trustworthy index. But the trustworthiness of the index itself — ownership, freshness, provenance integrity — goes unmeasured.

A system can score 0.95 on faithfulness and still return wrong business answers if the retrieved content is outdated or misaligned with authoritative sources.
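One way to start measuring this dimension is to attach trust metadata to each chunk and check it at retrieval time. The sketch below is an assumption about how that could look: the `Chunk` fields, the 180-day freshness window, and the issue labels are all hypothetical choices, not an established standard.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Chunk:
    text: str
    owner: Optional[str]       # team or person accountable for the content
    source_uri: Optional[str]  # where the content came from
    last_verified: date        # last time someone confirmed it is still correct


def trust_issues(
    chunk: Chunk,
    today: date,
    max_age: timedelta = timedelta(days=180),  # assumed freshness window
) -> list[str]:
    """Return the trust problems found for one retrieved chunk."""
    issues = []
    if not chunk.owner:
        issues.append("no owner")
    if today - chunk.last_verified > max_age:
        issues.append("stale")
    if not chunk.source_uri:
        issues.append("no provenance")
    return issues


if __name__ == "__main__":
    today = date(2026, 4, 1)
    chunk = Chunk("Refunds are processed in 5 days.", None, None, date(2024, 1, 1))
    print(trust_issues(chunk, today))  # ['no owner', 'stale', 'no provenance']
```

Chunks with issues can be dropped, down-weighted, or surfaced in the answer as low-confidence sources, depending on how strict the domain is.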


Tool Selection Guide for 2026

| Framework | Best For | CI/CD Integration | License |
| --- | --- | --- | --- |
| Ragas | Quick experiments, standard metrics | Manual setup | Apache 2.0 |
| DeepEval | CI/CD testing, production gates | pytest-native | Apache 2.0 |
| TruLens | Dev-time monitoring, A/B experiments | Not supported | MIT |
| Maxim AI | All-in-one platform | Auto tracing | Commercial |
| Phoenix | Observability-focused | Available | Open source |

Recommended stack:

  • CI/CD quality gates → DeepEval
  • Metric exploration + synthetic dataset generation → Ragas
  • Production monitoring → TruLens or Langfuse

5 Common Evaluation Mistakes

1. Using the same model to generate and evaluate
If GPT-4o both writes the answer and scores it, scores inflate. Use a different model, or at least a different model size, as the judge.

2. Skipping component-level evaluation
End-to-end accuracy tells you something is wrong. Separating retrieval and generation metrics tells you where.

3. Using evaluation metrics as optimization targets
Metrics are for tracking quality, not for optimizing against. Once a metric becomes the target, the system learns to game it rather than to improve.

4. Evaluating multiple dimensions in a single prompt
Don’t ask one LLM call to assess context relevance, faithfulness, and answer relevance simultaneously. Isolated rubrics per dimension produce more consistent results.

5. Skipping human review of synthetic datasets
LLMs generate plausible-looking test cases that contain factual errors or reference content that doesn’t exist.
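Mistakes 1 and 4 can both be guarded against in the evaluation harness itself. The sketch below is library-agnostic: `judge` stands in for whatever LLM client you use (a hypothetical callable taking a prompt and returning a score), and the rubric wording is illustrative rather than a tested prompt.

```python
from typing import Callable

# One rubric per dimension, so each judge call assesses exactly one thing.
RUBRICS = {
    "context_relevance": "Rate from 0 to 1 how relevant the CONTEXT is to the QUESTION.",
    "faithfulness": "Rate from 0 to 1 how fully the ANSWER is supported by the CONTEXT.",
    "answer_relevance": "Rate from 0 to 1 how directly the ANSWER addresses the QUESTION.",
}

Judge = Callable[[str], float]  # prompt in, score out


def evaluate(
    question: str,
    context: str,
    answer: str,
    judge: Judge,
    generator_model: str,
    judge_model: str,
) -> dict[str, float]:
    """Score one sample with a separate judge call per dimension."""
    if generator_model == judge_model:
        raise ValueError("judge model should differ from the generator model")
    scores = {}
    for dimension, rubric in RUBRICS.items():
        prompt = (
            f"{rubric}\n\nQUESTION: {question}\n"
            f"CONTEXT: {context}\nANSWER: {answer}\nScore:"
        )
        scores[dimension] = judge(prompt)  # isolated call for this dimension only
    return scores
```

This costs three judge calls per sample instead of one, which is where sampling and cheaper judge models come in as cost controls.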

