Evaluation

Measure retrieval quality on your own data, prove changes help, and compare two knowledge bases on the same golden set. Source: ragforge/evaluation/.

Metrics

Metric            What it measures
hit_rate          Did any relevant chunk appear in top-k? 1.0 or 0.0.
precision_at_k    Fraction of retrieved chunks that were relevant.
recall_at_k       Fraction of all relevant chunks retrieved.
mrr               Mean Reciprocal Rank: how high up was the first relevant chunk?
faithfulness      LLM-judge: is the answer grounded in context? 0–1.
answer_relevance  LLM-judge: does the answer address the question? 0–1.
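
The retrieval metrics above can be sketched for a single question as follows. This is a minimal illustration of the standard definitions, not ragforge's own implementation: the function names, the assumption that retrieved chunk IDs are unique, and the choice to divide precision by the number of chunks actually returned are all assumptions made for this example. mrr is then the mean of the per-question reciprocal ranks.

```python
from typing import Sequence, Set


def hit_rate(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant chunk appears in the top-k, else 0.0."""
    return 1.0 if any(cid in relevant for cid in retrieved[:k]) else 0.0


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(cid in relevant for cid in top_k) / len(top_k)


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant chunk in the top-k, or 0.0 if none."""
    for rank, cid in enumerate(retrieved[:k], start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


# Illustrative data only: chunk IDs are made up for this example.
retrieved = ["x9", "a1b2c3d4", "z0", "b2c3d4e5", "q7"]
relevant = {"a1b2c3d4", "b2c3d4e5"}

print(hit_rate(retrieved, relevant, k=5))         # 1.0
print(precision_at_k(retrieved, relevant, k=5))   # 0.4
print(recall_at_k(retrieved, relevant, k=5))      # 1.0
print(reciprocal_rank(retrieved, relevant, k=5))  # 0.5
```

With k=1 the same query scores 0.0 on every metric, because the first retrieved chunk is not relevant. This is why results are only comparable when k is held fixed.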

Golden dataset format

The golden dataset is a JSON file (or a CSV file with the same column names). Only the question field is required.

golden.json
[
  {
    "question":           "What is the refund window for electronics?",
    "expected_answer":    "14 days",
    "relevant_chunk_ids": ["a1b2c3d4", "b2c3d4e5"],
    "relevant_sources":   ["refund_policy.md"],
    "notes":              "From the Electronics section"
  },
  {
    "question": "Is free shipping available?"
  }
]
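
To make the "only question is required" rule concrete, here is a small standalone sketch that parses a golden set in the JSON shape shown above and fills in defaults for the optional fields. GoldenItem and parse_golden are hypothetical names used for illustration; they are not part of ragforge, whose GoldenDataset.load may validate differently.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GoldenItem:
    """One golden-set entry; everything except `question` is optional."""
    question: str
    expected_answer: Optional[str] = None
    relevant_chunk_ids: List[str] = field(default_factory=list)
    relevant_sources: List[str] = field(default_factory=list)
    notes: Optional[str] = None


def parse_golden(text: str) -> List[GoldenItem]:
    """Parse a JSON array of golden entries (hypothetical helper)."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("golden dataset must be a JSON array")
    items = []
    for i, rec in enumerate(records):
        question = rec.get("question") if isinstance(rec, dict) else None
        if not isinstance(question, str) or not question.strip():
            raise ValueError(f"entry {i}: 'question' is required")
        items.append(GoldenItem(
            question=question,
            expected_answer=rec.get("expected_answer"),
            relevant_chunk_ids=list(rec.get("relevant_chunk_ids", [])),
            relevant_sources=list(rec.get("relevant_sources", [])),
            notes=rec.get("notes"),
        ))
    return items


sample = """
[
  {"question": "What is the refund window for electronics?",
   "expected_answer": "14 days",
   "relevant_chunk_ids": ["a1b2c3d4", "b2c3d4e5"]},
  {"question": "Is free shipping available?"}
]
"""
items = parse_golden(sample)
print(len(items))                   # 2
print(items[1].relevant_chunk_ids)  # []
```

An entry with no relevant_chunk_ids has nothing to score retrieval metrics against, so a reviewer may want to flag such entries before an evaluation run.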

Run from the CLI

bash
ragforge eval run my-kb golden.json
ragforge eval run my-kb golden.json --metrics hit_rate,mrr --mode hybrid -k 5
ragforge eval run my-kb golden.json --generate --llm ollama  # include judge metrics
ragforge eval compare my-kb-v1 my-kb-v2 golden.json

Run from Python

python
from ragforge.pipeline import KnowledgeBase
from ragforge.evaluation import GoldenDataset, Evaluator, RETRIEVAL_METRICS

# Load the knowledge base and the golden dataset to evaluate against.
kb = KnowledgeBase.load("my-kb")
golden = GoldenDataset.load("golden.json")

# Run the retrieval metrics over the golden set and print the results.
evaluator = Evaluator(kb)
report = evaluator.run(golden, metrics=RETRIEVAL_METRICS, top_k=5, mode="hybrid")
report.print_table()

# Compare a second knowledge base against the first on the same golden set.
kb_v2 = KnowledgeBase.load("my-kb-v2")
comparison = Evaluator.compare(kb, kb_v2, golden, metrics=RETRIEVAL_METRICS)
Evaluator.print_comparison(comparison)
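
Conceptually, comparing two knowledge bases on the same golden set comes down to averaging each metric over the questions and looking at the difference. The sketch below shows that idea in plain Python. It is an assumption about how such a comparison could be computed, not a description of what Evaluator.compare does internally, and aggregate and metric_deltas are hypothetical helper names.

```python
from statistics import mean
from typing import Dict, List


def aggregate(per_question: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all questions in one evaluation run."""
    if not per_question:
        return {}
    names = per_question[0].keys()
    return {name: mean(row[name] for row in per_question) for name in names}


def metric_deltas(baseline: Dict[str, float],
                  candidate: Dict[str, float]) -> Dict[str, float]:
    """Candidate minus baseline for every metric present in both runs."""
    return {name: candidate[name] - baseline[name]
            for name in baseline if name in candidate}


# Made-up per-question scores for two knowledge bases on the same two questions.
v1_scores = [{"hit_rate": 1.0, "mrr": 0.5}, {"hit_rate": 0.0, "mrr": 0.0}]
v2_scores = [{"hit_rate": 1.0, "mrr": 1.0}, {"hit_rate": 1.0, "mrr": 0.5}]

v1 = aggregate(v1_scores)
v2 = aggregate(v2_scores)
print(v1)                     # {'hit_rate': 0.5, 'mrr': 0.25}
print(v2)                     # {'hit_rate': 1.0, 'mrr': 0.75}
print(metric_deltas(v1, v2))  # {'hit_rate': 0.5, 'mrr': 0.5}
```

A positive delta means the second knowledge base scored higher on that metric. On a small golden set a difference can come from one or two questions, so inspecting per-question scores alongside the averages is worthwhile.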

Bootstrap a draft golden set

Generate a draft golden set from an existing KB using an LLM. The output is marked DRAFT; always review it before using it as ground truth.

bash
ragforge eval bootstrap my-kb --n 20 --llm ollama --out draft_golden.json