Evaluation
Measure retrieval quality on your own data, prove changes help, and compare two knowledge bases on the same golden set. Source: ragforge/evaluation/.
Metrics
| Metric | What it measures |
|---|---|
| hit_rate | Did any relevant chunk appear in the top-k? 1.0 or 0.0. |
| precision_at_k | Fraction of the retrieved top-k chunks that are relevant. |
| recall_at_k | Fraction of all relevant chunks that were retrieved in the top-k. |
| mrr | Mean Reciprocal Rank: 1/rank of the first relevant chunk, averaged over all questions. |
| faithfulness | LLM-judge: is the answer grounded in the retrieved context? 0–1. |
| answer_relevance | LLM-judge: does the answer address the question? 0–1. |
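
As a rough illustration of how the four retrieval metrics relate, the sketch below computes them for a single question in plain Python. It is not ragforge's implementation: the function names and signatures are hypothetical, the chunk IDs are made up, and it assumes precision divides by the number of chunks actually returned in the top-k.

```python
from typing import Sequence, Set


def hit_rate(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant chunk appears in the top-k results, else 0.0."""
    return 1.0 if any(cid in relevant for cid in retrieved[:k]) else 0.0


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    # Assumption: divide by the number of chunks returned, not by k itself.
    return sum(cid in relevant for cid in top_k) / len(top_k)


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1/rank of the first relevant chunk in the top-k, or 0.0 if there is none."""
    for rank, cid in enumerate(retrieved[:k], start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranked results for one question; the first relevant chunk is at rank 2.
retrieved = ["zzzz0000", "a1b2c3d4", "yyyy1111", "b2c3d4e5", "xxxx2222"]
relevant = {"a1b2c3d4", "b2c3d4e5"}

print(hit_rate(retrieved, relevant, k=3))         # 1.0
print(precision_at_k(retrieved, relevant, k=3))   # one of three retrieved is relevant
print(recall_at_k(retrieved, relevant, k=3))      # 0.5
print(reciprocal_rank(retrieved, relevant, k=3))  # 0.5
```

Averaging `reciprocal_rank` over every question in the golden set gives mrr; the other three are typically averaged the same way to produce one number per metric.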
Golden dataset format
JSON, or CSV with the same column names. Only `question` is required; every other field is optional.
golden.json
```json
[
  {
    "question": "What is the refund window for electronics?",
    "expected_answer": "14 days",
    "relevant_chunk_ids": ["a1b2c3d4", "b2c3d4e5"],
    "relevant_sources": ["refund_policy.md"],
    "notes": "From the Electronics section"
  },
  {
    "question": "Is free shipping available?"
  }
]
```

Run from the CLI
```bash
ragforge eval run my-kb golden.json
ragforge eval run my-kb golden.json --metrics hit_rate,mrr --mode hybrid -k 5
ragforge eval run my-kb golden.json --generate --llm ollama  # include judge metrics
ragforge eval compare my-kb-v1 my-kb-v2 golden.json
```

Run from Python
```python
from ragforge.pipeline import KnowledgeBase
from ragforge.evaluation import GoldenDataset, Evaluator, RETRIEVAL_METRICS

# Load the knowledge base and the golden dataset.
kb = KnowledgeBase.load("my-kb")
golden = GoldenDataset.load("golden.json")

# Score retrieval only, using the top 5 results in hybrid mode.
evaluator = Evaluator(kb)
report = evaluator.run(golden, metrics=RETRIEVAL_METRICS, top_k=5, mode="hybrid")
report.print_table()

# Compare two knowledge bases on the same golden set.
kb_v2 = KnowledgeBase.load("my-kb-v2")
comparison = Evaluator.compare(kb, kb_v2, golden, metrics=RETRIEVAL_METRICS)
Evaluator.print_comparison(comparison)
```

Bootstrap a draft golden set
Generate a draft from an existing KB using an LLM. The output is marked DRAFT; always review it before using it as ground truth.
```bash
ragforge eval bootstrap my-kb --n 20 --llm ollama --out draft_golden.json
```
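
A quick structural check can complement the manual review of a draft. The sketch below is a hypothetical helper, not part of ragforge. It assumes the draft file uses the same JSON array layout as the golden.json example above, which this page does not confirm, and it infers from that example that `relevant_chunk_ids` and `relevant_sources` are lists of strings.

```python
import json
from pathlib import Path
from typing import Any, List


def check_golden_entries(entries: List[Any]) -> List[str]:
    """Return human-readable problems found in a list of golden entries."""
    problems: List[str] = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: expected an object, got {type(entry).__name__}")
            continue
        # "question" is the only required field.
        question = entry.get("question")
        if not isinstance(question, str) or not question.strip():
            problems.append(f"entry {i}: missing or empty 'question'")
        # Assumption: these optional fields are lists of strings, as in the example file.
        for field in ("relevant_chunk_ids", "relevant_sources"):
            value = entry.get(field)
            if value is None:
                continue
            if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
                problems.append(f"entry {i}: '{field}' should be a list of strings")
    return problems


def check_golden_file(path: str) -> List[str]:
    """Load a JSON golden file and check every entry in it."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        return ["top level: expected a JSON array of entries"]
    return check_golden_entries(data)


# In-memory sample so the sketch runs without a file on disk.
sample = [
    {"question": "What is the refund window for electronics?", "relevant_chunk_ids": ["a1b2c3d4"]},
    {"question": "Is free shipping available?"},
    {"question": "   "},                                          # empty question
    {"question": "Who pays return postage?", "relevant_chunk_ids": "a1b2c3d4"},  # not a list
]
sample_problems = check_golden_entries(sample)
for problem in sample_problems:
    print(problem)
```

Calling `check_golden_file("draft_golden.json")` would apply the same checks to a file on disk. A check like this only catches malformed entries; whether each question and expected answer is actually correct still needs a human reviewer.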