Benchmarks & Comparison
Concrete evidence over adjectives. Quality on a public QA set, end-to-end latency, peak memory, and dollar cost — measured with the same evaluation harness you can run on your own data. Reproduce everything with the commands below.
Benchmark setup
| Setting | Value |
|---|---|
| Corpus | MS MARCO v2.1 dev — 8,841 passages, 1,000 sampled questions |
| Hardware | AWS c7i.2xlarge (8 vCPU, 16 GB RAM), no GPU |
| Embedder | BAAI/bge-small-en-v1.5 (384-d) for all frameworks |
| Reranker (when on) | BAAI/bge-reranker-base, top-20 → top-5 |
| LLM (generation) | gpt-4o-mini at temperature 0 |
| Versions | ragforge 0.4 · langchain 0.3.7 · llama-index 0.12.5 · haystack 2.8.0 |
Every framework uses the same chunks, embedder, top-k and reranker. Differences come from pipeline orchestration and retrieval strategy, not model choice. Source: benchmarks/.
Retrieval quality
Hybrid (BM25 + dense) retrieval, top-k = 5, with reranking enabled.
| Framework | Hit@5 | MRR@10 | nDCG@10 | Faithfulness* |
|---|---|---|---|---|
| RAGForge | 0.842 | 0.671 | 0.704 | 0.91 |
| LangChain (EnsembleRetriever) | 0.798 | 0.612 | 0.658 | 0.86 |
| LlamaIndex (QueryFusion) | 0.811 | 0.629 | 0.671 | 0.88 |
| Haystack (Hybrid pipeline) | 0.804 | 0.621 | 0.664 | 0.87 |
* LLM-judge faithfulness on 200 generated answers — gpt-4o as judge. Higher is better on all four metrics.
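For readers who want to sanity-check the ranking columns, here is a minimal sketch of Hit@k, MRR@k and nDCG@k under binary relevance. The function names are illustrative and are not taken from the harness in benchmarks/; the sketch assumes each question comes with a set of relevant passage ids and that a ranked list contains no duplicate ids.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant id appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def mrr_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant id in the top k (0.0 if none)."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# One question whose only relevant passage is ranked second.
ranked_ids = ["p7", "p2", "p9"]
gold = {"p2"}
scores = (hit_at_k(ranked_ids, gold, 5), mrr_at_k(ranked_ids, gold, 10), ndcg_at_k(ranked_ids, gold, 10))
```

The per-question scores are then averaged over the question set to get the table-level number.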
End-to-end latency
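Single query, warm cache, p50 / p95 over 1,000 runs. Times include retrieval + rerank + prompt assembly; generation latency is excluded because the LLM call is identical across frameworks. The Relative column is each framework's p50 divided by RAGForge's p50.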
| Framework | p50 (ms) | p95 (ms) | Relative |
|---|---|---|---|
| RAGForge | 47 | 92 | 1.0× |
| LlamaIndex | 78 | 164 | 1.66× |
| Haystack | 91 | 198 | 1.94× |
| LangChain | 118 | 241 | 2.51× |
RAGForge wins on latency because the registry resolves components once at KB load time — retrievers, rerankers and stores are plain Python objects after that, with no per-query dict lookups, no LCEL graph walk, and no per-call schema validation.
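A rough sketch of how p50 / p95 and the relative slowdown can be computed for any pipeline callable. `measure_latency_ms`, `percentile` and `relative_slowdown` are hypothetical helpers, not the contents of run_latency.py, and nearest-rank is only one of several percentile conventions, so the benchmark script may differ in detail.

```python
import math
import time
from typing import Callable, List, Sequence


def measure_latency_ms(fn: Callable[[], object], runs: int = 1000, warmup: int = 10) -> List[float]:
    """Call fn repeatedly and return per-call wall-clock times in milliseconds."""
    for _ in range(warmup):  # warm caches before timing
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def relative_slowdown(p50_ms: float, baseline_p50_ms: float) -> float:
    """Ratio of a framework's p50 to the baseline p50, rounded to two decimals."""
    return round(p50_ms / baseline_p50_ms, 2)


# Placeholder workload; swap in a real retrieve + rerank + prompt-assembly call.
samples = measure_latency_ms(lambda: sum(range(1000)), runs=50)
p50, p95 = percentile(samples, 50), percentile(samples, 95)
```

Applied to the p50 column above, `relative_slowdown(78, 47)` gives 1.66, matching the LlamaIndex row.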
Memory usage
Resident set size after loading the 8,841-chunk index and answering 100 queries.
| Framework | Peak RSS | Index on disk |
|---|---|---|
| RAGForge | 412 MB | 28 MB |
| LlamaIndex | 684 MB | 41 MB |
| Haystack | 812 MB | 39 MB |
| LangChain + FAISS | 597 MB | 33 MB |
The default InMemoryStore uses a single contiguous float32 matrix and a flat vectors.json on disk — no Arrow tables, no per-chunk Python objects held in a separate doc store.
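To make the "single contiguous float32 matrix" idea concrete, here is a standard-library sketch. `FlatStore` and its JSON layout are hypothetical and are not RAGForge's InMemoryStore or its actual vectors.json schema. By arithmetic alone, 8,841 vectors × 384 dimensions × 4 bytes is roughly 13.6 MB of raw float32 data.

```python
import json
import math
from array import array
from typing import List, Sequence, Tuple


class FlatStore:
    """Row-major float32 matrix in one buffer, plus a parallel list of chunk ids."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.ids: List[str] = []
        self.data = array("f")  # n * dim float32 values, stored contiguously

    def add(self, chunk_id: str, vector: Sequence[float]) -> None:
        if len(vector) != self.dim:
            raise ValueError("dimension mismatch")
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0
        self.ids.append(chunk_id)
        self.data.extend(x / norm for x in vector)  # keep unit-length rows

    def search(self, query: Sequence[float], k: int = 5) -> List[Tuple[str, float]]:
        """Brute-force cosine search: dot product against every stored row."""
        norm = math.sqrt(sum(x * x for x in query)) or 1.0
        q = [x / norm for x in query]
        scored = []
        for row, chunk_id in enumerate(self.ids):
            base = row * self.dim
            score = sum(q[j] * self.data[base + j] for j in range(self.dim))
            scored.append((chunk_id, score))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]

    def raw_bytes(self) -> int:
        return len(self.data) * self.data.itemsize

    def to_json(self) -> str:
        # Hypothetical flat layout: one list of floats, no per-chunk objects.
        return json.dumps({"dim": self.dim, "ids": self.ids, "vectors": self.data.tolist()})


store = FlatStore(dim=2)
store.add("a", [1.0, 0.0])
store.add("b", [0.0, 1.0])
top = store.search([1.0, 0.1], k=1)  # nearest chunk is "a"
```

A NumPy matrix would do the same job with vectorised dot products; the `array` module is used here only to keep the sketch dependency-free.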
Multi-agent cost reduction
On a 50-question complex-research workload (multi-hop questions over a 200-document corpus), the multi-agent coordination layer shares retrieved evidence on a blackboard so agents don't re-query and re-read the same chunks.
| Configuration | LLM tokens | Retrievals | Total cost | Δ vs naïve |
|---|---|---|---|---|
| Naïve sequential agents | 1,284,300 | 612 | $0.71 | — |
| LangGraph (shared state) | 892,100 | 418 | $0.49 | −31% |
| RAGForge blackboard | 541,800 | 237 | $0.30 | −58% |
Savings come from stigmergic deduplication: every retrieval result is hashed and posted to the blackboard. Before an agent issues a query, it checks the board for a semantically equivalent prior result (cosine ≥ 0.94 on the query embedding). Cache hits skip the retriever entirely and reuse the cited chunks.
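The check-before-query flow can be sketched as follows. This is an illustration under stated assumptions, not the code in blackboard.py: `Trace`, `Blackboard`, `lookup`, `post` and `retrieve_with_dedup` are hypothetical names, the board is scanned linearly, and trace decay is left out entirely. Only the 0.94 cosine threshold comes from the text.

```python
import hashlib
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Trace:
    query_embedding: List[float]
    chunks: Dict[str, str]  # content hash -> chunk text


@dataclass
class Blackboard:
    threshold: float = 0.94  # similarity cutoff quoted in the text
    traces: List[Trace] = field(default_factory=list)

    def lookup(self, query_embedding: Sequence[float]) -> Optional[Trace]:
        """Return the most similar prior trace at or above the threshold."""
        best, best_sim = None, self.threshold
        for trace in self.traces:
            sim = cosine(trace.query_embedding, query_embedding)
            if sim >= best_sim:
                best, best_sim = trace, sim
        return best

    def post(self, query_embedding: Sequence[float], chunks: Sequence[str]) -> Trace:
        hashed = {hashlib.sha256(c.encode("utf-8")).hexdigest(): c for c in chunks}
        trace = Trace(list(query_embedding), hashed)
        self.traces.append(trace)
        return trace


def retrieve_with_dedup(
    board: Blackboard,
    query_embedding: Sequence[float],
    retriever: Callable[[Sequence[float]], List[str]],
) -> Tuple[List[str], bool]:
    """Return (chunks, cache_hit); the retriever runs only on a miss."""
    hit = board.lookup(query_embedding)
    if hit is not None:
        return list(hit.chunks.values()), True
    chunks = retriever(query_embedding)
    board.post(query_embedding, chunks)
    return list(chunks), False
```

Each hit saves one retrieval call plus the tokens an agent would spend re-reading the same chunks, which is where a drop in both the Retrievals and LLM tokens columns would come from.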
Why choose RAGForge over LangChain / LlamaIndex / Haystack?
| Capability | RAGForge | LangChain | LlamaIndex | Haystack |
|---|---|---|---|---|
| Zero required deps (core install) | ✅ | ❌ | ❌ | ❌ |
| Built-in evaluation + golden sets | ✅ | Partial (langsmith) | Partial | ✅ |
| Safe embedder migration with validation | ✅ | ❌ | ❌ | ❌ |
| Multi-agent blackboard w/ dedup | ✅ | Manual | Manual | ❌ |
| A/B compare two KBs out of the box | ✅ | ❌ | ❌ | ❌ |
| Pluggable registry (no class hierarchy) | ✅ | ❌ | Partial | Partial |
| CLI + HTTP + Python parity | ✅ | Python only | Python only | Python + REST |
| Single-file vector store (no daemon) | ✅ | ❌ | Partial | ❌ |
Pick RAGForge if you…
- Need to prove a change helped — eval and A/B compare are first-class, not a separate SaaS.
- Ship in environments where you can't run a vector DB daemon (edge boxes, CI, notebooks).
- Have to swap embedding models in production without a re-index outage.
- Want one mental model across CLI, library, and HTTP — same components, same names.
Architecture at a glance
Single pipeline. Pluggable at every stage. The dashed boxes are optional.
      INPUTS                      PIPELINE                      OUTPUTS
┌────────────────┐    ┌──────────────────────────────┐    ┌──────────────────┐
│ PDF / MD /     │    │ Parser  →  Chunker           │    │ Ranked chunks    │
│ HTML / DOCX    │──▶ │    │          │              │ ─▶ │ + citations      │
│ URL / Folder   │    │    ▼          ▼              │    │                  │
└────────────────┘    │ Document   Chunk[]           │    │ Grounded answer  │
                      │               │              │    │ (optional LLM)   │
┌────────────────┐    │               ▼              │    │                  │
│ Question       │──▶ │   Embedder → Vector Store    │ ─▶ │ Eval report      │
│ (text)         │    │               │              │    │ (optional)       │
└────────────────┘    │               ▼              │    └──────────────────┘
                      │  Retriever (BM25 │ Dense │   │
                      │             Hybrid)          │
                      │               │              │
                      │               ▼              │
                      │  Reranker → Prompt Builder   │
                      │               │              │
                      │         ┌─────┴─────┐        │
                      │         ▼           ▼        │
                      │     ┌ ─ ─ ─ ┐  ┌ ─ ─ ─ ─ ┐   │
                      │        LLM      Evaluator    │
                      │     └ ─ ─ ─ ┘  └ ─ ─ ─ ─ ┘   │
                      └──────────────────────────────┘
                                     ▲
                                     │
                       ┌─────────────┴──────────────┐
                       │ Registry (plugin lookup)   │
                       │ Blackboard (agent memory)  │
                       └────────────────────────────┘

Every box in the pipeline is a registry entry. Swap retriever="hybrid" for retriever="dense", or register a custom one with @register("retriever", "mine"). The Registry and Blackboard are the only two cross-cutting primitives.
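To show what a decorator-based registry of this kind can look like, here is a self-contained sketch. Only the `@register("retriever", "mine")` call shape comes from the text; the `_REGISTRY` dict, the `resolve` helper and `KeywordRetriever` are hypothetical and do not describe RAGForge's internals.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (kind, name) -> component class or factory
_REGISTRY: Dict[Tuple[str, str], Callable] = {}


def register(kind: str, name: str) -> Callable:
    """Class decorator that files a component under (kind, name)."""
    def decorator(obj: Callable) -> Callable:
        key = (kind, name)
        if key in _REGISTRY:
            raise ValueError(f"{kind} {name!r} is already registered")
        _REGISTRY[key] = obj
        return obj  # the class itself is returned unchanged
    return decorator


def resolve(kind: str, name: str) -> Callable:
    """Look a component up once; callers keep the plain object afterwards."""
    try:
        return _REGISTRY[(kind, name)]
    except KeyError:
        raise KeyError(f"no {kind} registered as {name!r}") from None


@register("retriever", "mine")
class KeywordRetriever:
    """Toy retriever: ranks chunks by how many query terms they share."""

    def __init__(self, chunks: Sequence[str]) -> None:
        self.chunks = list(chunks)

    def retrieve(self, query: str, k: int = 5) -> List[str]:
        terms = set(query.lower().split())
        ranked = sorted(
            self.chunks,
            key=lambda chunk: len(terms & set(chunk.lower().split())),
            reverse=True,
        )
        return ranked[:k]


retriever_cls = resolve("retriever", "mine")
retriever = retriever_cls(["install with pip", "run the cli", "pip cache tips"])
results = retriever.retrieve("pip install", k=2)
```

Resolving by string name once and then holding an ordinary object is the pattern the latency section credits for avoiding per-query lookups.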
What's actually new
Deliberately scoped — these are the parts we'd point a reviewer at. Implementations live in the linked source.
1. Shadow-index migration with validation
Most frameworks treat embedder swaps as "drop the index and re-ingest." RAGForge builds a shadow store with the new model, probes both stores with N real chunks, measures top-k overlap, and only cuts over after writing a backup. The quality delta is returned in the response — see migrator.py. We are not aware of another open-source RAG framework that does this in-process.
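The validation step can be sketched as a top-k overlap check between the live store and the shadow store. This is an assumption-laden illustration rather than the logic in migrator.py: `topk_overlap`, `validate_shadow`, the search-function signature and the 0.6 cutoff are all hypothetical, and the backup write is not modelled.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical search signature: (probe text, k) -> ranked chunk ids
SearchFn = Callable[[str, int], List[str]]


def topk_overlap(old_ids: Sequence[str], new_ids: Sequence[str], k: int) -> float:
    """Fraction of the top-k ids that both stores return for one probe."""
    return len(set(old_ids[:k]) & set(new_ids[:k])) / k


def validate_shadow(
    old_search: SearchFn,
    new_search: SearchFn,
    probes: Sequence[str],
    k: int = 5,
    min_overlap: float = 0.6,  # illustrative cutoff, not a RAGForge default
) -> Dict[str, object]:
    """Probe both stores and report whether the shadow store may go live."""
    if not probes:
        raise ValueError("need at least one probe")
    overlaps = [topk_overlap(old_search(p, k), new_search(p, k), k) for p in probes]
    mean = sum(overlaps) / len(overlaps)
    return {"mean_overlap": mean, "cut_over": mean >= min_overlap}


# Stand-in stores backed by fixed rankings.
old_rank = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
new_rank = {"q1": ["a", "c", "x"], "q2": ["d", "e", "f"]}
report = validate_shadow(
    lambda p, k: old_rank[p][:k],
    lambda p, k: new_rank[p][:k],
    probes=["q1", "q2"],
    k=3,
)
```

A different embedder will legitimately reorder neighbours, so a low overlap is a signal to inspect the change, not proof that the new model is worse.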
2. Stigmergic blackboard for multi-agent retrieval
Inspired by ant-colony stigmergy: agents leave structured "pheromone" traces (query embedding, returned chunk hashes, score, timestamp) on a shared board. New retrieval calls first check the board for a near-duplicate query (cosine ≥ τ, default 0.94) and reuse the cited evidence when found. This is what produces the 58% cost reduction above. The exact decay and similarity threshold tuning is intentionally not described here. Entry point: blackboard.py.
3. Dual-judge evaluation with disagreement gating
Faithfulness and answer relevance are scored by two independent LLM judges. When they disagree by more than a configured margin, the example is flagged for human review instead of silently averaged. This surfaces regressions from prompt tweaks that an averaged score would otherwise hide. See judges.py.
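The gating rule itself is small enough to sketch. The names `Verdict` and `gate` and the 0.2 margin are hypothetical, chosen only for illustration; judges.py may structure this differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    score: Optional[float]  # None when the example is held for review
    needs_review: bool


def gate(score_a: float, score_b: float, margin: float = 0.2) -> Verdict:
    """Average two judge scores unless they differ by more than the margin."""
    if abs(score_a - score_b) > margin:
        return Verdict(score=None, needs_review=True)
    return Verdict(score=(score_a + score_b) / 2, needs_review=False)


agree = gate(0.9, 0.8)     # averaged, since the judges are within the margin
disagree = gate(0.9, 0.5)  # flagged, since the judges are 0.4 apart
```

Returning no score for flagged examples, instead of a mean, keeps a disputed answer from quietly moving the aggregate metric.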
4. Structure-aware chunking with section inheritance
The default chunker walks heading levels and attaches the full breadcrumb (H1 ▸ H2 ▸ H3) to each chunk's metadata; the text it embeds is the chunk prefixed with that breadcrumb. On the MS MARCO subset this alone contributes +3.1 points of nDCG@10 over naïve fixed-window chunking with an identical embedder and retriever.
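A minimal sketch of breadcrumb inheritance for Markdown input. It is not RAGForge's default chunker: `chunk_with_breadcrumbs` and the `embed_text` field are hypothetical, only ATX headings (`#` to `######`) are recognised, and `#` lines inside fenced code would be misread as headings.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_with_breadcrumbs(markdown: str) -> List[Dict[str, str]]:
    """Split on headings; each chunk inherits the heading path above it."""
    trail: List[str] = []  # current heading path, outermost first
    body: List[str] = []   # lines collected since the last heading
    chunks: List[Dict[str, str]] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            crumb = " ▸ ".join(trail)
            chunks.append({
                "breadcrumb": crumb,
                "text": text,
                # What would be sent to the embedder: breadcrumb, then text.
                "embed_text": f"{crumb}\n{text}" if crumb else text,
            })
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del trail[level - 1:]  # drop headings at this depth or deeper
            trail.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Guide\nintro\n## Install\npip\n### Linux\napt\n## Use\nrun"
crumbs = [c["breadcrumb"] for c in chunk_with_breadcrumbs(doc)]
```

For the sample document, the chunk containing "apt" carries the breadcrumb "Guide ▸ Install ▸ Linux", so a query about installing on Linux can match it even though the chunk text never mentions either word.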
Reproduce
git clone https://github.com/samsuljahith/RagForge && cd RagForge
pip install -e ".[all]"
cd benchmarks
python download_msmarco_subset.py
python run_quality.py --frameworks ragforge,langchain,llamaindex,haystack
python run_latency.py --frameworks ragforge,langchain,llamaindex,haystack --runs 1000
python run_memory.py --frameworks ragforge,langchain,llamaindex,haystack
python run_agent_cost.py --scenarios sequential,langgraph,ragforge

Numbers in the tables above are the medians from the last run committed in benchmarks/results/. Re-run on your hardware; PRs with new data points welcome.