Benchmarks & Comparison

Concrete evidence over adjectives. Quality on a public QA set, end-to-end latency, peak memory, and dollar cost — measured with the same evaluation harness you can run on your own data. Reproduce everything with the commands below.

Benchmark setup

Setting	Value
Corpus	MS MARCO v2.1 dev — 8,841 passages, 1,000 sampled questions
Hardware	AWS c7i.2xlarge (8 vCPU, 16 GB RAM), no GPU
Embedder	BAAI/bge-small-en-v1.5 (384-d) for all frameworks
Reranker (when on)	BAAI/bge-reranker-base, top-20 → top-5
LLM (generation)	gpt-4o-mini at temperature 0
Versions	ragforge 0.4 · langchain 0.3.7 · llama-index 0.12.5 · haystack 2.8.0

Every framework uses the same chunks, embedder, top-k and reranker. Differences come from pipeline orchestration and retrieval strategy, not model choice. Source: benchmarks/.

Retrieval quality

Hybrid (BM25 + dense) retrieval, top-k = 5, with reranking enabled.

Framework	Hit@5	MRR@10	nDCG@10	Faithfulness*
RAGForge	0.842	0.671	0.704	0.91
LangChain (EnsembleRetriever)	0.798	0.612	0.658	0.86
LlamaIndex (QueryFusion)	0.811	0.629	0.671	0.88
Haystack (Hybrid pipeline)	0.804	0.621	0.664	0.87

* LLM-judge faithfulness on 200 generated answers — gpt-4o as judge. Higher is better on all four metrics.
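For readers unfamiliar with the three ranking metrics in the table, here is a minimal sketch of how they are commonly computed with binary relevance. The function names and the toy ranking are illustrative assumptions; the harness in benchmarks/ may differ in details such as graded relevance or tie handling.

```python
import math

def hit_at_k(ranked_ids, relevant_ids, k=5):
    """1.0 if any relevant id appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant id within the top k (0.0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Toy example (made-up ids): the single relevant passage is ranked second.
ranked = ["p7", "p2", "p9", "p4", "p1"]
relevant = {"p2"}
hit = hit_at_k(ranked, relevant)
mrr = mrr_at_k(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant)
print(hit, mrr, round(ndcg, 3))
```

The per-question values are averaged over the question set to get table-style numbers.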

End-to-end latency

Single query, warm cache, p50 / p95 over 1,000 runs. Times include retrieval + rerank + prompt assembly. Generation latency excluded (identical LLM call).

Framework	p50 (ms)	p95 (ms)	Relative
RAGForge	47	92	1.0×
LlamaIndex	78	164	1.66×
Haystack	91	198	1.94×
LangChain	118	241	2.51×

RAGForge wins on latency because the registry resolves components once at KB load time — retrievers, rerankers and stores are plain Python objects after that, with no per-query dict lookups, no LCEL graph walk, and no per-call schema validation.
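A minimal sketch of how p50 / p95 figures like those in the table can be collected. This is not the code in run_latency.py; `measure_latency_ms` and `fake_pipeline` are hypothetical names, and the stand-in pipeline only burns a little CPU in place of retrieval, rerank, and prompt assembly.

```python
import statistics
import time

def measure_latency_ms(run_query, queries, warmup=10):
    """Time each call with a monotonic clock; return (p50, p95) in milliseconds."""
    for question in queries[:warmup]:      # warm caches before timing starts
        run_query(question)
    samples = []
    for question in queries:
        start = time.perf_counter()
        run_query(question)
        samples.append((time.perf_counter() - start) * 1000.0)
    # n=100 yields 99 cut points; index 49 is the median, index 94 the 95th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[49], cuts[94]

def fake_pipeline(question):
    # Stand-in for retrieval + rerank + prompt assembly (assumption for the demo).
    return sorted(question * 50)

p50, p95 = measure_latency_ms(fake_pipeline, ["what is rag?"] * 200)
print(f"p50={p50:.3f} ms  p95={p95:.3f} ms")
```

The Relative column is each framework's p50 divided by the RAGForge p50, for example 78 / 47 ≈ 1.66.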

Memory usage

Peak resident set size (RSS), measured after loading the 8,841-chunk index and answering 100 queries.

Framework	Peak RSS	Index on disk
RAGForge	412 MB	28 MB
LlamaIndex	684 MB	41 MB
Haystack	812 MB	39 MB
LangChain + FAISS	597 MB	33 MB

The default InMemoryStore uses a single contiguous float32 matrix and a flat vectors.json on disk — no Arrow tables, no per-chunk Python objects held in a separate doc store.
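To put the layout in perspective: 8,841 vectors × 384 dimensions × 4 bytes is about 13.6 MB of raw float32 data. The sketch below illustrates the "one contiguous buffer, no per-chunk objects" idea using only the standard library. `FlatStore` is a hypothetical name and a brute-force dot-product scan, not the actual InMemoryStore implementation.

```python
from array import array

DIM = 384          # embedding width of bge-small-en-v1.5 (from the setup table)
N_CHUNKS = 8_841   # chunks in the benchmark index

# Raw size of the vectors alone: rows * columns * 4 bytes per float32.
raw_bytes = N_CHUNKS * DIM * 4
print(f"{raw_bytes / 1e6:.1f} MB")

class FlatStore:
    """Hypothetical row-major store: one flat float32 buffer for all vectors."""

    def __init__(self, dim):
        self.dim = dim
        self.data = array("f")   # single contiguous buffer, 4 bytes per value

    def add(self, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} values, got {len(vector)}")
        self.data.extend(vector)

    def __len__(self):
        return len(self.data) // self.dim

    def search(self, query, k=5):
        """Brute-force dot-product scan; returns row indices, best first."""
        scores = []
        for row in range(len(self)):
            offset = row * self.dim
            score = sum(q * self.data[offset + i] for i, q in enumerate(query))
            scores.append((score, row))
        scores.sort(reverse=True)
        return [row for _, row in scores[:k]]

store = FlatStore(dim=3)
for vec in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]):
    store.add(vec)
top = store.search([1.0, 0.0, 0.0], k=2)
print(top)
```

Row `i` lives at offset `i * dim`, so chunk metadata can be kept in a parallel list keyed by the same index.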

Multi-agent cost reduction

On a 50-question complex-research workload (multi-hop questions over a 200-document corpus), the multi-agent coordination layer shares retrieved evidence on a blackboard so agents don't re-query and re-read the same chunks.

Configuration	LLM tokens	Retrievals	Total cost	Δ vs naïve
Naïve sequential agents	1,284,300	612	$0.71	—
LangGraph (shared state)	892,100	418	$0.49	−31%
RAGForge blackboard	541,800	237	$0.30	−58%

Savings come from stigmergic deduplication: every retrieval result is hashed and posted to the blackboard. Before an agent issues a query, it checks the board for a semantically equivalent prior result (cosine ≥ 0.94 on the query embedding). Cache hits skip the retriever entirely and reuse the cited chunks.
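The check-before-query flow can be sketched as follows. Only the 0.94 cosine threshold comes from the description above; the `Blackboard` class, its `lookup` / `post` methods, the SHA-256 hashing, and the toy two-dimensional embeddings are assumptions for illustration and do not reflect the API in blackboard.py.

```python
import hashlib
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class Blackboard:
    """Hypothetical shared board: query embeddings plus the chunk hashes they returned."""

    def __init__(self, threshold=0.94):
        self.threshold = threshold
        self.entries = []   # list of (query_embedding, chunk_hashes)

    def lookup(self, query_embedding):
        """Return the hashes of the most similar prior query at or above the threshold."""
        best, best_sim = None, self.threshold
        for embedding, hashes in self.entries:
            sim = cosine(query_embedding, embedding)
            if sim >= best_sim:
                best, best_sim = hashes, sim
        return best

    def post(self, query_embedding, chunks):
        hashes = [hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks]
        self.entries.append((query_embedding, hashes))
        return hashes

def retrieve_with_dedup(board, query_embedding, retriever):
    """Return (chunk_hashes, was_board_hit); only call the retriever on a miss."""
    cached = board.lookup(query_embedding)
    if cached is not None:
        return cached, True
    return board.post(query_embedding, retriever(query_embedding)), False

calls = []
def fake_retriever(embedding):
    calls.append(embedding)   # count real retrievals
    return ["chunk about ants", "chunk about bees"]

board = Blackboard()
first, hit1 = retrieve_with_dedup(board, [1.0, 0.0], fake_retriever)
second, hit2 = retrieve_with_dedup(board, [0.99, 0.05], fake_retriever)  # near-duplicate
third, hit3 = retrieve_with_dedup(board, [0.0, 1.0], fake_retriever)     # unrelated
print(hit1, hit2, hit3, len(calls))
```

The linear scan over entries is fine for a sketch; a real board with many entries would likely need an index over the stored query embeddings.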

Why choose RAGForge over LangChain / LlamaIndex / Haystack?

Capability	RAGForge	LangChain	LlamaIndex	Haystack
Zero required deps (core install)	✅	❌	❌	❌
Built-in evaluation + golden sets	✅	Partial (langsmith)	Partial	✅
Safe embedder migration with validation	✅	❌	❌	❌
Multi-agent blackboard w/ dedup	✅	Manual	Manual	❌
A/B compare two KBs out of the box	✅	❌	❌	❌
Pluggable registry (no class hierarchy)	✅	❌	Partial	Partial
CLI + HTTP + Python parity	✅	Python only	Python only	Python + REST
Single-file vector store (no daemon)	✅	❌	Partial	❌

Pick RAGForge if you…

Need to prove a change helped — eval and A/B compare are first-class, not a separate SaaS.
Ship in environments where you can't run a vector DB daemon (edge boxes, CI, notebooks).
Have to swap embedding models in production without a re-index outage.
Want one mental model across CLI, library, and HTTP — same components, same names.

Architecture at a glance

Single pipeline. Pluggable at every stage. The dashed boxes are optional.

architecture

        INPUTS                       PIPELINE                          OUTPUTS
   ┌────────────────┐    ┌──────────────────────────────┐    ┌──────────────────┐
   │  PDF / MD /    │    │   Parser  →  Chunker         │    │   Ranked chunks  │
   │  HTML / DOCX   │──▶ │     │           │            │ ─▶ │   + citations    │
   │  URL / Folder  │    │     ▼           ▼            │    │                  │
   └────────────────┘    │  Document    Chunk[]         │    │  Grounded answer │
                         │                 │            │    │  (optional LLM)  │
   ┌────────────────┐    │                 ▼            │    │                  │
   │  Question      │──▶ │  Embedder  →  Vector Store   │ ─▶ │  Eval report     │
   │  (text)        │    │                 │            │    │  (optional)      │
   └────────────────┘    │                 ▼            │    └──────────────────┘
                         │  Retriever (BM25 │ Dense │   │
                         │             Hybrid)          │
                         │                 │            │
                         │                 ▼            │
                         │  Reranker  →  Prompt Builder │
                         │                 │            │
                         │           ┌─────┴─────┐      │
                         │           ▼           ▼      │
                         │      ┌ ─ ─ ─ ┐  ┌ ─ ─ ─ ─ ┐  │
                         │       LLM        Evaluator   │
                         │      └ ─ ─ ─ ┘  └ ─ ─ ─ ─ ┘  │
                         └──────────────────────────────┘
                                       ▲
                                       │
                         ┌─────────────┴──────────────┐
                         │  Registry  (plugin lookup) │
                         │  Blackboard (agent memory) │
                         └────────────────────────────┘

Every box in the pipeline is a registry entry. Swap retriever="hybrid" for retriever="dense", or register a custom one with @register("retriever", "mine"). The Registry and Blackboard are the only two cross-cutting primitives.
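A minimal sketch of what a decorator-based registry of this shape can look like. The `@register("retriever", "mine")` call form is taken from the text above; the `_REGISTRY` dict, the `resolve` helper, and the toy `KeywordRetriever` are hypothetical and are not RAGForge's actual internals.

```python
_REGISTRY = {}   # (kind, name) -> factory

def register(kind, name):
    """Decorator that files a class or factory under (kind, name)."""
    def decorator(factory):
        _REGISTRY[(kind, name)] = factory
        return factory
    return decorator

def resolve(kind, name, **kwargs):
    """Look a component up once and return a plain Python object."""
    try:
        factory = _REGISTRY[(kind, name)]
    except KeyError:
        raise LookupError(f"no {kind} registered as {name!r}") from None
    return factory(**kwargs)

@register("retriever", "mine")
class KeywordRetriever:
    """Toy retriever: ranks chunks by the number of words shared with the question."""

    def __init__(self, top_k=5):
        self.top_k = top_k

    def retrieve(self, question, chunks):
        words = set(question.lower().split())
        ranked = sorted(chunks, key=lambda c: -len(words & set(c.lower().split())))
        return ranked[: self.top_k]

retriever = resolve("retriever", "mine", top_k=1)   # resolved once, reused per query
result = retriever.retrieve("hybrid retrieval", ["dense only", "hybrid retrieval wins"])
print(result)
```

Because `resolve` runs once and hands back an ordinary object, later queries call `retriever.retrieve(...)` directly with no registry lookup on the hot path.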

What's actually new

Deliberately scoped — these are the parts we'd point a reviewer at. Implementations live in the linked source.

1. Shadow-index migration with validation

Most frameworks treat embedder swaps as "drop the index and re-ingest." RAGForge builds a shadow store with the new model, probes both stores with N real chunks, measures top-k overlap, and only cuts over after writing a backup. The quality delta is returned in the response — see migrator.py. We are not aware of another open-source RAG framework that does this in-process.
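The validation step can be sketched as below, under stated assumptions: stores are plain dicts of chunk id to vector, each probe chunk is looked up in each store by its own vector, and the 0.6 overlap gate is an invented default. None of these names or values come from migrator.py.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_ids(store, query, k):
    """Ids of the k chunks with the highest dot product against the query."""
    return sorted(store, key=lambda cid: dot(query, store[cid]), reverse=True)[:k]

def mean_overlap(old_store, new_store, probe_ids, k):
    """Average fraction of top-k neighbours shared by the old and shadow stores."""
    total = 0.0
    for cid in probe_ids:
        old_hits = set(top_k_ids(old_store, old_store[cid], k))
        new_hits = set(top_k_ids(new_store, new_store[cid], k))
        total += len(old_hits & new_hits) / k
    return total / len(probe_ids)

def migrate(old_store, shadow_store, probe_ids, k=2, min_overlap=0.6):
    """Cut over only if the shadow store agrees enough; keep a backup when it does."""
    overlap = mean_overlap(old_store, shadow_store, probe_ids, k)
    if overlap < min_overlap:
        return {"cut_over": False, "overlap": overlap, "active": old_store}
    backup = dict(old_store)   # stand-in for writing a backup before the switch
    return {"cut_over": True, "overlap": overlap, "active": shadow_store, "backup": backup}

# Toy data: the old model is 2-d, the new model is 3-d.
old = {"a": [1.0, 0.0], "b": [0.9, 0.3], "c": [0.0, 1.0], "d": [0.2, 0.9]}
good_shadow = {"a": [1.0, 0.0, 0.1], "b": [0.8, 0.2, 0.1],
               "c": [0.0, 1.0, 0.1], "d": [0.1, 0.9, 0.2]}
bad_shadow = {"a": [1.0, 0.0, 0.1], "b": [0.0, 0.9, 0.3],
              "c": [0.0, 1.0, 0.1], "d": [0.9, 0.1, 0.2]}

good = migrate(old, good_shadow, probe_ids=["a", "c"])
bad = migrate(old, bad_shadow, probe_ids=["a", "c"])
print(good["cut_over"], good["overlap"], bad["cut_over"], bad["overlap"])
```

Returning the overlap alongside the decision mirrors the idea that the quality delta is part of the response, so a caller can log or alert on it.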

2. Stigmergic blackboard for multi-agent retrieval

Inspired by ant-colony stigmergy: agents leave structured "pheromone" traces (query embedding, returned chunk hashes, score, timestamp) on a shared board. New retrieval calls first check the board for a near-duplicate query (cosine ≥ τ, default 0.94) and reuse the cited evidence when found. This is what produces the 58% cost reduction above. The exact decay and similarity threshold tuning is intentionally not described here. Entry point: blackboard.py.

3. Dual-judge evaluation with disagreement gating

Faithfulness and answer-relevance are scored by two independent LLM judges. When they disagree by more than a configured margin, the example is flagged for human review instead of silently averaged. The aim is to surface regressions from prompt tweaks that an averaged score would hide. See judges.py.
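A minimal sketch of disagreement gating. The judges are stubbed as plain functions returning a score in [0, 1], and the 0.2 margin and the `gate` name are assumptions; judges.py may structure this differently.

```python
def gate(examples, judge_a, judge_b, margin=0.2):
    """Average agreeing scores; route disagreements to a human-review list."""
    scored, flagged = [], []
    for example in examples:
        a, b = judge_a(example), judge_b(example)
        if abs(a - b) > margin:
            flagged.append({"example": example, "scores": (a, b)})
        else:
            scored.append({"example": example, "score": (a + b) / 2})
    return scored, flagged

# Stub judges: fixed scores per example id (made-up values for the demo).
scores_a = {"q1": 0.9, "q2": 0.9}
scores_b = {"q1": 0.8, "q2": 0.5}

scored, flagged = gate(["q1", "q2"], scores_a.get, scores_b.get)
print([s["example"] for s in scored], [f["example"] for f in flagged])
```

Here q1 is averaged because the judges are within the margin, while q2 is held back with both raw scores attached for the reviewer.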

4. Structure-aware chunking with section inheritance

The default chunker walks heading levels, attaches the full breadcrumb (H1 ▸ H2 ▸ H3) to each chunk's metadata, and embeds the chunk text prefixed with that breadcrumb. On the MS MARCO subset this alone contributes +3.1 points of nDCG@10 (0.031 on the 0 to 1 scale used in the quality table) over naïve fixed-window chunking with identical embedder and retriever.
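A minimal sketch of section inheritance, assuming Markdown-style `#` headings as input. The `chunk_with_breadcrumbs` function and the field names are hypothetical; the real chunker presumably also handles other formats and size limits, which this sketch ignores.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_with_breadcrumbs(markdown_text):
    """Split on headings; each chunk carries its heading trail as a breadcrumb."""
    trail = []    # current heading path, e.g. ["Guide", "Install"]
    body = []     # lines collected since the last heading
    chunks = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            crumb = " ▸ ".join(trail)
            chunks.append({
                "breadcrumb": crumb,                                     # metadata
                "text": text,                                            # original text
                "embed_text": f"{crumb}\n{text}" if crumb else text,     # what gets embedded
            })
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del trail[level - 1:]          # drop this level and anything deeper
            trail.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks

doc = "# Guide\nIntro.\n## Install\nRun pip.\n### Extras\nUse all.\n## Usage\nCall it."
chunks = chunk_with_breadcrumbs(doc)
for chunk in chunks:
    print(chunk["breadcrumb"], "|", chunk["text"])
```

Prefixing the breadcrumb gives short chunks such as "Run pip." the surrounding context ("Guide ▸ Install") that a fixed-window splitter would lose.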

Reproduce

bash

git clone https://github.com/samsuljahith/RagForge && cd RagForge
pip install -e ".[all]"
cd benchmarks
python download_msmarco_subset.py
python run_quality.py     --frameworks ragforge,langchain,llamaindex,haystack
python run_latency.py     --frameworks ragforge,langchain,llamaindex,haystack --runs 1000
python run_memory.py      --frameworks ragforge,langchain,llamaindex,haystack
python run_agent_cost.py  --scenarios sequential,langgraph,ragforge

Numbers in the tables above are the medians from the last run committed in benchmarks/results/. Re-run on your hardware; PRs with new data points welcome.
