Benchmarks & Comparison

Concrete evidence over adjectives: quality on a public QA set, end-to-end latency, peak memory, and dollar cost, all measured with the same evaluation harness you can run on your own data. Reproduce everything with the commands below.

Benchmark setup

| Setting | Value |
| --- | --- |
| Corpus | MS MARCO v2.1 dev: 8,841 passages, 1,000 sampled questions |
| Hardware | AWS c7i.2xlarge (8 vCPU, 16 GB RAM), no GPU |
| Embedder | BAAI/bge-small-en-v1.5 (384-d) for all frameworks |
| Reranker (when on) | BAAI/bge-reranker-base, top-20 → top-5 |
| LLM (generation) | gpt-4o-mini at temperature 0 |
| Versions | ragforge 0.4 · langchain 0.3.7 · llama-index 0.12.5 · haystack 2.8.0 |

Every framework uses the same chunks, embedder, top-k and reranker. Differences come from pipeline orchestration and retrieval strategy, not model choice. Source: benchmarks/.

Retrieval quality

Hybrid (BM25 + dense) retrieval, top-k = 5, with reranking enabled.

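| Framework | Hit@5 | MRR@10 | nDCG@10 | Faithfulness* |
| --- | --- | --- | --- | --- |
| RAGForge | 0.842 | 0.671 | 0.704 | 0.91 |
| LangChain (EnsembleRetriever) | 0.798 | 0.612 | 0.658 | 0.86 |
| LlamaIndex (QueryFusion) | 0.811 | 0.629 | 0.671 | 0.88 |
| Haystack (Hybrid pipeline) | 0.804 | 0.621 | 0.664 | 0.87 |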

* Faithfulness is scored by an LLM judge (gpt-4o) on 200 generated answers. Higher is better on all four metrics.
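For readers who want to sanity-check the three ranking metrics on their own data, here is a minimal sketch of how Hit@k, MRR@k and nDCG@k are commonly computed. It assumes binary relevance (a passage is either relevant or not) and unique passage IDs per ranking. The function names are illustrative and are not taken from the harness in benchmarks/.

```python
import math

def hit_at_k(ranked_ids, relevant_ids, k=5):
    """1.0 if any relevant passage appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage in the top k (0.0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG: DCG of this ranking divided by DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

def mean_metric(metric, runs, **kwargs):
    """Average a per-question metric over (ranked_ids, relevant_ids) pairs."""
    return sum(metric(ranked, relevant, **kwargs) for ranked, relevant in runs) / len(runs)

# Two toy questions: the first finds its passage at rank 2, the second misses.
runs = [(["p3", "p1", "p9"], {"p1"}), (["p4", "p5"], {"p7"})]
print(mean_metric(hit_at_k, runs, k=5), mean_metric(mrr_at_k, runs, k=10))
```

Each table cell is then the mean of one of these functions over the sampled questions.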

End-to-end latency

Single query, warm cache, p50 / p95 over 1,000 runs. Times cover retrieval, reranking, and prompt assembly; generation latency is excluded because the LLM call is identical across frameworks. Relative is each framework's p50 divided by RAGForge's p50.

| Framework | p50 (ms) | p95 (ms) | Relative |
| --- | --- | --- | --- |
| RAGForge | 47 | 92 | 1.0× |
| LlamaIndex | 78 | 164 | 1.66× |
| Haystack | 91 | 198 | 1.94× |
| LangChain | 118 | 241 | 2.51× |

RAGForge wins on latency because the registry resolves components once at KB load time — retrievers, rerankers and stores are plain Python objects after that, with no per-query dict lookups, no LCEL graph walk, and no per-call schema validation.
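The p50, p95 and Relative figures in the latency table can be reproduced in spirit with a small timing loop. The sketch below is an assumption about how such a measurement is typically done, not a copy of run_latency.py. `measure_latency_ms`, `p50_p95` and `relative` are illustrative names, and the percentile interpolation method is a choice that shifts p95 slightly.

```python
import statistics
import time

def measure_latency_ms(fn, runs=1000, warmup=10):
    """Call fn() `runs` times after a warm-up and return per-call latencies in ms."""
    for _ in range(warmup):
        fn()  # warm caches so the first cold call does not skew the tail
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples

def p50_p95(samples):
    """Median and 95th percentile of the samples (inclusive interpolation)."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return statistics.median(samples), cuts[94]

def relative(p50_ms, baseline_p50_ms):
    """Slowdown factor against the baseline framework, rounded like the table."""
    return round(p50_ms / baseline_p50_ms, 2)

# The Relative column follows from the p50 column with RAGForge (47 ms) as baseline.
print(relative(78, 47), relative(91, 47), relative(118, 47))
```

Replace `fn` with a closure that runs one full retrieve, rerank and prompt-assembly pass for the framework under test.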

Memory usage

Peak resident set size (RSS) after loading the 8,841-chunk index and answering 100 queries.

| Framework | Peak RSS | Index on disk |
| --- | --- | --- |
| RAGForge | 412 MB | 28 MB |
| LlamaIndex | 684 MB | 41 MB |
| Haystack | 812 MB | 39 MB |
| LangChain + FAISS | 597 MB | 33 MB |

The default InMemoryStore uses a single contiguous float32 matrix and a flat vectors.json on disk — no Arrow tables, no per-chunk Python objects held in a separate doc store.
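To make the "single contiguous float32 matrix" idea concrete, here is a stdlib-only sketch of a flat, row-major vector buffer with brute-force cosine search. `FlatVectorStore` is a hypothetical stand-in, not the InMemoryStore class, and a real implementation would more likely use NumPy. The point is only that one flat buffer plus a list of IDs is all the state needed.

```python
import math
from array import array

class FlatVectorStore:
    """Hypothetical store: all vectors live in one float32 buffer, row-major."""

    def __init__(self, dim):
        self.dim = dim
        self.data = array("f")  # contiguous C floats (4 bytes each on common platforms)
        self.ids = []           # row index -> chunk id

    def add(self, chunk_id, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} values, got {len(vector)}")
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0
        self.data.extend(x / norm for x in vector)  # store unit vectors
        self.ids.append(chunk_id)

    def nbytes(self):
        """Size of the raw vector buffer in bytes."""
        return len(self.data) * self.data.itemsize

    def search(self, query, k=5):
        """Brute-force cosine search; returns (chunk_id, score) pairs, best first."""
        norm = math.sqrt(sum(x * x for x in query)) or 1.0
        q = [x / norm for x in query]
        scored = []
        for row, chunk_id in enumerate(self.ids):
            offset = row * self.dim
            score = sum(q[i] * self.data[offset + i] for i in range(self.dim))
            scored.append((score, chunk_id))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(chunk_id, score) for score, chunk_id in scored[:k]]

store = FlatVectorStore(dim=3)
store.add("a", [1.0, 0.0, 0.0])
store.add("b", [0.0, 1.0, 0.0])
print(store.search([0.9, 0.1, 0.0], k=1)[0][0])
```

Under this layout the raw vectors for the benchmark corpus come to 8,841 × 384 × 4 bytes = 13,579,776 bytes, roughly 13.6 MB. The Peak RSS column measures the whole process, so it is expected to be far larger than the vector buffer alone.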

Multi-agent cost reduction

On a 50-question complex-research workload (multi-hop questions over a 200-document corpus), the multi-agent coordination layer shares retrieved evidence on a blackboard so agents don't re-query and re-read the same chunks.

| Configuration | LLM tokens | Retrievals | Total cost | Δ vs naïve |
| --- | --- | --- | --- | --- |
| Naïve sequential agents | 1,284,300 | 612 | $0.71 | |
| LangGraph (shared state) | 892,100 | 418 | $0.49 | −31% |
| RAGForge blackboard | 541,800 | 237 | $0.30 | −58% |

Savings come from stigmergic deduplication: every retrieval result is hashed and posted to the blackboard. Before an agent issues a query, it checks the board for a semantically equivalent prior query (cosine similarity ≥ 0.94 between the query embeddings). Cache hits skip the retriever entirely and reuse the cited chunks.

Why choose RAGForge over LangChain / LlamaIndex / Haystack?

CapabilityRAGForgeLangChainLlamaIndexHaystack
Zero required deps (core install)
Built-in evaluation + golden setsPartial (langsmith)Partial
Safe embedder migration with validation
Multi-agent blackboard w/ dedupManualManual
A/B compare two KBs out of the box
Pluggable registry (no class hierarchy)PartialPartial
CLI + HTTP + Python parityPython onlyPython onlyPython + REST
Single-file vector store (no daemon)Partial

Pick RAGForge if you…

  • Need to prove a change helped — eval and A/B compare are first-class, not a separate SaaS.
  • Ship in environments where you can't run a vector DB daemon (edge boxes, CI, notebooks).
  • Have to swap embedding models in production without a re-index outage.
  • Want one mental model across CLI, library, and HTTP — same components, same names.

Architecture at a glance

Single pipeline. Pluggable at every stage. The dashed boxes are optional.

architecture
        INPUTS                       PIPELINE                          OUTPUTS
   ┌────────────────┐    ┌──────────────────────────────┐    ┌──────────────────┐
   │  PDF / MD /    │    │   Parser  →  Chunker         │    │   Ranked chunks  │
   │  HTML / DOCX   │──▶ │     │           │            │ ─▶ │   + citations    │
   │  URL / Folder  │    │     ▼           ▼            │    │                  │
   └────────────────┘    │  Document    Chunk[]         │    │  Grounded answer │
                         │                 │            │    │  (optional LLM)  │
   ┌────────────────┐    │                 ▼            │    │                  │
   │  Question      │──▶ │  Embedder  →  Vector Store   │ ─▶ │  Eval report     │
   │  (text)        │    │                 │            │    │  (optional)      │
   └────────────────┘    │                 ▼            │    └──────────────────┘
                         │  Retriever (BM25 │ Dense │   │
                         │             Hybrid)          │
                         │                 │            │
                         │                 ▼            │
                         │  Reranker  →  Prompt Builder │
                         │                 │            │
                         │           ┌─────┴─────┐      │
                         │           ▼           ▼      │
                         │      ┌ ─ ─ ─ ┐  ┌ ─ ─ ─ ─ ┐ │
                         │       LLM        Evaluator   │
                         │      └ ─ ─ ─ ┘  └ ─ ─ ─ ─ ┘ │
                         └──────────────────────────────┘
                                       ▲
                                       │
                         ┌─────────────┴──────────────┐
                         │  Registry  (plugin lookup) │
                         │  Blackboard (agent memory) │
                         └────────────────────────────┘

Every box in the pipeline is a registry entry. Swap retriever="hybrid" for retriever="dense", or register a custom one with @register("retriever", "mine"). The Registry and Blackboard are the only two cross-cutting primitives.
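The registry idea can be illustrated with a few lines of plain Python. The `@register("retriever", "mine")` decorator shape comes from the text above. Everything else here (the `_REGISTRY` dict, the `resolve` helper, the `KeywordRetriever` class and its `retrieve` method) is a hypothetical sketch, not RAGForge's actual implementation.

```python
_REGISTRY = {}  # (kind, name) -> factory; hypothetical internal table

def register(kind, name):
    """Decorator that files a component factory under (kind, name)."""
    def decorator(factory):
        _REGISTRY[(kind, name)] = factory
        return factory
    return decorator

def resolve(kind, name, **config):
    """Look a component up once and return a plain Python object."""
    try:
        factory = _REGISTRY[(kind, name)]
    except KeyError:
        raise KeyError(f"no {kind} registered as {name!r}") from None
    return factory(**config)

@register("retriever", "mine")
class KeywordRetriever:
    """Toy retriever: ranks chunks by how many question words they share."""

    def __init__(self, top_k=5):
        self.top_k = top_k

    def retrieve(self, question, chunks):
        terms = set(question.lower().split())
        scored = [(len(terms & set(chunk.lower().split())), chunk) for chunk in chunks]
        scored = [(score, chunk) for score, chunk in scored if score > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in scored[: self.top_k]]

# Resolution happens once; after that the retriever is an ordinary object.
retriever = resolve("retriever", "mine", top_k=1)
print(retriever.retrieve("shadow index", ["the shadow index is built first", "unrelated text"]))
```

Because `resolve` runs once at load time, the per-query path only touches the returned object, which matches the latency argument made in the benchmark section.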

What's actually new

Deliberately scoped — these are the parts we'd point a reviewer at. Implementations live in the linked source.

1. Shadow-index migration with validation

Most frameworks treat embedder swaps as "drop the index and re-ingest." RAGForge builds a shadow store with the new model, probes both stores with N real chunks, measures top-k overlap, and only cuts over after writing a backup. The quality delta is returned in the response — see migrator.py. We are not aware of another open-source RAG framework that does this in-process.
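The validate-then-cut-over flow described above might look roughly like the following. This is a sketch under stated assumptions, not the code in migrator.py: the function names, the report fields, and in particular the `min_overlap` gate and its default are all hypothetical. The source only says that top-k overlap is measured and that a backup is written before the cutover.

```python
def topk_overlap(old_ids, new_ids):
    """Fraction of the old store's top-k that the shadow store also returns."""
    if not old_ids:
        return 1.0
    return len(set(old_ids) & set(new_ids)) / len(old_ids)

def validate_migration(probes, old_search, new_search, k=5, min_overlap=0.6):
    """Probe both stores with the same texts and summarise how much they agree."""
    if not probes:
        raise ValueError("need at least one probe")
    overlaps = [topk_overlap(old_search(p, k), new_search(p, k)) for p in probes]
    mean_overlap = sum(overlaps) / len(overlaps)
    return {
        "probes": len(probes),
        "mean_overlap": mean_overlap,
        "ok": mean_overlap >= min_overlap,  # hypothetical acceptance gate
    }

def migrate(probes, old_search, new_search, write_backup, cut_over, k=5, min_overlap=0.6):
    """Cut over to the shadow store only if validation passes, backing up first."""
    report = validate_migration(probes, old_search, new_search, k, min_overlap)
    if report["ok"]:
        write_backup()  # the backup is written strictly before the switch
        cut_over()
    report["cut_over"] = report["ok"]
    return report
```

Here `old_search` and `new_search` are any callables taking `(probe_text, k)` and returning a list of chunk IDs, so the same check works regardless of which embedder sits behind each store.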

2. Stigmergic blackboard for multi-agent retrieval

Inspired by ant-colony stigmergy: agents leave structured "pheromone" traces (query embedding, returned chunk hashes, score, timestamp) on a shared board. New retrieval calls first check the board for a near-duplicate query (cosine ≥ τ, default 0.94) and reuse the cited evidence when found. This is what produces the 58% cost reduction above. The exact decay and similarity threshold tuning is intentionally not described here. Entry point: blackboard.py.
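A stripped-down version of the near-duplicate check could be sketched as follows. The trace fields and the default τ = 0.94 come from the description above. The `Trace` and `Blackboard` classes, the `retrieve` method, and the retriever's `(chunk_text, score)` return shape are assumptions for illustration, not the contents of blackboard.py. Decay is omitted because it is deliberately not described here.

```python
import hashlib
import math
import time
from dataclasses import dataclass

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

@dataclass
class Trace:
    query_embedding: list
    chunk_hashes: list
    score: float
    timestamp: float

class Blackboard:
    """Hypothetical shared board: reuse evidence for near-duplicate queries."""

    def __init__(self, tau=0.94):
        self.tau = tau
        self.traces = []
        self.hits = 0
        self.misses = 0

    def lookup(self, query_embedding):
        """Return the most similar prior trace with cosine >= tau, or None."""
        best, best_sim = None, self.tau
        for trace in self.traces:
            sim = cosine(query_embedding, trace.query_embedding)
            if sim >= best_sim:
                best, best_sim = trace, sim
        return best

    def retrieve(self, query_embedding, retriever):
        """Reuse a prior result on a hit; otherwise call the retriever and post a trace."""
        trace = self.lookup(query_embedding)
        if trace is not None:
            self.hits += 1
            return trace.chunk_hashes
        self.misses += 1
        results = retriever(query_embedding)  # assumed: list of (chunk_text, score)
        hashes = [hashlib.sha256(text.encode("utf-8")).hexdigest() for text, _ in results]
        top_score = max((score for _, score in results), default=0.0)
        self.traces.append(Trace(list(query_embedding), hashes, top_score, time.time()))
        return hashes
```

Every hit avoids one retriever call and the LLM tokens spent re-reading the same chunks, which is the mechanism behind the lower Retrievals and LLM tokens figures reported above.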

3. Dual-judge evaluation with disagreement gating

Faithfulness and answer relevance are scored by two independent LLM judges. When they disagree by more than a configured margin, the example is flagged for human review instead of silently averaged. This keeps regressions introduced by prompt tweaks from hiding behind an averaged score. See judges.py.
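The gating rule itself is small enough to show in full. The sketch below assumes each judge is a callable returning a score in [0, 1]. The function name, the return shape, and the 0.2 default margin are illustrative and are not taken from judges.py.

```python
def gate_by_disagreement(examples, judge_a, judge_b, margin=0.2):
    """Split examples into accepted (averaged score) and flagged (needs human review)."""
    accepted, flagged = [], []
    for example in examples:
        score_a, score_b = judge_a(example), judge_b(example)
        if abs(score_a - score_b) > margin:
            # The judges disagree too much: do not average, surface both scores.
            flagged.append({"example": example, "scores": (score_a, score_b)})
        else:
            accepted.append({"example": example, "score": (score_a + score_b) / 2})
    return accepted, flagged

# Stub judges standing in for two independent LLM calls.
scores_a = {"q1": 0.9, "q2": 0.8}
scores_b = {"q1": 0.85, "q2": 0.3}
accepted, flagged = gate_by_disagreement(["q1", "q2"], scores_a.get, scores_b.get)
print(len(accepted), len(flagged))
```

Only the accepted list feeds the aggregate metric; the flagged list is what a reviewer would look at.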

4. Structure-aware chunking with section inheritance

The default chunker walks heading levels and attaches the full breadcrumb (H1 ▸ H2 ▸ H3) to each chunk's metadata, then embeds the chunk text prefixed with that breadcrumb. On the MS MARCO subset this alone contributes +3.1 points of nDCG@10 over naïve fixed-window chunking with an identical embedder and retriever.
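As an illustration of section inheritance, here is a minimal breadcrumb chunker. It assumes Markdown-style `#` headings as input and emits one chunk per section body. The function name and the `embed_text` field are hypothetical, and the real chunker presumably also handles size limits and other formats.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def chunk_with_breadcrumbs(markdown_text, sep=" ▸ "):
    """Split on headings and attach the heading trail to every chunk."""
    trail = []   # stack of (level, title) for the headings currently in scope
    body = []    # lines collected since the last heading
    chunks = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            breadcrumb = sep.join(title for _, title in trail)
            chunks.append({
                "text": text,
                "metadata": {"breadcrumb": breadcrumb},
                # What would be sent to the embedder: breadcrumb first, then the text.
                "embed_text": f"{breadcrumb}\n{text}" if breadcrumb else text,
            })
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before descending.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks

doc = "# Guide\nintro\n## Install\npip install\n### Extras\nmore\n## Usage\nrun it"
for chunk in chunk_with_breadcrumbs(doc):
    print(chunk["metadata"]["breadcrumb"])
```

The chunk under "Extras" inherits both of its parent headings, so a query about installation can match it even when the chunk text never mentions installing.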

Reproduce

bash
git clone https://github.com/samsuljahith/RagForge && cd RagForge
pip install -e ".[all]"
cd benchmarks
python download_msmarco_subset.py
python run_quality.py     --frameworks ragforge,langchain,llamaindex,haystack
python run_latency.py     --frameworks ragforge,langchain,llamaindex,haystack --runs 1000
python run_memory.py      --frameworks ragforge,langchain,llamaindex,haystack
python run_agent_cost.py  --scenarios sequential,langgraph,ragforge

Numbers in the tables above are the medians from the last run committed in benchmarks/results/. Re-run on your hardware; PRs with new data points welcome.