So You Think You Need RAG? You Probably Don’t...

RAG vs Old-School BM25

The Retrieval-Augmented Generation (RAG) boom is dazzling teams who want “ChatGPT-like” answers over their own data. Yet for many business and agentic workloads, the old workhorse of lexical (keyword) search still beats RAG on speed, cost and engineering simplicity. This article explains why.

RAG: Why All the Hype?

RAG marries a neural vector search layer with a large language model (LLM):

  1. Source texts are converted into high-dimensional vectors (embeddings).
  2. A similarity search fetches the k closest vectors for a question.
  3. The LLM reads those passages in its prompt and generates an answer.
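
The three steps above can be sketched in a few lines of Python. This is a toy illustration only: `toy_embed` is a hypothetical stand-in (a hashed bag-of-words vector) for a real embedding model, so it does not capture semantic similarity, and `build_prompt` stops short of calling any LLM. All names here are assumptions made for the example.

```python
import math
import zlib
from collections import Counter

DIM = 64  # toy dimensionality; real embedding models use far more dimensions


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(token.encode()) % DIM] += count
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def top_k(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Step 2: brute-force cosine similarity, keeping the k closest passages."""
    q = toy_embed(question)
    scored = [(sum(a * b for a, b in zip(q, toy_embed(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_prompt(question: str, passages: list[str], k: int = 2) -> str:
    """Step 3: place the retrieved passages in the prompt that an LLM would read."""
    context = "\n".join(f"- {p}" for p in top_k(question, passages, k))
    return f"Answer using only the context below.\n{context}\n\nQuestion: {question}"
```

A production stack would replace the brute-force scan with an approximate nearest-neighbour index and send the prompt to a model; the shape of the pipeline stays the same.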

Because dense vectors capture semantic similarity, RAG handles paraphrases (“car hire” ≈ “vehicle rental”) and mitigates hallucination by grounding the model in real passages.

That semantic edge is powerful for open-ended Q&A over heterogeneous content. It also fuels a booming ecosystem of vector databases, GPU appliances and hosted embedding services.

Hidden Trade-offs

  1. Storage blow-up. Each 1 kB chunk becomes a 1 536-dimensional float32 vector (1 536 × 4 B ≈ 6 kB) plus index structures. Vendors warn of 2-10× raw-data expansion in practice.
  2. Memory residency. Graph-based ANN structures (HNSW) must sit in RAM for millisecond latency, tying capacity to DRAM rather than cheap disk.
  3. Complex pipelines. You maintain ingestion, chunking, embedding refresh, relevance tuning and prompt engineering, and that work often eclipses the LLM call itself.

Keyword Search Never Left

A classical inverted index maps each term to a compressed posting list of document IDs. The engineering lessons of over two decades of web search make it a remarkably efficient foundation for agentic tasks:
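
As a rough sketch of the data structure, the snippet below builds an uncompressed term-to-posting-list map and answers boolean AND queries. The function names and the whitespace tokenizer are assumptions for illustration; real engines add compression, positions and far better tokenization.

```python
from collections import defaultdict


def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to a sorted posting list of document IDs."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_and(index: dict[str, list[int]], query: str) -> list[int]:
    """Return IDs of documents that contain every query term (boolean AND)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


docs = {
    1: "restart the billing service",
    2: "billing config value",
    3: "restart the cache",
}
index = build_index(docs)
matches = search_and(index, "restart billing")  # only doc 1 contains both terms
```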

Feature              | Lexical Index                                                                      | Typical RAG Stack
---------------------|------------------------------------------------------------------------------------|--------------------------------------------------
Disk footprint       | 3-10% of source text after modern compression; some research corpora hit ~6%       | ≥ 200% once vectors and ANN graphs are stored
RAM needed for query | Bytes per posting (often ≤ 2%) loaded on demand                                    | Entire vector graph kept hot in memory
Query latency        | Sub-10 ms for AND/OR/WAND across millions of docs on commodity CPUs                | Tens of ms ANN + 100-ms LLM generation
Explainability       | Exact term positions; deterministic scoring (BM25)                                 | Approximate similarity + stochastic LLM output

Software agents iteratively ask focused, structured questions (facts, IDs, configuration values). Unlike human users, they are not put off by keyword syntax or by paging through results, the friction that pushed people toward semantic tools. Agents:

  • Issue many rapid micro-queries—low latency matters.
  • Prefer deterministic, replayable results.
  • Can chain queries (“drill down”) instead of demanding one perfect paragraph.

For such automated workflows, a tuned BM25 or term-proximity scorer returns pinpoint passages faster and cheaper than spinning GPUs for every step.
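
For reference, here is a minimal sketch of BM25 scoring over an in-memory list of documents. It assumes a non-empty document list, whitespace tokenization, the common defaults k1 = 1.2 and b = 0.75, and the non-negative IDF variant ln((N − n + 0.5)/(n + 0.5) + 1); the function name is hypothetical, and a real engine would score from posting lists rather than rescanning every document.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document, in the same order as `docs` (must be non-empty)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "timeout config value for billing service",
    "cache restart procedure",
    "billing service restart steps and billing timeout",
]
scores = bm25_scores("cache restart", docs)
```

The same query against the same index always yields the same scores, which is the replayability that agents benefit from.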

History Lesson: Google’s Three Ages

  1. 1998-2004 – Keyword Era. Google won on a clean UI and a lightning-fast inverted index spanning 26 million pages at launch.
  2. 2004-2012 – PageRank Age. As the web ballooned far past a billion pages, patience for paging through results faded. Ranking links by global citation (PageRank, part of Google's design from the start) became the decisive way to surface the “best” answer on the first screen.
  3. 2012-present – Knowledge Graph & ML. Graph-derived entities and BERT-style re-rankers pushed ranking accuracy still higher and powered zero-click answers.

Observation: Each leap forward augmented the lexical core; it never replaced it. Even the latest MUM/BERT stacks run atop the same compressed posting lists, which occupy a fraction of the size of the data they index.

Storage Math: 2% vs 200%+

Lexical Index

Modern gap-encoded, SIMD-decoded postings achieve 5-8 bits per document pointer. On a 1 GB text corpus:

  • Raw tokens: 1 000 MB
  • Compressed postings: ~60 MB (≈ 6%)
  • Dictionary & term stats: ~10 MB
  • Total ≈ 70 MB → ≈ 7%.

Projects that drop positions or store only IDs compress further, dipping near 2% for log or metric datasets.
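
To make "gap-encoded" concrete, the sketch below stores the differences between consecutive document IDs and writes each gap with variable-byte encoding. This is a simpler scheme than the SIMD bit-packing mentioned above and cannot go below 8 bits per pointer, so treat it as an illustration of the idea rather than of the 5-8 bit figure; the function names are assumptions.

```python
def encode_postings(doc_ids: list[int]) -> bytes:
    """Gap-encode a sorted list of doc IDs, then variable-byte encode each gap."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation flag set
            gap >>= 7
        out.append(gap)  # last byte of this gap, continuation flag clear
    return bytes(out)


def decode_postings(data: bytes) -> list[int]:
    """Invert encode_postings: rebuild gaps, then prefix-sum them back into doc IDs."""
    doc_ids, current, shift, prev = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current
            doc_ids.append(prev)
            current, shift = 0, 0
    return doc_ids


ids = [3, 7, 8, 150, 100000]
encoded = encode_postings(ids)  # 8 bytes, versus 20 bytes as five 32-bit integers
```

Dense posting lists have small gaps, so most pointers fit in a single byte; bit-level codecs shrink them further.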

RAG Pipeline

Take the same 1 GB corpus, chunked into 500-byte passages:

  • 2 000 000 chunks × 1 536 dims × 4 B (float32) ≈ 12.3 GB (vectors)
  • HNSW graph links (~64% of vector size) ≈ 7.9 GB
  • Text copy for LLM prompts (~1 GB, or an external store)
  • Total ≈ 21 GB → roughly 2 100% of source size, far past the 200% mark, before replicas for availability.

Thus, RAG’s memory & storage bill can be two orders of magnitude higher than a lean lexical index.
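
The arithmetic is easy to rerun for your own corpus. The calculator below is a back-of-the-envelope sketch that takes its inputs from the text (1 GB corpus, 500-byte chunks, 1 536 float32 dimensions, graph links at ~64% of vector size, a ~70 MB lexical index); the function name and parameters are assumptions, and it ignores quantization, replicas and metadata.

```python
def vector_store_bytes(
    corpus_bytes: int,
    chunk_bytes: int,
    dims: int,
    bytes_per_dim: int = 4,        # float32
    graph_overhead: float = 0.64,  # HNSW links as a fraction of vector size (figure from the text)
    keep_text: bool = True,        # keep a copy of the passages for LLM prompts
) -> dict[str, int]:
    """Estimate the storage footprint of an uncompressed vector index."""
    chunks = corpus_bytes // chunk_bytes
    vectors = chunks * dims * bytes_per_dim
    graph = int(vectors * graph_overhead)
    text = corpus_bytes if keep_text else 0
    return {
        "chunks": chunks,
        "vectors": vectors,
        "graph": graph,
        "text": text,
        "total": vectors + graph + text,
    }


CORPUS = 1_000_000_000   # 1 GB of source text
LEXICAL = 70_000_000     # ~70 MB lexical index, from the section above

rag = vector_store_bytes(CORPUS, chunk_bytes=500, dims=1536)
expansion = rag["total"] / CORPUS       # vector stack size relative to the source text
vs_lexical = rag["total"] / LEXICAL     # vector stack size relative to the lexical index
```

Larger chunks, fewer dimensions or quantized vectors all shrink the result, so it is worth plugging in your own parameters before committing to hardware.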

When You Do Need RAG

Semantic retrieval shines when:

  1. Queries are long, fuzzy or multilingual.
  2. You lack a curated list of domain-specific synonyms.
  3. Users demand natural-language answers, not SQL-like filters.
  4. You must fuse modalities (image + text vectors).

If those constraints define your product, budget for vector infra and treat RAG as a premium feature atop (not instead of) a lexical baseline.

Practical Hybrid Recipe

Many teams get the best of both worlds:

  1. Primary BM25 index handling the bulk (up to ~95%) of agent queries.
  2. Light vector side-car covering only the long-tail content where lexical search fails (e.g., support tickets).
  3. Fall-through prompt: only when the top BM25 score falls below a tuned threshold τ do you pay for embeddings + LLM.

This design keeps the hot path CPU-only while containing GPU and storage costs.
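
A minimal sketch of the fall-through logic might look like this. The two search backends are passed in as plain callables and are entirely hypothetical, as are the function name and the `tau` value used in the usage lines; the only rule taken from the recipe is "use the semantic path only when the best BM25 score is below τ".

```python
from typing import Callable

Hits = list[tuple[str, float]]  # (document ID, score) pairs


def hybrid_search(
    query: str,
    lexical_search: Callable[[str], Hits],
    semantic_search: Callable[[str], Hits],
    tau: float,
) -> tuple[str, Hits]:
    """Answer from BM25 when its best score reaches tau; otherwise fall through."""
    hits = lexical_search(query)
    best = max((score for _, score in hits), default=0.0)
    if best >= tau:
        return "lexical", hits  # hot path: CPU-only, no embeddings or LLM
    return "semantic", semantic_search(query)  # cold path: pay for vectors + LLM


# Usage with stub backends standing in for real indexes (illustrative values only).
def fake_bm25(query: str) -> Hits:
    return [("doc-1", 7.2)] if "billing" in query else [("doc-9", 0.4)]


def fake_vectors(query: str) -> Hits:
    return [("ticket-42", 0.83)]


route_a = hybrid_search("billing timeout", fake_bm25, fake_vectors, tau=2.0)
route_b = hybrid_search("my invoice looks odd", fake_bm25, fake_vectors, tau=2.0)
```

Logging which route each query takes gives you the data to tune τ and to check how much traffic really needs the semantic path.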

Takeaways

  • Start with lexical search. It’s battle-tested, tiny (< 10% storage) and screaming fast on CPUs.
  • Add semantic layers surgically. Vector search is powerful but heavy (≥ 200% storage) and operationally complex.
  • Remember Google’s lesson. Keyword → PageRank → Knowledge Graph were additive; the inverted index still powers every query.
  • Agents don’t need pretty results; they need reliable ones. Deterministic BM25 + filters often beat probabilistic embeddings for autonomous workflows.

So before you rush to embed every sentence and spin up a GPU fleet, ask yourself: do you really need RAG?
For many workloads the answer is still: probably not.