Vectorless RAG with PageIndex: Can an LLM Replace Your Embedding Model?
A side-by-side comparison of tree-based LLM retrieval against three vector embedding providers, evaluated on the same corpus and queries from our earlier benchmark.
The conventional RAG pipeline is well understood: chunk your documents, embed them into vectors, store them in a vector database, and at query time embed the query and find the nearest neighbors. It works. But the approach rests on an assumption that is rarely questioned: that the best way to find relevant text is to compress it into a fixed-dimension vector and measure cosine distance.
What if you skipped the vectors entirely?
PageIndex takes a fundamentally different approach. Instead of embedding chunks into vector space, it uses an LLM to build a hierarchical tree index of each document – titles, summaries, and structural relationships. At query time, another LLM call reasons over the tree structure to identify which sections are relevant. No embeddings. No vector database. Just structured text and language model reasoning.
We integrated PageIndex into our existing multi-provider RAG system and ran it against the same test methodology from our embedding comparison. This article presents what happened.
Note: All results come from a real evaluation run. PageIndex used Qwen3-32B via Ollama for both indexing and retrieval. Vector providers used the same configuration as our previous benchmark.
The Setup
How PageIndex Works
The PageIndex pipeline differs from vector RAG at every stage:
Vector RAG: Document --> Chunk --> Embed --> Store in Milvus --> Query Embed --> Cosine Search
PageIndex: Document --> LLM Tree Build --> Store as JSON --> LLM Tree Search --> Extract Text
Ingestion: For each document, PageIndex makes multiple LLM calls to construct a hierarchical tree. Each node in the tree gets a title, a summary, and the original text. PDFs are processed natively; other formats are first converted to Markdown via Docling, then fed to PageIndex's md_to_tree() function.
Retrieval: Given a query, PageIndex iterates over every indexed document. For each document, it strips the text from the tree (keeping only titles and summaries), presents the structure to the LLM with the query, and asks: "which nodes are relevant?" The LLM reasons over the tree and returns a list of node IDs. The system then extracts the full text from those nodes.

    # The tree search prompt (simplified)
    """You are given a question and a tree structure of a document.
    Each node contains a node id, node title, and a summary.
    Find all nodes likely to contain the answer.
    Question: {query}
    Document tree: {tree_without_text}
    Reply as JSON: {"thinking": "...", "node_list": ["node_id_1", ...]}"""

This is conceptually elegant: the LLM understands the document's structure and can reason about relevance at a semantic level that no distance metric can match. But it comes with a cost we will quantify shortly.
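The retrieval step for a single document can be sketched roughly as follows. This is an illustration under assumptions, not PageIndex's actual code: the node schema (`node_id`, `title`, `summary`, `text`, `nodes`) and the `ask_llm` callable are hypothetical stand-ins for whatever the library uses internally.

```python
import json
from typing import Callable, Dict, List, Optional

# Hypothetical node schema (an assumption, not PageIndex's real format):
# {"node_id": str, "title": str, "summary": str, "text": str, "nodes": [children]}

PROMPT = (
    "You are given a question and a tree structure of a document.\n"
    "Each node contains a node id, node title, and a summary.\n"
    "Find all nodes likely to contain the answer.\n"
    "Question: {query}\n"
    "Document tree: {tree}\n"
    'Reply as JSON: {{"thinking": "...", "node_list": ["node_id_1", ...]}}'
)


def strip_text(node: dict) -> dict:
    """Copy a node without its full text: keep id, title, summary, children."""
    slim = {k: node[k] for k in ("node_id", "title", "summary") if k in node}
    slim["nodes"] = [strip_text(child) for child in node.get("nodes", [])]
    return slim


def index_by_id(node: dict, out: Optional[Dict[str, dict]] = None) -> Dict[str, dict]:
    """Flatten the tree into {node_id: node} so text can be looked up later."""
    out = {} if out is None else out
    out[node["node_id"]] = node
    for child in node.get("nodes", []):
        index_by_id(child, out)
    return out


def search_document(tree: dict, query: str, ask_llm: Callable[[str], str]) -> List[str]:
    """Ask the LLM which nodes are relevant, then return their full text."""
    prompt = PROMPT.format(query=query, tree=json.dumps(strip_text(tree)))
    reply = json.loads(ask_llm(prompt))
    nodes = index_by_id(tree)
    # Skip ids the model made up; an empty node_list means "nothing relevant".
    return [nodes[i]["text"] for i in reply.get("node_list", []) if i in nodes]
```

Passing the model call in as `ask_llm` keeps the sketch independent of any particular client. A full query would call `search_document` once per indexed document, which is where the latency discussed later comes from.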
Integration
We added PageIndex as a fourth retrieval provider alongside our existing three:
| Provider | Model | Approach | Index Storage |
|---|---|---|---|
| ollama | Qwen3-Embedding-0.6B | Vector (Ollama API) | Milvus |
| pplx | pplx-embed-v1-0.6B | Vector (SentenceTransformers) | Milvus |
| pplx-ctx | pplx-embed-context-v1-0.6B | Vector (contextualized) | Milvus |
| pageindex | Qwen3-32B (via Ollama) | LLM tree search | JSON on disk |
The key architectural difference: vector providers embed documents once and search by vector similarity. PageIndex makes LLM calls at both ingest time (to build the tree) and query time (to search it). The system routes queries accordingly – vector providers go through Milvus, PageIndex queries go through sequential LLM tree search across all indexed documents.
All four providers share the same upload, delete, list, and health endpoints. The pipeline determines which path to take based on the provider name.
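The routing by provider name can be pictured as a small dispatch function. This is a minimal sketch, not our production code: the provider names are the ones used in the results tables, while `route_query` and the two handler callables are hypothetical placeholders for the Milvus path and the tree-search path.

```python
from typing import Callable, List

# Provider names as they appear in this article's result tables.
VECTOR_PROVIDERS = {"ollama", "pplx", "pplx-ctx"}
TREE_PROVIDERS = {"pageindex"}


def route_query(
    provider: str,
    query: str,
    vector_search: Callable[[str, str], List],
    tree_search: Callable[[str], List],
) -> List:
    """Dispatch a query to the Milvus path or the LLM tree-search path."""
    if provider in VECTOR_PROVIDERS:
        # Embed the query with this provider's model, then search its collection.
        return vector_search(provider, query)
    if provider in TREE_PROVIDERS:
        # Sequential LLM tree search across all indexed documents.
        return tree_search(query)
    raise ValueError(f"unknown provider: {provider!r}")
```

Keeping the decision in one place means upload, delete, list, and health handlers do not need to know which retrieval strategy sits behind a provider.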
Technical Details
Self-hosted PageIndex uses OpenAI-compatible API calls internally. Since we run everything through Ollama, we monkey-patch the OpenAI client to route through Ollama's endpoint:

    def _patch_openai_for_ollama():
        import openai

        original_init = openai.OpenAI.__init__

        def _patched_init(self, **kwargs):
            # Send every client to Ollama's OpenAI-compatible endpoint
            kwargs["base_url"] = "http://localhost:11434/v1"
            # Supply a placeholder key if the caller did not pass one
            kwargs.setdefault("api_key", "ollama")
            original_init(self, **kwargs)

        # Only clients created after this assignment are affected
        openai.OpenAI.__init__ = _patched_init

PageIndex also uses tiktoken for token counting, which only knows OpenAI model names. We patch it to fall back gracefully:

    import tiktoken

    _original = tiktoken.encoding_for_model

    def _patched(model_name):
        try:
            return _original(model_name)
        except KeyError:
            # Unknown (non-OpenAI) model name: fall back to a general-purpose encoding
            return tiktoken.get_encoding("cl100k_base")

    tiktoken.encoding_for_model = _patched

Test Methodology
We reused the same 16 queries from our embedding comparison and sampled a fresh corpus of 18 documents (2 per topic across 9 categories), mixing file formats:
| Dimension | Breakdown |
|---|---|
| Documents | 18 total (2 per topic) |
| File types | 12 Markdown, 2 DOCX, 2 PPTX, 2 PDF |
| Topics | business, entertainment, health, opinion, politics, science, sports, technology, world |
| Source size | 0.78 MB |
The corpus is deliberately smaller than our 72-document embedding comparison. This was not arbitrary – it was a hard constraint imposed by PageIndex's architecture, as we will explain in the results.
Chunking vs. Tree Building
Vector providers used the same HybridChunker pipeline as before (512-token chunks, Qwen3 tokenizer, merge_peers=True), producing 114 chunks across the 18 documents.
PageIndex built tree indices for the same documents, producing 85 nodes total. The difference is structural: PageIndex creates hierarchical nodes (title, section, subsection) rather than fixed-size text windows. A single document might produce 2 nodes (title + body) or 21 nodes (deeply structured content), depending entirely on the document's internal organization.
Results
Upload Performance
All 18 documents were uploaded to all four providers in a single request per document. Vector embedding was fast; PageIndex tree building was not.
| Provider | Total Ingest Time | Avg per Document | Method |
|---|---|---|---|
| Vector (all 3) | ~30s combined | ~1.7s | Embed + Milvus insert |
| PageIndex | ~20 min total | ~67s | Multiple LLM calls per document |
The variation in PageIndex ingest time was extreme. A 0.9 KB Markdown file about US troop withdrawal took 7 seconds total (including vector embedding). A 15 KB science roundup article took 459 seconds – almost 8 minutes – because PageIndex made many LLM calls to build a 16-node tree. A 446 KB PPTX file, paradoxically, took only 7 seconds because it produced simple content that resolved quickly.
The pattern: PageIndex ingest time scales with document complexity, not file size. A short but structurally rich document takes far longer than a large but simple one.
Storage Comparison
| Provider | Index Type | Chunks/Nodes | Storage | Per Entry |
|---|---|---|---|---|
| ollama | Milvus vectors | 114 | 0.564 MB | 5,191 bytes |
| pplx | Milvus vectors | 114 | 0.564 MB | 5,191 bytes |
| pplx-ctx | Milvus vectors | 114 | 0.564 MB | 5,191 bytes |
| pageindex | JSON on disk | 85 | 0.130 MB | 1,606 bytes |
PageIndex uses 77% less total storage than a vector provider (0.130 MB vs. 0.564 MB) and about 69% less per entry (1,606 vs. 5,191 bytes). This makes sense: vector entries store a 1024-dim float32 vector (4,096 bytes) alongside the text, while PageIndex stores only structured text (titles, summaries, content) as JSON. No vectors means no vector storage overhead.
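The storage figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the table above; the variable names are ours.

```python
# Figures copied from the storage table.
vector_mb = 0.564              # one Milvus collection, 114 chunks
tree_mb = 0.130                # PageIndex JSON on disk, 85 nodes
vector_bytes_per_entry = 5191
tree_bytes_per_entry = 1606
vector_payload_bytes = 1024 * 4  # 1024 float32 values = 4,096 bytes

total_saving = 1 - tree_mb / vector_mb
per_entry_saving = 1 - tree_bytes_per_entry / vector_bytes_per_entry
vector_share = vector_payload_bytes / vector_bytes_per_entry

print(f"total storage saving:   {total_saving:.0%}")
print(f"per-entry saving:       {per_entry_saving:.0%}")
print(f"vector share of entry:  {vector_share:.0%}")
```

This prints savings of 77% in total and 69% per entry, with the raw vector accounting for about 79% of each Milvus entry. The total saving is larger than the per-entry saving because PageIndex also produced fewer entries (85 nodes vs. 114 chunks).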
Query Latency
This is where the fundamental architectural trade-off becomes undeniable.
| Provider | Avg Latency | Min | Max | Completed |
|---|---|---|---|---|
| ollama | 48 ms | 28 ms | 121 ms | 16/16 |
| pplx | 42 ms | 25 ms | 106 ms | 16/16 |
| pplx-ctx | 17,864 ms | 4,625 ms | 53,743 ms | 16/16 |
| pageindex | 575,288 ms (~9.6 min) | 545,340 ms | 589,666 ms | 7/16 |
Read that again: PageIndex averaged 9.6 minutes per completed query, and nine of sixteen queries timed out at the 10-minute limit.
The reason is architectural. Vector search latency is nearly independent of corpus size at this scale – the Milvus index narrows the search to a small part of the collection instead of comparing the query against every document. PageIndex search is O(N): it must make a separate LLM call for every indexed document to determine if that document is relevant. With 18 documents and ~30 seconds per LLM call, each query takes approximately 18 × 30 = 540 seconds, which sits just under the 600-second timeout and explains why so many queries ran past it.
This scaling characteristic is not a bug or an implementation issue. It is fundamental to how tree-based retrieval works. Every document must be individually evaluated because there is no shared index structure that allows ruling out documents without reading them.
Retrieval Quality
Of the 7 queries that PageIndex completed, the results were striking:
| Query | ollama | pplx | pplx-ctx | pageindex | PageIndex Result |
|---|---|---|---|---|---|
| AI military risks | 0.543 | 0.219 | 0.204 | 1.000 | Found science roundup (relevant section) |
| iPad Air M4 chip | 0.531 | 0.149 | 0.239 | 0.000 | Correctly returned nothing |
| Clouds in Antarctica | 0.490 | 0.185 | 0.224 | 0.000 | Correctly returned nothing |
| Ukraine war drones | 0.486 | 0.117 | 0.201 | 0.000 | Correctly returned nothing |
| Iran threats | 0.498 | 0.190 | 0.191 | 0.000 | Correctly returned nothing |
| MLS rule changes | 0.515 | 0.157 | 0.167 | 0.000 | Correctly returned nothing |
| Warner Bros sale | 0.562 | 0.178 | 0.227 | 0.000 | Correctly returned nothing |
PageIndex found relevant content for exactly 1 of 7 completed queries: "AI military risks." For that query, it identified a section in a science news roundup article that discussed military-adjacent technology topics. The LLM reasoned through the tree structure and picked the correct section. Score: 1.0 (our synthetic scoring for rank-based results).
For the other 6 completed queries, PageIndex returned zero results. This is the critical behavioral difference: when no document in the corpus matches the query, PageIndex says so. It does not return the "least irrelevant" document.
Compare this with the vector providers. For "Iran threats," ollama confidently returned an article about Snoop Dogg performing in a Welsh canteen (score: 0.498). For "MLS rule changes," it returned an EPA greenhouse gas article (score: 0.515). These are not just wrong – they are plausibly scored wrong answers that a downstream LLM would treat as legitimate context.
Analysis
The Precision-Recall Trade-off
PageIndex and vector RAG occupy opposite ends of the precision-recall spectrum:
Vector RAG: Always returns something. High recall (every query gets results), but variable precision (those results may be irrelevant, especially with keyword-sensitive models like Qwen3). The system never admits "I don't have anything relevant."
PageIndex: Returns results only when the LLM judges them relevant. High precision (in our run, the one result it returned was relevant), but lower recall (it may miss tangentially related content, and 56% of our queries timed out before completing).
Neither behavior is inherently better. If your application requires an answer for every query – even a best-effort one – vector RAG is the only viable choice. If your application would rather say "no relevant results" than surface wrong context, PageIndex's selectivity is a feature.
The Scaling Wall
The 9.6-minute average query time with just 18 documents reveals a hard scaling limit. Extrapolating linearly:
| Corpus Size | Estimated Query Time |
|---|---|
| 18 documents | ~9.6 minutes |
| 72 documents | ~38 minutes |
| 500 documents | ~4.4 hours |
| 10,000 documents | ~3.7 days |
Vector search, by contrast, handles millions of documents with sub-second latency. The IVF_FLAT index we use in Milvus scans only a few clusters of vectors per query, so latency grows far more slowly than the corpus.
This does not mean PageIndex is useless. It means PageIndex is a document-level analysis tool, not a corpus-level search engine. It excels when you have a small, curated set of documents and need the LLM to deeply understand their structure. Think: analyzing a specific contract, reasoning over a technical specification, or navigating a complex regulatory document. It is not designed for the "search 10,000 news articles" use case that vector RAG handles trivially.
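The linear extrapolation in the table above can be reproduced in a few lines. This is a sketch under two assumptions: cost stays at the measured 9.6 minutes per 18 documents, and documents are searched strictly one after another. The function names are ours.

```python
MEASURED_DOCS = 18
MEASURED_MINUTES = 9.6  # average over the queries that completed


def estimated_query_minutes(n_docs: int) -> float:
    """Linear extrapolation: one sequential LLM call per indexed document."""
    return MEASURED_MINUTES / MEASURED_DOCS * n_docs


def humanize(minutes: float) -> str:
    """Render a duration in the most readable unit."""
    if minutes < 60:
        return f"~{minutes:.1f} minutes"
    if minutes < 24 * 60:
        return f"~{minutes / 60:.1f} hours"
    return f"~{minutes / (24 * 60):.1f} days"


for n in (18, 72, 500, 10_000):
    print(f"{n:>6} documents: {humanize(estimated_query_minutes(n))}")
```

At the measured rate of roughly 0.53 minutes per document, 10,000 documents work out to about 3.7 days per query. Assuming a flat 30 seconds per document instead gives roughly 3.5 days. Either way the conclusion is the same.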
What PageIndex Gets Right
Despite the scaling limitations, the one successful result was genuinely impressive. For "AI military risks," the LLM:
- Examined a science news roundup with 16 hierarchical nodes
- Read each node's title and summary
- Identified that the "Also in science news this week" section contained references to military-adjacent technology
- Returned that specific section
This is reasoning, not pattern matching. None of the embedding models we tested found this connection through cosine similarity – the section's text does not contain the words "military" or "risks" in close proximity. The LLM understood the semantic relationship between the query and the content at a level that vector similarity did not reach in our run.
The vector providers, meanwhile, returned the "Who will pay for the trillion-dollar AI boom?" article – a keyword match on "AI" that misses the "military risks" specificity entirely.
The False Negative Problem
PageIndex returned zero results for 6 of 7 completed queries. In some cases this was correct (no document about "iPad Air M4 chip" existed in the corpus). But for queries like "Warner Bros sale" – where the opinion article about EPA policy was clearly not relevant, but other documents might have had tangentially useful content – the LLM's strict relevance threshold may have filtered out useful-but-imperfect matches.
This is the mirror image of vector RAG's false positive problem. Vector models return "Snoop Dogg in a Welsh canteen" for "Iran threats" (false positive). PageIndex returns nothing for queries where loosely related content exists (false negative). Both failure modes are dangerous in different ways.
Practical Recommendations
When to Use PageIndex
- Small, curated document sets (under 20 documents) where deep structural understanding matters
- High-stakes queries where returning wrong context is worse than returning nothing
- Document navigation – helping users find specific sections within a known document
- Hybrid pipelines – use vector RAG for initial retrieval, then PageIndex for deep analysis of the top-k candidates
When to Use Vector RAG
- Any corpus larger than ~50 documents – the O(N) query scaling makes PageIndex impractical
- Low-latency requirements – vector search returns in milliseconds, not minutes
- Best-effort retrieval – when users expect results for every query, even approximate ones
- Production systems – the deterministic, sub-second query time is operationally predictable
The Hybrid Approach
The most promising direction is combining both approaches. Use vector RAG to narrow 10,000 documents to 5 candidates in 50 milliseconds, then use PageIndex to deeply reason over those 5 candidates in 2-3 minutes. You get the scaling of vectors with the precision of LLM reasoning.
Our system already supports this architecturally – the /compare endpoint runs all providers in parallel, and the per-provider routing makes it straightforward to build a two-stage pipeline.
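A two-stage pipeline along these lines might look like the sketch below. Both callables are hypothetical stand-ins: `vector_top_k` for a Milvus-backed search that returns document IDs, and `tree_search` for a PageIndex search restricted to a single document. Neither is an existing function in our system or in PageIndex.

```python
from typing import Callable, List


def hybrid_retrieve(
    query: str,
    vector_top_k: Callable[[str, int], List[str]],
    tree_search: Callable[[str, str], List[str]],
    k: int = 5,
) -> List[str]:
    """Stage 1: vector search picks k candidate documents.
    Stage 2: LLM tree search runs only over those candidates."""
    candidate_ids = vector_top_k(query, k)  # cheap, milliseconds
    passages: List[str] = []
    for doc_id in candidate_ids:            # k LLM calls instead of N
        passages.extend(tree_search(doc_id, query))
    # May be empty: the tree search can still reject every candidate.
    return passages
```

The design choice here is that the vector stage only decides which documents are worth an LLM call, while the tree stage keeps the final say on relevance. That preserves PageIndex's ability to return nothing, but it also means recall is capped by whatever the vector stage puts in the top k.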
Conclusion
PageIndex represents a genuinely different philosophy of information retrieval: replace approximate similarity matching with explicit LLM reasoning. Our evaluation shows that this philosophy has real strengths (precision, selectivity, structural understanding) and real limitations (O(N) latency, impractical for large corpora, heavy compute requirements).
The key findings:
- PageIndex query time scales linearly with corpus size. At 18 documents, completed queries averaged 9.6 minutes. The linear growth is architectural: faster hardware lowers the cost per document but not the number of LLM calls.
- In our run, when PageIndex found something, it was right. Its single successful retrieval was more semantically accurate than any vector provider's result for the same query.
- When nothing is relevant, PageIndex says so. Vector providers always return something; PageIndex exercises judgment. Whether this is a feature or a limitation depends on your application.
- Storage is dramatically lower. 77% less in total than a vector provider (about 69% less per entry), because there are no float32 vectors to store.
- Ingest time scales with document complexity, not file size. A structurally rich 15 KB article took nearly 8 minutes; a simple 446 KB PPTX took 7 seconds.
The embedding model comparison is now a three-way conversation: fast-and-approximate (vector RAG), slow-and-precise (PageIndex), or – most likely for production systems – a hybrid of both. The architecture we have built supports all three approaches through the same API, making it straightforward to evaluate which combination works best for your specific documents and queries.
Vector RAG is not going away. But PageIndex demonstrates that there is more to retrieval than cosine distance, and that LLMs can contribute to the search process itself, not just the generation step that follows it.
Evaluated on: NVIDIA RTX 5090 (32 GB VRAM), Qwen3-32B via Ollama (PageIndex), Milvus Standalone (vector providers), 18-document news corpus (MD/DOCX/PPTX/PDF), 16 queries across 9 topic categories. PageIndex completed 7 of 16 queries within the 10-minute timeout.
Disclaimer: Yes, I have used AI to help with the structure and flow of my write-up - I'm an engineer after all! The tests, knowledge and findings are all real-world research and can be verified and reproduced on demand.