Optimizing PageIndex: From 10-Minute Queries to a 2.6x Speedup with Meta-Index Pre-Filtering

The table-of-contents approach: how a lightweight document index and parallel execution cut PageIndex query times by more than half – and improved recall in the process.


In our first PageIndex evaluation, we found that vectorless RAG delivered impressive precision at a devastating cost: 9.6-minute average query times with Qwen3-32B. Our follow-up showed that swapping to Qwen3.5-4B cut that to 5.1 minutes per query, but the fundamental problem remained. PageIndex makes one LLM call per document per query. With 16 documents, that is 16 sequential LLM calls. With 500 documents, it would take hours.

The architecture is O(N) in the number of documents. No amount of model shrinking changes that.

This article describes three optimizations we implemented to break the linear scaling, and the benchmark results that followed.

The Problem

PageIndex's query path, as originally implemented, worked like this:

For each indexed document (all N):
    1. Load tree from disk
    2. Strip text, keep structure + summaries
    3. Send to LLM: "which nodes are relevant to this query?"
    4. Parse response, extract text from selected nodes
Return all hits, sorted by rank

Every document gets an LLM call, regardless of whether it could possibly contain relevant content. A query about "inherited diseases" still sends the tree structure of a sports commentary article to the LLM, waits for the model to reason about it, and receives the expected "nothing relevant" response. This is the equivalent of reading every book in a library to answer a question, rather than checking the card catalogue first.
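
As a rough illustration, the loop above might look like the following in Python. This is a minimal sketch, not PageIndex's actual code: the function names, the `ask_llm` callback (a stand-in for the real model call that returns the IDs of relevant nodes), and the tree layout (`nodes` with `id`, `title`, `summary`, `text`) are all assumptions, and ranking is omitted.

```python
from typing import Callable

# Hypothetical tree layout: {"nodes": [{"id", "title", "summary", "text"}, ...]}
Tree = dict
# Stand-in for the LLM call: (query, stripped tree) -> list of relevant node IDs.
AskLLM = Callable[[str, dict], list]


def strip_text(tree: Tree) -> dict:
    """Keep structure and summaries, drop the full text of every node."""
    return {"nodes": [{k: v for k, v in n.items() if k != "text"} for n in tree["nodes"]]}


def query_all_documents(query: str, trees: dict, ask_llm: AskLLM) -> list:
    """Original path: one LLM call per indexed document, relevant or not."""
    hits = []
    for doc_id, tree in trees.items():
        selected = set(ask_llm(query, strip_text(tree)))
        for node in tree["nodes"]:
            if node["id"] in selected:
                hits.append({"doc_id": doc_id, "node_id": node["id"], "text": node["text"]})
    return hits
```

The point of the sketch is the shape of the loop: `ask_llm` runs once per entry in `trees`, so the number of model calls equals the number of documents.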

With Qwen3.5-4B, each LLM call takes approximately 12-15 seconds. For 16 documents, the minimum query time is approximately 200 seconds – over 3 minutes of serial LLM inference, most of it wasted on irrelevant documents.

The Optimizations

We implemented three complementary strategies, each targeting a different layer of the problem.

1. Meta-Index Pre-Filtering (The Table of Contents)

The core insight: PageIndex already builds a rich tree structure at ingest time, containing titles, section headings, and summaries for every document. We can extract a lightweight "table of contents" from each tree and store it as a meta-index – a single JSON file that describes the entire corpus.

At query time, instead of searching every document, we first present the meta-index to the LLM in a single call:

"Here are 16 documents with their titles, summaries, and section headings.
Which 5 are most likely to contain information about {query}?"

The LLM selects the most promising candidates. Only those candidates proceed to the full tree search.

The meta-index structure:

{
  "2ee0bd3ca3ab": {
    "filename": "who-will-pay-for-the-trillion-dollar-ai-boom.md",
    "title": "Who will pay for the trillion-dollar AI boom?",
    "summary": "# Who will pay for the trillion-dollar AI boom?\n\n**** | 2025-07-31...",
    "top_sections": ["Summary: ## Summary\n\nThe article...", "Full Article: Error"]
  },
  "e88b835a1280": {
    "filename": "the-iran-war-is-jeopardizing-the-entire-global-economy.md",
    "title": "The Iran War Is Jeopardizing the Entire Global Economy",
    "summary": "# The Iran War Is Jeopardizing...",
    "top_sections": ["Summary: ## Summary\n\nThe Iran war...", "Full Article: On Sunday..."]
  }
}

For our 16-document corpus, the meta-index is 11.4 KB – small enough to fit comfortably in the LLM's context window alongside the query. The meta-index is rebuilt automatically whenever documents are added or removed, so it is always current.
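
Building such a file from the per-document trees could be sketched as follows. The output keys (`filename`, `title`, `summary`, `top_sections`) follow the structure shown above; the input tree layout, the function names, and the truncation limits are assumptions made for illustration.

```python
import json


def _truncate(text: str, limit: int) -> str:
    """Shorten text to `limit` characters, marking the cut with '...'."""
    return text if len(text) <= limit else text[:limit] + "..."


def build_meta_index(trees: dict, summary_chars: int = 200, max_sections: int = 5) -> dict:
    """Extract a per-document 'table of contents' entry from each tree.

    Assumed (hypothetical) tree layout: {"filename", "title", "summary",
    "nodes": [{"title", "summary"}, ...]}.
    """
    meta_index = {}
    for doc_id, tree in trees.items():
        sections = [
            f"{node['title']}: {_truncate(node.get('summary', ''), summary_chars)}"
            for node in tree.get("nodes", [])[:max_sections]
        ]
        meta_index[doc_id] = {
            "filename": tree.get("filename", ""),
            "title": tree.get("title", ""),
            "summary": _truncate(tree.get("summary", ""), summary_chars),
            "top_sections": sections,
        }
    return meta_index


def save_meta_index(meta_index: dict, path: str) -> None:
    """Write the meta-index as a single JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(meta_index, f, ensure_ascii=False, indent=2)
```

Calling `build_meta_index` again after every ingest or delete is the simplest way to keep the file current; for a corpus this small, a full rebuild costs far less than a single LLM call.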

LLM call reduction:

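| Stage              | Original                 | With Meta-Index                 |
|--------------------|--------------------------|---------------------------------|
| Document selection | --                       | 1 LLM call                      |
| Tree search        | 16 calls (all documents) | Up to 5 calls (candidates only) |
| Total              | 16 calls                 | Max 6 calls                     |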

This is a 2.7x reduction in LLM calls per query. The meta-index selection call itself is fast because it processes only titles and short summaries, not full tree structures.
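
The selection step described above could be sketched as two small helpers: one that renders the meta-index into a prompt, and one that extracts document IDs from the model's reply. Both function names, the prompt layout, and the "answer with a JSON array" convention are assumptions; the actual prompt and response format may differ.

```python
import json
import re


def build_selection_prompt(query: str, meta_index: dict, max_candidates: int = 5) -> str:
    """Render the meta-index as a compact listing followed by the question."""
    lines = [
        f"Here are {len(meta_index)} documents with their titles, summaries, "
        "and section headings."
    ]
    for doc_id, entry in meta_index.items():
        lines.append(f"id={doc_id} | {entry['title']} | {entry['summary']}")
        lines.extend(f"    - {section}" for section in entry.get("top_sections", []))
    lines.append(
        f"Which {max_candidates} are most likely to contain information about {query!r}? "
        "Answer with a JSON array of document IDs."
    )
    return "\n".join(lines)


def parse_selection(response: str, meta_index: dict, max_candidates: int = 5) -> list:
    """Pull the first JSON array out of the reply and keep only known IDs."""
    match = re.search(r"\[.*?\]", response, re.DOTALL)
    if match is None:
        raise ValueError("no JSON array in LLM response")
    ids = json.loads(match.group(0))  # raises ValueError on malformed JSON
    known = [i for i in ids if isinstance(i, str) and i in meta_index]
    # Preserve the model's ordering, drop duplicates, enforce the cap.
    return list(dict.fromkeys(known))[:max_candidates]
```

Filtering against the keys of `meta_index` guards against the model inventing IDs, and raising on an unparseable reply lets the caller decide what to do when selection fails.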

2. Parallel Tree Search

Once the meta-index narrows the candidate set, the remaining tree searches are independent – they do not depend on each other's results. We execute them concurrently using a thread pool.

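from concurrent.futures import ThreadPoolExecutor, as_completed

hits = []
# Two workers: Ollama serializes inference, so more threads would only queue.
with ThreadPoolExecutor(max_workers=2) as executor:
    # Map each future back to the document it is searching.
    futures = {
        executor.submit(_search_single_document, doc_id, query_text): doc_id
        for doc_id in candidate_doc_ids
    }
    # Collect results as each search finishes, in completion order.
    for future in as_completed(futures):
        hits.extend(future.result())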

We cap the thread pool at 2 workers rather than scaling to the candidate count. This is deliberate: Ollama serializes LLM inference requests for the same model, so flooding it with concurrent requests does not improve throughput. Two workers provide a modest pipeline benefit (the next request starts while the previous response is being parsed) without overwhelming the inference server.

The parallelism becomes more meaningful with inference servers that support true concurrent requests (such as vLLM or TGI with batched inference). With Ollama, the speedup from parallelism alone is marginal – the real win comes from having fewer documents to search.

3. In-Memory Caching

The original implementation read tree JSON files from disk on every query. For 16 documents queried 4 times each, that is 64 file reads. We added a module-level cache:

_tree_cache: dict[str, dict] = {}   # document_id -> tree data
_meta_cache: dict[str, dict] = {}   # document_id -> metadata

Trees are loaded from disk once and cached in memory for subsequent queries. The cache is invalidated when a document is deleted and populated on ingest. For our 16-document corpus (130 KB of tree data), the memory overhead is negligible.

This optimization is invisible in single-query benchmarks but matters for sustained workloads where the same corpus is queried repeatedly.

4. Hybrid Vector + PageIndex (Bonus)

We also implemented a hybrid mode that replaces the meta-index LLM call with a Milvus vector search:

Stage 1: Embed query → Milvus COSINE search → top-K document IDs (milliseconds)
Stage 2: PageIndex tree search on those documents only (seconds)

This combines vector recall (fast, approximate) with PageIndex precision (slow, exact). The vector search takes 30-50ms to find candidate documents, then PageIndex spends its time budget only on documents that are already known to be semantically relevant.

This mode requires Milvus to be running and documents to be ingested into both a vector collection and PageIndex. We benchmarked this separately but focus on the meta-index approach in this article, since it works without any external dependencies.
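
The two-stage flow can be illustrated without a running vector database. In the sketch below, a pure-Python cosine ranking stands in for the Milvus search, and `tree_search` is a placeholder callback for the PageIndex search on one document; all names are hypothetical, and a real deployment would query Milvus in stage 1.

```python
import math
from typing import Callable


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_prefilter(query_vec: list, doc_vecs: dict, top_k: int) -> list:
    """Stage 1 stand-in: rank documents by cosine similarity, keep the top K IDs."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:top_k]


def hybrid_query(query: str, query_vec: list, doc_vecs: dict,
                 tree_search: Callable[[str, str], list], top_k: int = 5) -> list:
    """Stage 2: run the expensive tree search only on the pre-filtered documents."""
    hits = []
    for doc_id in vector_prefilter(query_vec, doc_vecs, top_k):
        hits.extend(tree_search(doc_id, query))
    return hits
```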

Graceful Fallback

The meta-index selection is a single LLM call, and LLM calls can fail – malformed JSON responses, timeouts, or model unavailability. We added a fallback: if the meta-index selection fails for any reason, the system silently reverts to searching all documents. This ensures the optimization never makes things worse; it either helps or gets out of the way.

try:
    selected = _select_documents(query_text, meta_index)
    if not selected:
        return list(meta_index.keys())  # empty selection: fall back to all docs
    return selected  # normal case: search only the chosen candidates
except Exception:
    return list(meta_index.keys())  # selection failed: fall back to all docs

The Benchmark

Methodology

We compared the original and optimized query paths on the same 16-document corpus from our previous benchmarks, using the same Qwen3.5-4B model on the same hardware. The benchmark calls the query functions directly (no HTTP server) to eliminate network overhead and timeout complications.

| Parameter         | Value                                                  |
|-------------------|--------------------------------------------------------|
| Model             | Qwen3.5-4B (3.4 GB VRAM)                               |
| Documents         | 16 (news articles, mixed formats)                      |
| Queries           | 4 (representative subset)                              |
| Meta-index        | Enabled, max 5 candidates                              |
| Parallel search   | Enabled, 2 workers                                     |
| Hardware          | NVIDIA RTX 5090 (32 GB VRAM)                           |
| Ollama keep-alive | 60 minutes (prevents model unloading between queries)  |

We used 4 queries rather than the full 16 from previous benchmarks. Each query requires 16+ LLM calls in the original path, and the benchmark runs both paths sequentially. With the 4B model averaging 12-15 seconds per LLM call, each query takes 3-13 minutes in the original path. Running all 16 queries through both paths would take over 4 hours. Four representative queries across different topics gave us reliable comparison data in a manageable timeframe.

Queries tested:

  • "AI military risks" – broad topic, multiple potentially relevant documents
  • "Clouds in Antarctica" – specific topic, likely no match in corpus
  • "Inherited diseases" – science/health, one strong match
  • "Trump AI executive orders" – politics/tech intersection

Results

Latency

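| Query                     | Original    | Optimized | Speedup |
|---------------------------|-------------|-----------|---------|
| AI military risks         | 253,963ms   | 171,986ms | 1.5x    |
| Clouds in Antarctica      | 195,813ms   | 137,494ms | 1.4x    |
| Inherited diseases        | 194,608ms   | 458,333ms | 0.4x*   |
| Trump AI executive orders | 795,831ms   | 145,275ms | 5.5x    |
| Average                   | 599,413ms   | 228,272ms | 2.6x    |
| Total                     | 2,397,652ms | 913,088ms | 2.6x    |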

*The "Inherited diseases" query was penalized by a meta-index timeout: the LLM call to select candidates timed out at 120 seconds, triggering the fallback to search all 16 documents. The tree search itself then ran sequentially across all documents. Excluding this outlier, the average speedup is 2.8x.
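
The per-query speedups, and the 2.8x figure that leaves out the fallback query, can be reproduced from the per-query latencies in the table. The short script below is only a worked check; it treats the outlier-excluded average as the mean of the three remaining per-query ratios, which is the reading that matches the reported number.

```python
# Latencies in milliseconds, copied from the table above: (original, optimized).
latencies_ms = {
    "AI military risks": (253_963, 171_986),
    "Clouds in Antarctica": (195_813, 137_494),
    "Inherited diseases": (194_608, 458_333),
    "Trump AI executive orders": (795_831, 145_275),
}

# Speedup = original latency divided by optimized latency.
speedups = {q: orig / opt for q, (orig, opt) in latencies_ms.items()}
for q, s in speedups.items():
    print(f"{q}: {s:.1f}x")

# Mean of the per-query ratios, leaving out the query that hit the fallback.
kept = [s for q, s in speedups.items() if q != "Inherited diseases"]
print(f"average excluding outlier: {sum(kept) / len(kept):.1f}x")
```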

The best result – "Trump AI executive orders" at 5.5x – demonstrates the full optimization pipeline working as intended. The meta-index correctly identified a small number of relevant documents, and the tree search completed quickly on just those candidates. The original path spent more than 13 minutes searching all 16 documents, most of which had nothing to do with AI policy.

Retrieval Quality

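| Query                     | Original Hits | Optimized Hits | Overlap    |
|---------------------------|---------------|----------------|------------|
| AI military risks         | 1             | 3              | 1/1 (100%) |
| Clouds in Antarctica      | 0             | 0              | --         |
| Inherited diseases        | 3             | 3              | 3/3 (100%) |
| Trump AI executive orders | 1             | 2              | 1/1 (100%) |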

Every hit returned by the original path was also returned by the optimized path. The optimized path additionally found results that the original path missed.

The "AI military risks" query is the clearest example. The original path returned one hit (from the AI boom article). The optimized path returned three: the same AI boom hit, plus two sections from "The Iran War Is Jeopardizing the Entire Global Economy" – an article about military conflict disrupting global infrastructure, including AWS data centers in the Gulf. This is a genuinely relevant connection: the Iran war article discusses the intersection of military conflict and technology infrastructure, which is directly relevant to "AI military risks."

The meta-index did not just find the same results faster – it found better results, because the document selection step itself performs a form of relevance reasoning. The LLM reviewing the meta-index identified the Iran war article as relevant to AI military risks based on its title and section summaries. The original path also sent that article's tree to the LLM, as one of 16 sequential searches, but returned no hits from it.

LLM Call Economics

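| Metric                  | Original | Optimized             |
|-------------------------|----------|-----------------------|
| LLM calls per query     | 16       | 1 + up to 5 = max 6   |
| Total calls (4 queries) | 64       | ~24                   |
| Time per call (avg)     | ~12s     | ~12s                  |
| Time in irrelevant docs | ~75%     | ~0%                   |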

The per-call cost is identical – same model, same hardware, same inference speed. The optimization is entirely structural: we eliminated the calls that were never going to return useful results.

Scaling Analysis

The optimization's value increases with corpus size. The meta-index adds a single LLM call per query however many documents are indexed, and that call reads titles and summaries, not full trees. As the corpus grows, that call becomes a smaller fraction of what the original path would have cost, while the savings from skipped tree searches grow.

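| Corpus Size   | Original (est.) | Optimized (est.) | Speedup |
|---------------|-----------------|------------------|---------|
| 16 documents  | 200s            | 75s              | 2.7x    |
| 50 documents  | 625s            | 75s              | 8.3x    |
| 100 documents | 1,250s          | 75s              | 16.7x   |
| 500 documents | 6,250s          | 75s              | 83.3x   |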

These estimates assume the meta-index always selects 5 candidates (the configured maximum) and that each tree search takes ~12 seconds. The meta-index selection call adds approximately 15 seconds regardless of corpus size. The key insight: optimized query time is constant relative to corpus size (bounded by the candidate limit), while original query time grows linearly.
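
The estimates in the table follow from a two-line cost model, sketched below. The 12.5 seconds per document on the original path is an assumption inferred from the table (200 seconds for 16 documents); the optimized path uses the ~12 seconds per tree search and ~15 seconds for selection stated above. The function names are illustrative.

```python
def original_seconds(n_docs: int, per_doc_s: float = 12.5) -> float:
    """Original path: one tree search per document."""
    return n_docs * per_doc_s


def optimized_seconds(n_docs: int, max_candidates: int = 5,
                      tree_search_s: float = 12.0, selection_s: float = 15.0) -> float:
    """Optimized path: one selection call plus at most `max_candidates` tree searches."""
    return selection_s + min(n_docs, max_candidates) * tree_search_s


for n in (16, 50, 100, 500):
    orig, opt = original_seconds(n), optimized_seconds(n)
    print(f"{n:>4} documents: {orig:>6.0f}s vs {opt:.0f}s -> {orig / opt:.1f}x")
```

Because `optimized_seconds` depends on `n_docs` only through `min(n_docs, max_candidates)`, it stops growing once the corpus exceeds the candidate limit, which is why the optimized column stays at 75 seconds.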

At 500 documents, the original path would take about 1 hour 45 minutes per query (6,250 seconds). The optimized path would still take approximately 75 seconds – the same as for 16 documents, because it only ever searches 5 of them.

This does not eliminate the O(N) problem entirely – the meta-index prompt grows with corpus size as document summaries accumulate. At some point (likely 200-500 documents depending on summary length), the meta-index itself exceeds the model's context window and would need to be hierarchically indexed. But for corpora up to a few hundred documents, the approach scales well.
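
A back-of-the-envelope check of that limit is sketched below. Everything except the 11.4 KB for 16 documents is an assumption: the ratio of roughly 4 bytes per token, the tokens reserved for the query and the reply, and the context sizes passed in are illustrative values, not measurements of any particular model.

```python
def meta_index_capacity(context_tokens: int, reserved_tokens: int = 2_000,
                        bytes_per_doc: float = 11.4 * 1024 / 16,
                        bytes_per_token: float = 4.0) -> int:
    """Rough number of documents whose meta-index entries fit in the context window."""
    tokens_per_doc = bytes_per_doc / bytes_per_token
    return int((context_tokens - reserved_tokens) // tokens_per_doc)


# Hypothetical context sizes, chosen only to show how the estimate moves.
for window in (32_768, 131_072):
    print(f"{window} tokens -> about {meta_index_capacity(window)} documents")
```

The estimate is sensitive to summary length: halving the bytes stored per document doubles the number of documents that fit.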

What We Learned

The Card Catalogue Analogy Is Exact

The meta-index works for exactly the same reason that library card catalogues work: a compact summary of what is available allows efficient routing without examining every item. PageIndex's tree structures already contain the summaries needed for this – we just were not using them at the corpus level.

The implementation cost was modest: approximately 50 lines of code for the meta-index builder, 40 lines for the document selection prompt, and 20 lines for the fallback logic. The hardest part was not the algorithm but the engineering around LLM reliability – handling JSON parse failures, timeouts, and model loading delays.

Parallel Execution Has Diminishing Returns with Ollama

Threading the tree searches provided minimal benefit because Ollama serializes inference for a single model. Requests queue rather than executing concurrently. The parallelism would matter with a multi-GPU setup, a batched inference server, or a cloud LLM API that handles concurrent requests natively.

We kept the parallel implementation because it costs nothing when it does not help and will provide immediate speedups if the inference backend is upgraded. The code is agnostic about whether the parallelism is exploited.

Model Keep-Alive Is a Hidden Variable

During benchmarking, we discovered that Ollama's default 5-minute model keep-alive caused severe performance inconsistency. Between queries, the model would be unloaded from VRAM, and the next query would pay a 60-120 second model loading penalty before inference could begin. This made our initial benchmark results unreliable – some queries appeared 10x slower purely due to model loading.

Setting keep_alive: "60m" via Ollama's API eliminated this variance. For any production deployment of PageIndex (or any Ollama-backed service with intermittent query patterns), configuring an appropriate keep-alive is essential.
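
As an illustration, a request to Ollama's generate endpoint with the keep-alive set might be built like this. The helper names and the 120-second timeout are assumptions, and the URL is Ollama's default local address; `generate` needs a running Ollama server, while `build_generate_request` only constructs the request.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_generate_request(prompt: str, model: str = "qwen3.5:4b",
                           keep_alive: str = "60m") -> urllib.request.Request:
    """Build a non-streaming generate request that keeps the model loaded afterwards."""
    payload = {"model": model, "prompt": prompt, "stream": False, "keep_alive": keep_alive}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout_s: float = 120.0) -> str:
    """Send the request and return the model's text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]
```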

The Meta-Index Is Not Free

The document selection call adds approximately 15 seconds of overhead. For a 16-document corpus where the original path takes 200 seconds, this is acceptable – the estimated query time still drops by 125 seconds, from 200 to 75, with the selection call included. But for a corpus of 5 or fewer documents, nothing can be filtered out, so the selection call is pure overhead. The system automatically skips the meta-index when the document count is at or below the candidate limit (configurable, default 5).
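
The skip rule is simple enough to state in code. In this sketch, `pick_candidates` is a hypothetical callback standing in for the meta-index selection call, and both function names are illustrative.

```python
def should_use_meta_index(doc_count: int, max_candidates: int = 5) -> bool:
    """Pre-filter only when it can remove at least one document from the search."""
    return doc_count > max_candidates


def select_search_set(doc_ids: list, pick_candidates, max_candidates: int = 5) -> list:
    """Return the documents to tree-search, skipping selection for small corpora."""
    if not should_use_meta_index(len(doc_ids), max_candidates):
        return list(doc_ids)  # no selection call: nothing could be filtered out
    return pick_candidates(doc_ids)[:max_candidates]
```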

Configuration

All optimizations are controlled via environment variables and are enabled by default:

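| Variable                 | Default    | Description                                |
|--------------------------|------------|--------------------------------------------|
| PAGEINDEX_META_INDEX     | true       | Enable meta-index pre-filtering            |
| PAGEINDEX_MAX_CANDIDATES | 5          | Maximum documents to search after filtering |
| PAGEINDEX_PARALLEL       | true       | Run tree searches concurrently             |
| PAGEINDEX_MODEL          | qwen3.5:4b | LLM for indexing and retrieval             |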

The original unoptimized query path is preserved as query_original() and exposed via a /search/pageindex-original endpoint for benchmarking. A hybrid mode combining vector pre-filtering with PageIndex tree search is available as the pageindex-hybrid provider.
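
Reading the four environment variables from the table with their documented defaults could look like the sketch below. The `PageIndexConfig` class, the `load_config` helper, and the set of accepted truthy spellings are assumptions; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else counts as false."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class PageIndexConfig:
    meta_index: bool
    max_candidates: int
    parallel: bool
    model: str


def load_config(env=os.environ) -> PageIndexConfig:
    """Read the four documented variables, falling back to the documented defaults."""
    return PageIndexConfig(
        meta_index=_as_bool(env.get("PAGEINDEX_META_INDEX", "true")),
        max_candidates=int(env.get("PAGEINDEX_MAX_CANDIDATES", "5")),
        parallel=_as_bool(env.get("PAGEINDEX_PARALLEL", "true")),
        model=env.get("PAGEINDEX_MODEL", "qwen3.5:4b"),
    )
```

Passing a plain dict as `env` makes the defaults easy to check without touching the process environment, for example `load_config({"PAGEINDEX_MAX_CANDIDATES": "3"})`.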

Conclusion

Three optimizations – meta-index pre-filtering, parallel tree search, and in-memory caching – reduced PageIndex query times by 2.6x on average, with peaks of 5.5x for queries where the meta-index effectively narrowed the candidate set. More importantly, the optimized path found all of the original results plus additional relevant hits, demonstrating that the meta-index selection step performs its own useful form of relevance reasoning.

The key findings:

  1. LLM call reduction is the primary lever. Cutting from 16 to 6 LLM calls per query accounts for nearly all of the speedup. The per-call cost is fixed by the model and hardware; the number of calls is an architectural choice.
  2. Optimized query time is approximately constant. Because the candidate limit caps the number of tree searches, query time does not grow with corpus size (up to the meta-index context window limit). This changes PageIndex from O(N) to effectively O(1) for corpora under a few hundred documents.
  3. Pre-filtering improves recall. The meta-index does not just save time – it focuses the LLM's attention on relevant documents, leading to more and better hits. The "AI military risks" query found 3 relevant sections with the meta-index versus 1 without it.
  4. Graceful fallback eliminates downside risk. When the meta-index call fails (timeout, bad JSON), the system falls back to searching all documents. The optimization never makes things worse.
  5. The remaining bottleneck is LLM inference speed. With a ~15-second selection call followed by 5 candidate documents at ~12 seconds each, the floor for optimized query time is approximately 75 seconds. Further improvement requires faster inference (larger GPUs, quantized models, batched serving) rather than architectural changes.

PageIndex remains slower than vector search by orders of magnitude. But where the original implementation was impractical for anything beyond a handful of documents, the optimized version is viable for corpora of up to several hundred documents – particularly in scenarios where precision matters more than latency, or where the query can run asynchronously while the user does other work.

The combination of a lightweight meta-index with targeted deep search mirrors how human experts navigate large document collections: skim the table of contents, identify the relevant chapters, then read carefully. It turns out this is a good strategy for LLMs too.


Evaluated on: NVIDIA RTX 5090 (32 GB VRAM), Qwen3.5-4B via Ollama (3.4 GB), 16-document news corpus, 4 queries. Original path: 2,398 seconds total. Optimized path: 913 seconds total. All original results preserved with additional relevant hits found. Meta-index file size: 11.4 KB for 16 documents.