How Small Can You Go? Testing PageIndex with a Sub-1B Parameter Model

Pushing the lower boundary: a 0.8B parameter model cannot reliably perform PageIndex's structured tree-based retrieval, which places the practical minimum somewhere between 0.8B and 4B parameters.


In our previous experiment, we found that shrinking PageIndex's LLM from Qwen3-32B (20 GB) to Qwen3.5-4B (3.4 GB) improved speed, retrieval quality, and VRAM usage all at once. The smaller model was better on nearly every dimension we measured; the one exception was raw query completion (5/16 versus 7/16).

The natural follow-up question: how far down can we go? If 4B is better than 32B, is 0.8B better still? Qwen3.5-0.8b (1.0 GB) has 5x fewer parameters than the 4B model and 40x fewer than the 32B model. This experiment tests whether a sub-1B parameter model can still perform the structured reasoning tasks that PageIndex requires: parsing documents into hierarchical trees, generating summaries for each node, and matching queries against those summaries at retrieval time.

The answer is unambiguous: no, it cannot.

The Experiment

What Changed

| | 32B Run | 4B Run | This Run |
|---|---|---|---|
| PageIndex model | Qwen3-32B (20 GB) | Qwen3.5-4B (3.4 GB) | Qwen3.5-0.8b (1.0 GB) |
| Model generation | Qwen3 | Qwen3.5 | Qwen3.5 |
| Vector providers | Same | Same | Same |
| Corpus | 18 documents | 18 documents | 18 documents |
| Queries | 16 queries | 16 queries | 16 queries |
| Hardware | RTX 5090 | RTX 5090 | RTX 5090 |
| Upload timeout | N/A | 15 min | 15 min |
| Query timeout | 10 min | 10 min | 10 min |

Same corpus, same queries, same hardware, same vector provider data. Only the PageIndex model changed.

Results

Ingest: Catastrophic Failure

The 4B model ingested all 18 documents. The 0.8B model ingested 7.

| Metric | Qwen3-32B | Qwen3.5-4B | Qwen3.5-0.8b |
|---|---|---|---|
| Documents ingested | 18/18 | 18/18 | 7/18 (39%) |
| Total ingest time | ~3,500s | 961s | 10,213s |
| Ingest failures | 0 | 0 | 11 |
| Avg time (successful) | ~194s | 53s | 44.7s |
| Total nodes | 85 | 68 | 17 |

The 7 successful documents were mostly tiny files:

| Document | Format | Size | 0.8B Time | 4B Time | Nodes |
|---|---|---|---|---|---|
| AI boom article | MD | 0.6 KB | 0.1s | <1s | 2 |
| Oscar stunt design | MD | 10 KB | 312.4s | 8.7s | 3 |
| Cortisol supplements | PPTX | 6 KB | <1s | <1s | 2 |
| View from nowhere | MD | 0.7 KB | 0.1s | <1s | 3 |
| Funny scene extraction | DOCX | 41 KB | 0.1s | <1s | 1 |
| 3D perception paper | MD | 1.7 KB | <1s | <1s | 3 |
| Pritzker/Epstein | MD | 0.9 KB | 0.1s | <1s | 3 |

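The pattern is stark: Markdown files under ~2 KB succeeded instantly, as did a 6 KB PPTX and a 41 KB DOCX (formats whose on-disk size likely overstates how much text they contain). One 10 KB Markdown file succeeded after 312 seconds (36x slower than the 4B model). Everything else, including a 3.6 KB Markdown file, timed out after 15 minutes.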

The 11 failures include documents that the 4B model processed in seconds:

| Document | Format | Size | 0.8B Result | 4B Time |
|---|---|---|---|---|
| Stock moves | MD | 3.6 KB | TIMEOUT | 28s |
| Muppet figures | DOCX | 212 KB | TIMEOUT | 18s |
| Eye health | MD | 13.9 KB | TIMEOUT | 37s |
| American censorship | PDF | 174 KB | TIMEOUT | 172s |
| Vonn surgery | MD | 4.1 KB | TIMEOUT | 10s |
| Iran war economy | MD | 14.7 KB | TIMEOUT | 15s |
| Juventus/Roma | PPTX | 42.8 KB | TIMEOUT | 22s |
| Callahan/Giants | MD | 4.7 KB | TIMEOUT | 7s |
| Nothing/anything | PDF | 330 KB | TIMEOUT | 584s |
| WiFi security | MD | 9.2 KB | TIMEOUT | 21s |
| Marathon costs | MD | 11.3 KB | TIMEOUT | 37s |

The WiFi security article – the star of the 4B benchmark, where PageIndex found a connection between "Firefox privacy features" and VPN/browser security that every vector provider missed – could not even be indexed.
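To make the size pattern easier to inspect, here is a small sketch that tabulates only the Markdown files from the two tables above, since their on-disk size is the closest proxy for text length. The `MdRun` type and variable names are illustrative and not part of PageIndex; the sizes and outcomes are copied from the tables.

```python
from typing import NamedTuple


class MdRun(NamedTuple):
    name: str
    size_kb: float
    succeeded: bool


# Markdown documents only, taken from the success and failure tables.
MD_RUNS = [
    MdRun("AI boom article", 0.6, True),
    MdRun("View from nowhere", 0.7, True),
    MdRun("Pritzker/Epstein", 0.9, True),
    MdRun("3D perception paper", 1.7, True),
    MdRun("Oscar stunt design", 10.0, True),  # succeeded, but took 312.4s
    MdRun("Stock moves", 3.6, False),
    MdRun("Vonn surgery", 4.1, False),
    MdRun("Callahan/Giants", 4.7, False),
    MdRun("WiFi security", 9.2, False),
    MdRun("Marathon costs", 11.3, False),
    MdRun("Eye health", 13.9, False),
    MdRun("Iran war economy", 14.7, False),
]

# Smallest Markdown file that timed out.
smallest_timeout_kb = min(r.size_kb for r in MD_RUNS if not r.succeeded)

# Successes on either side of that size.
below = [r.name for r in MD_RUNS if r.succeeded and r.size_kb < smallest_timeout_kb]
at_or_above = [r.name for r in MD_RUNS if r.succeeded and r.size_kb >= smallest_timeout_kb]

print(f"Smallest Markdown timeout: {smallest_timeout_kb} KB")
print(f"Successes below that size: {len(below)}")
print(f"Successes at or above it: {at_or_above}")
```

Four of the five Markdown successes sit below the smallest timeout (3.6 KB); the 10 KB file is the single exception, and it needed over five minutes.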

Why It Fails: Malformed JSON in a Retry Loop

The server logs reveal the failure mechanism. PageIndex requires the LLM to produce structured JSON output: tree nodes with summaries, hierarchical relationships, and relevance scores. The 0.8B model generates text that PageIndex's parser cannot interpret as valid JSON. The OpenAI client retries the request, the model fails again, and the cycle continues until the timeout.

15:11:08 INFO  PageIndex: converting 'stocks-making-...md' to Markdown
15:21:09 INFO  Retrying request to /chat/completions in 0.385s
15:26:08 INFO  [next document starts -- previous timed out]

The log shows a 10-minute gap between the start of processing and the first logged retry, and the next document starts five minutes later, when the 15-minute upload timeout fires. The 0.8B model spent the entire timeout window producing output that could not be parsed. This is not a speed problem. It is a capability problem.
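One way to keep this kind of failure from consuming a full timeout window is to validate the model's output and cap the number of attempts. The following is a minimal sketch of that idea, not PageIndex's actual code: the function names, the `MalformedOutputError` type, and the `title`/`nodes` keys are all hypothetical stand-ins for whatever schema the real parser expects.

```python
import json
from typing import Callable, Optional, Tuple


class MalformedOutputError(ValueError):
    """Raised when no attempt produced parseable, well-shaped JSON."""


def parse_tree_json(raw: str) -> Optional[dict]:
    """Return the parsed object if it looks like a tree node, else None.

    The required keys ("title", "nodes") are an assumed schema for
    illustration only.
    """
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    if "title" not in data or not isinstance(data.get("nodes"), list):
        return None
    return data


def generate_with_bounded_retries(
    generate: Callable[[], str], max_attempts: int = 3
) -> Tuple[dict, int]:
    """Call `generate` up to `max_attempts` times and return (tree, attempts).

    Fails fast with MalformedOutputError instead of looping until a
    wall-clock timeout.
    """
    last_raw = ""
    for attempt in range(1, max_attempts + 1):
        last_raw = generate()
        parsed = parse_tree_json(last_raw)
        if parsed is not None:
            return parsed, attempt
    raise MalformedOutputError(
        f"no valid JSON after {max_attempts} attempts; last output began: {last_raw[:60]!r}"
    )
```

With a cap like this, a model that cannot produce valid JSON fails after a few generations rather than after 15 minutes, which also makes the failure reason visible in the logs.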

Query: Complete Failure

With only 7 documents indexed (17 nodes), even the query phase should have been fast. It was not.

| Metric | Qwen3-32B | Qwen3.5-4B | Qwen3.5-0.8b |
|---|---|---|---|
| Queries completed | 7/16 | 5/16 | 0/16 |
| Queries timed out | 9/16 | 11/16 | 16/16 |
| Avg query time (completed) | 575s | 305s | N/A |
| Total query time | ~8,000s | 8,146s | 9,633s |

Zero queries completed. Every one of the 16 queries timed out after 10 minutes. In this setup, the 0.8B model could not perform query-time reasoning against PageIndex's tree structures, even for a 7-document corpus with only 17 nodes.

For comparison, the vector providers were unaffected:

| Provider | Avg Latency | Queries Completed |
|---|---|---|
| ollama | 1,692ms | 16/16 |
| pplx | 77ms | 16/16 |
| pplx-ctx | 249ms | 16/16 |
| pageindex (0.8B) | N/A | 0/16 |

Tree Structure

The 7 successfully indexed documents produced 17 nodes, compared to 68 nodes from 18 documents with the 4B model and 85 with the 32B model. The average of 2.4 nodes per document (vs 3.8 with 4B) suggests the 0.8B model generates very shallow trees, likely because it struggles to decompose documents into meaningful hierarchical sections. The comparison is confounded, however: the documents that succeeded were also among the shortest in the corpus, and short documents have fewer sections to begin with.

| Metric | Qwen3-32B | Qwen3.5-4B | Qwen3.5-0.8b |
|---|---|---|---|
| Total nodes | 85 | 68 | 17 |
| Avg nodes per doc | 4.7 | 3.8 | 2.4 |
| Storage | 0.130 MB | 0.130 MB | 0.019 MB |
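The per-document averages can be reproduced from the numbers already reported. This short check uses the per-document node counts from the ingest table for the 0.8B run and the totals for the other two runs; the variable names are illustrative.

```python
# Node counts for the 7 documents the 0.8B model indexed (from the ingest table).
nodes_08b = [2, 3, 2, 3, 1, 3, 3]

total_08b = sum(nodes_08b)
avg_08b = total_08b / len(nodes_08b)

# The 4B and 32B runs indexed all 18 documents.
avg_4b = 68 / 18
avg_32b = 85 / 18

print(f"0.8B: {total_08b} nodes, {avg_08b:.1f} per doc")
print(f"4B:   68 nodes, {avg_4b:.1f} per doc")
print(f"32B:  85 nodes, {avg_32b:.1f} per doc")
```

Note that the 0.8B average is taken over 7 documents while the other two are taken over 18, so the denominators differ along with the models.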

The Three-Model Picture

Putting all three experiments together reveals a clear capability cliff:

| Dimension | Qwen3-32B (20 GB) | Qwen3.5-4B (3.4 GB) | Qwen3.5-0.8b (1.0 GB) |
|---|---|---|---|
| Ingest success | 18/18 (100%) | 18/18 (100%) | 7/18 (39%) |
| Ingest time | ~3,500s | 961s | 10,213s* |
| Query completion | 7/16 (44%) | 5/16 (31%) | 0/16 (0%) |
| Relevant results | 1/7 | 5/5 | N/A |
| Avg query (completed) | 9.6 min | 5.1 min | N/A |
| Total nodes | 85 | 68 | 17 |
| VRAM | 20 GB | 3.4 GB | 1.0 GB |

*Dominated by 11 timeout failures at 15 min each.

The progression from 32B to 4B was "faster and better." The progression from 4B to 0.8B is "non-functional." This is not a gradual degradation – it is a cliff.

Analysis

Structured Output Is the Bottleneck

PageIndex does not need a large model for general intelligence. It needs a model that can reliably produce structured JSON conforming to a specific schema. The 4B model can do this. The 0.8B model cannot. The failure mode is not "worse quality answers" – it is "no parseable output at all."

This makes intuitive sense. Structured output generation (following a schema, matching brackets, maintaining consistent key names) is a learned capability that appears to require a minimum model capacity. Sub-1B models can often handle simple text generation but tend to be weak at constrained output formats, especially JSON with nested structures.

The Size/Capability Curve Is Not Linear

If capability degraded smoothly with model size, the 0.8B model, with one-fifth the parameters of the 4B model, would be somewhat worse but still functional. Instead, we see a sharp transition:

  • 32B: Works, slow, conservative in relevance judgments
  • 4B: Works, faster, better relevance judgments
  • 0.8B: Does not work

The minimum viable model for PageIndex lies somewhere between 0.8B and 4B parameters. Without testing intermediate sizes (roughly the 1B to 3B range), we cannot pinpoint the exact threshold, but the practical recommendation is clear: do not go below 4B.

Speed Without Function Is Not Speed

The 0.8B model is fast when it works: six of the seven successfully indexed documents completed in under a second. But "fast at 39% of documents and 0% of queries" is not a speed improvement. The total benchmark time was 5.5 hours (10,213s uploads + 9,633s queries), longer than the 4B benchmark despite processing fewer documents, because timeout failures consume maximum time while producing no value.
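The arithmetic behind that claim can be laid out as a short calculation. It uses only figures reported in this article (timeouts, failure counts, and measured totals); the variable names are illustrative.

```python
UPLOAD_TIMEOUT_S = 15 * 60  # 15-minute upload timeout
QUERY_TIMEOUT_S = 10 * 60   # 10-minute query timeout

# Measured totals for the 0.8B run.
ingest_total_s = 10_213
query_total_s = 9_633

# Time spent waiting on operations that ultimately timed out.
upload_timeout_s = 11 * UPLOAD_TIMEOUT_S  # 11 failed uploads
query_timeout_s = 16 * QUERY_TIMEOUT_S    # 16 timed-out queries

total_s = ingest_total_s + query_total_s
timeout_share = (upload_timeout_s + query_timeout_s) / total_s

total_hours_08b = total_s / 3600
total_hours_4b = (961 + 8_146) / 3600  # 4B ingest + query totals

print(f"0.8B total: {total_hours_08b:.1f} h, {timeout_share:.0%} spent in timeouts")
print(f"4B total:   {total_hours_4b:.1f} h")
```

Roughly 98% of the 0.8B run's 5.5 hours was spent waiting for timeouts to expire, compared with a 4B run that finished in about 2.5 hours.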

The Firefox Finding Is Lost

The 4B model's most impressive result – connecting "Firefox privacy features" to a WiFi security article about VPN usage and browser HTTPS protections – required two things: (1) successfully indexing the WiFi article, and (2) successfully reasoning about the query at retrieval time. The 0.8B model failed at step 1. The article could not be indexed at all.

This underscores a key point: PageIndex's value proposition depends entirely on the LLM's reasoning capabilities. Strip those away and you have nothing – not even a degraded version of the approach, but a complete absence of results.

The Minimum Model Requirement

Our three experiments establish a practical boundary:

0.8B  ----X---- too small (doesn't work)
       ???
4B    ----✓---- works well (sweet spot)
       ???
32B   ----✓---- works, but slower and less effective

The minimum viable model for PageIndex with Ollama-served Qwen models lies above 0.8B and at or below 4B parameters. For production use, 4B is the recommended minimum: it provides the best balance of speed, quality, and resource usage we have observed.

Practical Guidance

  1. Do not use sub-1B models for PageIndex. They cannot reliably generate the structured JSON output that PageIndex requires for tree building and query matching.
  2. 4B is the sweet spot. It outperformed 32B on speed and relevance while using 6x less VRAM, although it completed fewer queries (5/16 vs 7/16). Going smaller does not save meaningful resources (1 GB vs 3.4 GB VRAM) but costs you all functionality.
  3. The failure mode we observed is binary, not gradual. The 0.8B model did not return lower-quality results; it returned no usable results. We did not test sizes between 0.8B and 4B, so a degraded-quality regime may exist there, but do not count on one. Plan accordingly.
  4. If you need to go smaller than 4B, test exhaustively before deploying. The cliff between "works" and "doesn't work" is steep, and you will not get a graceful degradation warning.
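For the last point, a pre-deployment smoke test could look something like the sketch below. It is an assumption-laden outline rather than a PageIndex feature: `model_call` is a hypothetical callable that sends one document to the candidate model and returns its raw text, `SmokeResult` and the 60-second budget are invented for illustration, and "valid" here only means "parses as a JSON object", which is weaker than PageIndex's real schema.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SmokeResult:
    total: int
    valid: int
    over_budget: int

    @property
    def valid_rate(self) -> float:
        return self.valid / self.total if self.total else 0.0


def smoke_test(
    model_call: Callable[[str], str],
    documents: Sequence[str],
    per_doc_budget_s: float = 60.0,
) -> SmokeResult:
    """Count how many documents yield a JSON object within the time budget.

    The budget is checked after the call returns; this sketch does not
    interrupt a call that is still running.
    """
    valid = 0
    over_budget = 0
    for doc in documents:
        start = time.monotonic()
        raw = model_call(doc)
        elapsed = time.monotonic() - start
        if elapsed > per_doc_budget_s:
            over_budget += 1
            continue
        try:
            parsed = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            continue
        if isinstance(parsed, dict):
            valid += 1
    return SmokeResult(total=len(documents), valid=valid, over_budget=over_budget)
```

Running this over a sample that spans the size range of the real corpus, and requiring a `valid_rate` close to 1.0 before switching models, would have flagged the 0.8B model long before a 5.5-hour benchmark did.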

Conclusion

We set out to answer "how small can you go?" and found a clear answer: not this small. Qwen3.5-0.8b (1.0 GB) cannot perform PageIndex's structured tree-based retrieval. It failed to index 61% of documents, failed 100% of queries, and consumed more total time than the functional 4B model while producing zero usable results.

The key findings:

  1. There is a hard capability floor for structured RAG. Below some threshold between 0.8B and 4B parameters, LLMs cannot reliably generate the JSON structures that PageIndex requires. This is a capability cliff, not a gradual slope.
  2. The failure mode is malformed output, not slow output. The 0.8B model is fast when producing text, but it cannot constrain that text to valid JSON schemas. PageIndex enters a retry loop until timeout.
  3. The 4B model remains the optimal choice. It is the smallest model we tested that actually works, and it outperforms the 32B model on both speed and quality.
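  4. Document size correlates with failure, though not perfectly. Markdown files under ~2 KB succeeded and almost every Markdown file of 3.6 KB or more timed out; the one exception, a 10 KB file, succeeded only after 312 seconds. This suggests the 0.8B model can handle very short structured prompts but loses coherence with longer contexts.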
  5. VRAM savings are negligible compared to the loss. Going from 3.4 GB (4B) to 1.0 GB (0.8B) saves 2.4 GB of VRAM. That saving buys you a system that does not work.

The three-article progression (32B to 4B to 0.8B) sketches the landscape of model size vs. PageIndex performance at three points. The answer is a satisfying inverted U: too large is slow, too small is broken, and the sweet spot is surprisingly modest at 4B parameters.


Evaluated on: NVIDIA RTX 5090 (32 GB VRAM), Qwen3.5-0.8b via Ollama (1.0 GB), Milvus Standalone (vector providers), 18-document news corpus (7 successfully indexed), 16 queries across 9 topic categories. PageIndex completed 0 of 16 queries within the 10-minute timeout. Total benchmark duration: approximately 5.5 hours.

Disclaimer: Yes, I have used AI to help with the structure and flow of my write-up. I'm an engineer, after all! The tests, knowledge, and findings are all real-world research and can be verified and reproduced on demand.