How Small Can You Go? Testing PageIndex with a Sub-1B Parameter Model
Pushing the lower boundary: a 0.8B parameter model cannot reliably perform PageIndex's structured tree-based retrieval, which places the practical minimum somewhere between 0.8B and 4B parameters.
In our previous experiment, we found that shrinking PageIndex's LLM from Qwen3-32B (20 GB) to Qwen3.5-4B (3.4 GB) improved speed, retrieval quality, and VRAM usage. The smaller model was better on each of those dimensions.
The natural follow-up question: how far down can we go? If 4B is better than 32B, is 0.8B better still? Qwen3.5-0.8b (1.0 GB) has 5x fewer parameters than the 4B model and 40x fewer than the 32B model. This experiment tests whether a sub-1B parameter model can still perform the structured reasoning tasks that PageIndex requires: parsing documents into hierarchical trees, generating summaries for each node, and matching queries against those summaries at retrieval time.
The answer is unambiguous: no, it cannot.
The Experiment
What Changed
| | 32B Run | 4B Run | This Run |
|---|---|---|---|
| PageIndex model | Qwen3-32B (20 GB) | Qwen3.5-4B (3.4 GB) | Qwen3.5-0.8b (1.0 GB) |
| Model generation | Qwen3 | Qwen3.5 | Qwen3.5 |
| Vector providers | Same | Same | Same |
| Corpus | 18 documents | 18 documents | 18 documents |
| Queries | 16 queries | 16 queries | 16 queries |
| Hardware | RTX 5090 | RTX 5090 | RTX 5090 |
| Upload timeout | N/A | 15 min | 15 min |
| Query timeout | 10 min | 10 min | 10 min |
Same corpus, same queries, same hardware, same vector provider data. Only the PageIndex model changed.
Results
Ingest: Catastrophic Failure
The 4B model ingested all 18 documents. The 0.8B model ingested 7.
| Metric | Qwen3-32B | Qwen3.5-4B | Qwen3.5-0.8b |
|---|---|---|---|
| Documents ingested | 18/18 | 18/18 | 7/18 (39%) |
| Total ingest time | ~3,500s | 961s | 10,213s |
| Ingest failures | 0 | 0 | 11 |
| Avg time (successful) | ~194s | 53s | 44.7s |
| Total nodes | 85 | 68 | 17 |
The 7 successful documents were mostly tiny files:
| Document | Format | Size | 0.8B Time | 4B Time | Nodes |
|---|---|---|---|---|---|
| AI boom article | MD | 0.6 KB | 0.1s | <1s | 2 |
| Oscar stunt design | MD | 10 KB | 312.4s | 8.7s | 3 |
| Cortisol supplements | PPTX | 6 KB | <1s | <1s | 2 |
| View from nowhere | MD | 0.7 KB | 0.1s | <1s | 3 |
| Funny scene extraction | DOCX | 41 KB | 0.1s | <1s | 1 |
| 3D perception paper | MD | 1.7 KB | <1s | <1s | 3 |
| Pritzker/Epstein | MD | 0.9 KB | 0.1s | <1s | 3 |
The pattern is stark: markdown files under ~2 KB succeeded almost instantly, as did two Office files (a 6 KB PPTX and a 41 KB DOCX). One 10 KB markdown file succeeded after 312 seconds (36x slower than the 4B model). Everything else, including a 3.6 KB markdown file, timed out after 15 minutes.
The 11 failures include documents that the 4B model processed in seconds:
| Document | Format | Size | 0.8B Result | 4B Time |
|---|---|---|---|---|
| Stock moves | MD | 3.6 KB | TIMEOUT | 28s |
| Muppet figures | DOCX | 212 KB | TIMEOUT | 18s |
| Eye health | MD | 13.9 KB | TIMEOUT | 37s |
| American censorship | | 174 KB | TIMEOUT | 172s |
| Vonn surgery | MD | 4.1 KB | TIMEOUT | 10s |
| Iran war economy | MD | 14.7 KB | TIMEOUT | 15s |
| Juventus/Roma | PPTX | 42.8 KB | TIMEOUT | 22s |
| Callahan/Giants | MD | 4.7 KB | TIMEOUT | 7s |
| Nothing/anything | | 330 KB | TIMEOUT | 584s |
| WiFi security | MD | 9.2 KB | TIMEOUT | 21s |
| Marathon costs | MD | 11.3 KB | TIMEOUT | 37s |
The WiFi security article – the star of the 4B benchmark, where PageIndex found a connection between "Firefox privacy features" and VPN/browser security that every vector provider missed – could not even be indexed.
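The size pattern can be checked directly from the two ingest tables above. The following is a small sketch that looks only at the markdown files (so that container overhead in DOCX and PPTX does not blur the comparison) and asks whether a clean size cutoff separates successes from timeouts. The dictionary names are illustrative; the sizes are copied from the tables.

```python
# Markdown documents only, sizes in KB, copied from the two ingest tables.
succeeded = {
    "AI boom article": 0.6,
    "Oscar stunt design": 10.0,
    "View from nowhere": 0.7,
    "3D perception paper": 1.7,
    "Pritzker/Epstein": 0.9,
}
timed_out = {
    "Stock moves": 3.6,
    "Eye health": 13.9,
    "Vonn surgery": 4.1,
    "Iran war economy": 14.7,
    "Callahan/Giants": 4.7,
    "WiFi security": 9.2,
    "Marathon costs": 11.3,
}

smallest_failure = min(timed_out.values())

# Successes larger than the smallest failure break a clean size cutoff.
exceptions = {name: kb for name, kb in succeeded.items() if kb > smallest_failure}
clean_successes = [kb for kb in succeeded.values() if kb <= smallest_failure]

print(f"smallest timed-out file: {smallest_failure} KB")
print(f"largest success below that: {max(clean_successes)} KB")
print(f"successes above that: {exceptions}")
```

On this data the cutoff is not perfectly clean: the 10 KB Oscar article succeeded even though seven smaller or similar-sized markdown files timed out, so size is a strong signal rather than a strict rule.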
Why It Fails: Malformed JSON in a Retry Loop
The server logs reveal the failure mechanism. PageIndex requires the LLM to produce structured JSON output: tree nodes with summaries, hierarchical relationships, and relevance scores. The 0.8B model generates text that PageIndex's parser cannot interpret as valid JSON. The OpenAI client retries the request, the model fails again, and the cycle continues until the timeout.
15:11:08 INFO PageIndex: converting 'stocks-making-...md' to Markdown
15:21:09 INFO Retrying request to /chat/completions in 0.385s
15:26:08 INFO [next document starts -- previous timed out]
A 10-minute gap separates the start of processing from the first logged retry, and the next document starts five minutes after that. The 0.8B model spent the entire 15-minute timeout window producing output that could not be parsed. This is not a speed problem. It is a capability problem.
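One way to avoid burning a full timeout window on unparseable output is to validate each response and cap the number of attempts. The sketch below is not PageIndex's code, and the retry in the log above comes from the OpenAI client rather than from a loop like this one. `parse_json_or_none`, `call_with_bounded_retries`, and `broken_model` are hypothetical names, and the model is replaced by a stub so the example runs on its own.

```python
import json
from typing import Callable, Optional


def parse_json_or_none(text: str) -> Optional[dict]:
    """Return the parsed object if `text` is a JSON object, else None."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


def call_with_bounded_retries(generate: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call `generate` up to `max_attempts` times; raise if no attempt parses."""
    for _ in range(max_attempts):
        parsed = parse_json_or_none(generate())
        if parsed is not None:
            return parsed
    raise ValueError(f"no valid JSON object after {max_attempts} attempts")


# Hypothetical stand-in for a model that never closes its brackets.
def broken_model() -> str:
    return '{"title": "Stock moves", "nodes": [{"title": "Intro"'


try:
    call_with_bounded_retries(broken_model)
except ValueError as err:
    print(err)
```

With a cap like this, a model that cannot produce valid JSON fails after a fixed number of attempts instead of after the whole timeout. Note that this only bounds the number of calls; a single generation that runs for ten minutes still needs its own request-level timeout.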
Query: Complete Failure
With only 7 documents indexed (17 nodes), even the query phase should have been fast. It was not.
| Metric | Qwen3-32B | Qwen3.5-4B | Qwen3.5-0.8b |
|---|---|---|---|
| Queries completed | 7/16 | 5/16 | 0/16 |
| Queries timed out | 9/16 | 11/16 | 16/16 |
| Avg query time (completed) | 575s | 305s | N/A |
| Total query time | ~8,000s | 8,146s | 9,633s |
Zero queries completed. Every single one of the 16 queries timed out after 10 minutes. The 0.8B model could not perform query-time reasoning against PageIndex's tree structures, even for a corpus of only 7 documents and 17 nodes.
For comparison, the vector providers were unaffected:
| Provider | Avg Latency | Queries Completed |
|---|---|---|
| ollama | 1,692ms | 16/16 |
| pplx | 77ms | 16/16 |
| pplx-ctx | 249ms | 16/16 |
| pageindex (0.8B) | N/A | 0/16 |
Tree Structure
The 7 successfully indexed documents produced 17 nodes – compared to 68 nodes from 18 documents with the 4B model and 85 with the 32B model. The average of 2.4 nodes per document (vs 3.8 with 4B) suggests the 0.8B model generates extremely shallow trees, likely because it struggles to decompose documents into meaningful hierarchical sections.
| Metric | Qwen3-32B | Qwen3.5-4B | Qwen3.5-0.8b |
|---|---|---|---|
| Total nodes | 85 | 68 | 17 |
| Avg nodes per doc | 4.7 | 3.8 | 2.4 |
| Storage | 0.130 MB | 0.130 MB | 0.019 MB |
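The per-document averages in the table follow from the node and document counts, with one caveat worth making explicit: the 0.8B average is taken over the 7 documents that indexed, not over all 18. A short check, using only figures from the tables above (the variable names are illustrative):

```python
# (total nodes, documents successfully indexed) for each run.
runs = {
    "Qwen3-32B": (85, 18),
    "Qwen3.5-4B": (68, 18),
    "Qwen3.5-0.8b": (17, 7),  # only the 7 successful documents
}

avg_nodes = {model: round(nodes / docs, 1) for model, (nodes, docs) in runs.items()}

for model, avg in avg_nodes.items():
    print(f"{model}: {avg} nodes per indexed document")
```

Because the 0.8B figure covers only the smallest and easiest documents, it is not a like-for-like comparison with the 18-document averages of the larger models.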
The Three-Model Picture
Putting all three experiments together reveals a clear capability cliff:
| Dimension | Qwen3-32B (20 GB) | Qwen3.5-4B (3.4 GB) | Qwen3.5-0.8b (1.0 GB) |
|---|---|---|---|
| Ingest success | 18/18 (100%) | 18/18 (100%) | 7/18 (39%) |
| Ingest time | ~3,500s | 961s | 10,213s* |
| Query completion | 7/16 (44%) | 5/16 (31%) | 0/16 (0%) |
| Relevant results | 1/7 | 5/5 | N/A |
| Avg query (completed) | 9.6 min | 5.1 min | N/A |
| Total nodes | 85 | 68 | 17 |
| VRAM | 20 GB | 3.4 GB | 1.0 GB |
*Dominated by 11 timeout failures at 15 min each.
The progression from 32B to 4B was "faster and better." The progression from 4B to 0.8B is "non-functional." This is not a gradual degradation – it is a cliff.
Analysis
Structured Output Is the Bottleneck
PageIndex does not need a large model for general intelligence. It needs a model that can reliably produce structured JSON conforming to a specific schema. The 4B model can do this. The 0.8B model cannot. The failure mode is not "worse quality answers" – it is "no parseable output at all."
This makes intuitive sense. Structured output generation (following a schema, matching brackets, maintaining consistent key names) is a learned capability that requires a minimum number of parameters. Sub-1B models are typically strong at simple text generation but weak at constrained output formats, especially JSON with nested structures.
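To make "following a schema" concrete, here is a minimal sketch of a recursive check on a nested tree. The field names `title`, `summary`, and `nodes` are assumptions chosen for illustration, not PageIndex's actual schema, and `validate_node` is a hypothetical helper.

```python
import json

# Assumed schema for illustration: every node needs these keys and types.
REQUIRED_KEYS = {"title": str, "summary": str, "nodes": list}


def validate_node(node, path="root"):
    """Return a list of human-readable problems found in `node` and its children."""
    if not isinstance(node, dict):
        return [f"{path}: expected object, got {type(node).__name__}"]
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in node:
            problems.append(f"{path}: missing key '{key}'")
        elif not isinstance(node[key], expected):
            problems.append(f"{path}.{key}: expected {expected.__name__}")
    children = node.get("nodes")
    if isinstance(children, list):
        for i, child in enumerate(children):
            problems.extend(validate_node(child, f"{path}.nodes[{i}]"))
    return problems


good = json.loads(
    '{"title": "Doc", "summary": "s", "nodes": '
    '[{"title": "A", "summary": "a", "nodes": []}]}'
)
# Syntactically valid JSON, but the key name drifts and a key is dropped.
drifted = json.loads(
    '{"title": "Doc", "summary": "s", "nodes": [{"Title": "A", "summary": "a"}]}'
)

print(validate_node(good))
print(validate_node(drifted))
```

The second input parses as JSON yet still fails the check, which illustrates that consistent key names and complete nesting are separate requirements on top of balanced brackets.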
The Size/Capability Curve Is Not Linear
If capability scaled linearly with model size, the 0.8B model would be roughly 5x worse than the 4B model but still functional. Instead, we see a sharp transition:
- 32B: Works, slow, conservative in relevance judgments
- 4B: Works, faster, better relevance judgments
- 0.8B: Does not work
The minimum viable model for PageIndex lies somewhere between 0.8B and 4B parameters. Without testing intermediate sizes (e.g., Qwen3.5-1.5B or Qwen3.5-3B), we cannot pinpoint the exact threshold, but the practical recommendation is clear: do not go below 4B.
Speed Without Function Is Not Speed
The 0.8B model is fast when it works: six of the seven successfully indexed documents completed in under a second. But "fast at 39% of documents and 0% of queries" is not a speed improvement. The total benchmark time was 5.5 hours (10,213s uploads + 9,633s queries), longer than the 4B benchmark despite processing fewer documents, because timeout failures consume maximum time while producing no value.
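The arithmetic behind these totals can be reproduced from the figures reported earlier in this article. The sketch below uses only those numbers; the variable names are illustrative.

```python
UPLOAD_TIMEOUT_S = 15 * 60  # 15-minute upload timeout

# 0.8B run, figures from the ingest and query tables.
ingest_total_s = 10_213
ingest_failures = 11
query_total_s = 9_633

timeout_s = ingest_failures * UPLOAD_TIMEOUT_S   # time spent in failed uploads
successful_s = ingest_total_s - timeout_s        # time spent in the 7 successes
timeout_share = timeout_s / ingest_total_s

total_08b_h = (ingest_total_s + query_total_s) / 3600
total_4b_h = (961 + 8_146) / 3600                # 4B ingest + query totals

print(f"timeouts: {timeout_s}s of {ingest_total_s}s ({timeout_share:.0%})")
print(f"successful ingests: {successful_s}s, avg {successful_s / 7:.1f}s")
print(f"0.8B total: {total_08b_h:.1f} h, 4B total: {total_4b_h:.1f} h")
```

Timeouts account for about 97% of the 0.8B ingest time, and the 44.7s average for successful documents is itself dominated by the single 312-second file.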
The Firefox Finding Is Lost
The 4B model's most impressive result – connecting "Firefox privacy features" to a WiFi security article about VPN usage and browser HTTPS protections – required two things: (1) successfully indexing the WiFi article, and (2) successfully reasoning about the query at retrieval time. The 0.8B model failed at step 1. The article could not be indexed at all.
This underscores a key point: PageIndex's value proposition depends entirely on the LLM's reasoning capabilities. Strip those away and you have nothing – not even a degraded version of the approach, but a complete absence of results.
The Minimum Model Requirement
Our three experiments establish a practical boundary:
0.8B ----X---- too small (doesn't work)
???
4B ----✓---- works well (sweet spot)
???
32B ----✓---- works, but slower and less effective
The minimum viable model for PageIndex with Ollama-served Qwen models is between 0.8B and 4B parameters. For production use, 4B is the recommended minimum: it provides the best balance of speed, quality, and resource usage we have observed.
Practical Guidance
- Do not use sub-1B models for PageIndex. They cannot reliably generate the structured JSON output that PageIndex requires for tree building and query matching.
- 4B is the sweet spot. It outperformed 32B on speed and result relevance while using about 6x less VRAM. Going smaller does not save meaningful resources (1 GB vs 3.4 GB VRAM) but costs you all functionality.
- The failure mode we observed is binary, not gradual. The 0.8B run showed no "degraded quality" regime: the model either produces valid structured output or it does not. Sizes between 0.8B and 4B remain untested, so plan accordingly.
- If you need to go smaller than 4B, test exhaustively before deploying. The cliff between "works" and "doesn't work" is steep, and you will not get a graceful degradation warning.
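As one possible shape for that pre-deployment testing, the sketch below measures how often a candidate model returns parseable JSON over a set of prompts and gates on a pass rate. Everything here is hypothetical: `json_pass_rate`, `fake_model`, and the 95% threshold are illustrative assumptions, and the model is a stub so the example is self-contained. In practice the prompts should cover the full range of document sizes in your corpus.

```python
import json
from typing import Callable, Iterable


def json_pass_rate(generate: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Fraction of prompts for which `generate` returns parseable JSON."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    passed = 0
    for prompt in prompts:
        try:
            json.loads(generate(prompt))
            passed += 1
        except json.JSONDecodeError:
            pass
    return passed / len(prompts)


# Hypothetical stand-in: succeeds on short prompts, truncates on long ones.
def fake_model(prompt: str) -> str:
    return '{"ok": true}' if len(prompt) < 20 else '{"ok": tr'


MIN_PASS_RATE = 0.95  # assumed threshold, tune for your deployment

prompts = ["short doc", "x" * 50, "tiny", "y" * 200]
rate = json_pass_rate(fake_model, prompts)

print(f"pass rate: {rate:.0%}")
print("deploy" if rate >= MIN_PASS_RATE else "do not deploy")
```

A gate like this catches the failure before ingest starts, rather than after hours of timeouts.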
Conclusion
We set out to answer "how small can you go?" and found a clear answer: not this small. Qwen3.5-0.8b (1.0 GB) cannot perform PageIndex's structured tree-based retrieval. It failed to index 61% of documents, failed 100% of queries, and consumed more total time than the functional 4B model while producing zero usable results.
The key findings:
- There is a hard capability floor for structured RAG. Below some threshold between 0.8B and 4B parameters, LLMs cannot reliably generate the JSON structures that PageIndex requires. This is a capability cliff, not a gradual slope.
- The failure mode is malformed output, not slow output. The 0.8B model is fast when producing text, but it cannot constrain that text to valid JSON schemas. PageIndex enters a retry loop until timeout.
- The 4B model remains the optimal choice. It is the smallest model we tested that actually works, and it outperforms the 32B model on both speed and quality.
- Document size correlates with failure. Markdown files under ~2 KB succeeded; with one exception (a 10 KB file that took 312 seconds), markdown files of 3.6 KB and above timed out. This suggests the 0.8B model can handle very short structured prompts but loses coherence with longer contexts.
- VRAM savings are negligible compared to the loss. Going from 3.4 GB (4B) to 1.0 GB (0.8B) saves 2.4 GB of VRAM. That saving buys you a system that does not work.
The three-article progression – 32B to 4B to 0.8B – maps the full landscape of model size vs. PageIndex performance. The answer is a satisfying inverted U: too large is slow, too small is broken, and the sweet spot is surprisingly modest at 4B parameters.
Evaluated on: NVIDIA RTX 5090 (32 GB VRAM), Qwen3.5-0.8b via Ollama (1.0 GB), Milvus Standalone (vector providers), 18-document news corpus (7 successfully indexed), 16 queries across 9 topic categories. PageIndex completed 0 of 16 queries within the 10-minute timeout. Total benchmark duration: approximately 5.5 hours.
Disclaimer: Yes, I have used AI to help with the structure and flow of my write-up. I'm an engineer after all! The tests, knowledge, and findings are all real-world research and can be verified and reproduced on demand.