How Small Can You Go? Testing PageIndex with a Sub-1B Parameter Model

Kirk Ryan

06 Mar 2026 • 8 min read

Pushing the lower boundary: a 0.8B parameter model cannot reliably perform PageIndex's structured tree-based retrieval, which places the practical minimum somewhere between 0.8B and 4B parameters.

In our previous experiment, we found that shrinking PageIndex's LLM from Qwen3-32B (20 GB) to Qwen3.5-4B (3.4 GB) improved speed, retrieval quality, and VRAM usage. The smaller model was better on nearly every dimension we measured; the exception was query completion, where it finished 5 of 16 queries against the 32B model's 7.

The natural follow-up question: how far down can we go? If 4B is better than 32B, is 0.8B better still? Qwen3.5-0.8b (1.0 GB) has 5x fewer parameters than the 4B model and 40x fewer than the 32B model. This experiment tests whether a sub-1B parameter model can still perform the structured reasoning tasks that PageIndex requires: parsing documents into hierarchical trees, generating summaries for each node, and matching queries against those summaries at retrieval time.

The answer is unambiguous: no, it cannot.

The Experiment

What Changed

Setting	32B Run	4B Run	This Run
PageIndex model	Qwen3-32B (20 GB)	Qwen3.5-4B (3.4 GB)	Qwen3.5-0.8b (1.0 GB)
Model generation	Qwen3	Qwen3.5	Qwen3.5
Vector providers	Same	Same	Same
Corpus	18 documents	18 documents	18 documents
Queries	16 queries	16 queries	16 queries
Hardware	RTX 5090	RTX 5090	RTX 5090
Upload timeout	N/A	15 min	15 min
Query timeout	10 min	10 min	10 min

Same corpus, same queries, same hardware, same vector provider data. Only the PageIndex model changed.

Results

Ingest: Catastrophic Failure

The 4B model ingested all 18 documents. The 0.8B model ingested 7.

Metric	Qwen3-32B	Qwen3.5-4B	Qwen3.5-0.8b
Documents ingested	18/18	18/18	7/18 (39%)
Total ingest time	~3,500s	961s	10,213s
Ingest failures	0	0	11
Avg time (successful)	~194s	53s	44.7s
Total nodes	85	68	17

Most of the 7 successful documents were tiny files:

Document	Format	Size	0.8B Time	4B Time	Nodes
AI boom article	MD	0.6 KB	0.1s	<1s	2
Oscar stunt design	MD	10 KB	312.4s	8.7s	3
Cortisol supplements	PPTX	6 KB	<1s	<1s	2
View from nowhere	MD	0.7 KB	0.1s	<1s	3
Funny scene extraction	DOCX	41 KB	0.1s	<1s	1
3D perception paper	MD	1.7 KB	<1s	<1s	3
Pritzker/Epstein	MD	0.9 KB	0.1s	<1s	3

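The pattern is stark: the four markdown files under ~2 KB succeeded almost instantly, as did a 6 KB PPTX and a 41 KB DOCX that produced only one or two nodes each. One 10 KB markdown file succeeded after 312 seconds (36x slower than the 4B model). Everything else, including a 3.6 KB markdown file, timed out after 15 minutes.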

The 11 failures include documents that the 4B model processed in seconds:

Document	Format	Size	0.8B Result	4B Time
Stock moves	MD	3.6 KB	TIMEOUT	28s
Muppet figures	DOCX	212 KB	TIMEOUT	18s
Eye health	MD	13.9 KB	TIMEOUT	37s
American censorship	PDF	174 KB	TIMEOUT	172s
Vonn surgery	MD	4.1 KB	TIMEOUT	10s
Iran war economy	MD	14.7 KB	TIMEOUT	15s
Juventus/Roma	PPTX	42.8 KB	TIMEOUT	22s
Callahan/Giants	MD	4.7 KB	TIMEOUT	7s
Nothing/anything	PDF	330 KB	TIMEOUT	584s
WiFi security	MD	9.2 KB	TIMEOUT	21s
Marathon costs	MD	11.3 KB	TIMEOUT	37s
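
Restricting the two tables above to markdown files, where on-disk size is presumably a fair proxy for text length, shows that size alone does not cleanly separate successes from timeouts. The snippet below only re-tabulates figures already listed in the tables; the variable names are ours.

```python
# (size in KB, succeeded) for the 12 markdown files in the two tables above
markdown_docs = [
    (0.6, True), (10.0, True), (0.7, True), (1.7, True), (0.9, True),
    (3.6, False), (13.9, False), (4.1, False), (14.7, False),
    (4.7, False), (9.2, False), (11.3, False),
]

largest_success = max(size for size, ok in markdown_docs if ok)
smallest_timeout = min(size for size, ok in markdown_docs if not ok)

# Files whose size falls inside this overlap cannot be classified by size alone.
overlap = sorted(
    size for size, _ in markdown_docs
    if smallest_timeout <= size <= largest_success
)

print(f"largest success: {largest_success} KB")
print(f"smallest timeout: {smallest_timeout} KB")
print(f"sizes in the ambiguous range: {overlap}")
```

The 10 KB success is the only markdown file above 2 KB that completed, so the overlap is driven by a single document.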

The WiFi security article – the star of the 4B benchmark, where PageIndex found a connection between "Firefox privacy features" and VPN/browser security that every vector provider missed – could not even be indexed.

Why It Fails: Malformed JSON in a Retry Loop

The server logs reveal the failure mechanism. PageIndex requires the LLM to produce structured JSON output: tree nodes with summaries, hierarchical relationships, and relevance scores. The 0.8B model generates text that PageIndex's parser cannot interpret as valid JSON. The OpenAI client retries the request, the model fails again, and the cycle continues until the timeout.

15:11:08 INFO  PageIndex: converting 'stocks-making-...md' to Markdown
15:21:09 INFO  Retrying request to /chat/completions in 0.385s
15:26:08 INFO  [next document starts -- previous timed out]

The first retry is logged 10 minutes after processing starts, and the next document begins 5 minutes later, exactly when the 15-minute upload timeout expires. The 0.8B model spent the entire timeout window producing output that could not be parsed. This is not a speed problem. It is a capability problem.
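
One way to keep a malformed-output loop from consuming the whole timeout window is to cap the number of attempts and fail fast. The sketch below is a minimal illustration under stated assumptions, not PageIndex's actual code: `generate` is a hypothetical callable that performs one LLM request and returns the raw text, and `max_attempts` is an assumed setting.

```python
import json
from typing import Callable, Optional


def parse_json_or_none(text: str) -> Optional[dict]:
    """Return the parsed object if `text` is a JSON object, else None."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


def generate_json(generate: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call `generate` until it yields a JSON object, at most `max_attempts` times.

    `generate` is a hypothetical stand-in for one LLM request.
    Raises ValueError rather than retrying until an outer timeout fires.
    """
    for _ in range(max_attempts):
        parsed = parse_json_or_none(generate())
        if parsed is not None:
            return parsed
    raise ValueError(f"no valid JSON object after {max_attempts} attempts")
```

With a cap like this, a document that the model cannot structure fails after a handful of requests instead of occupying the full 15-minute window.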

Query: Complete Failure

With only 7 documents indexed (17 nodes), even the query phase should have been fast. It was not.

Metric	Qwen3-32B	Qwen3.5-4B	Qwen3.5-0.8b
Queries completed	7/16	5/16	0/16
Queries timed out	9/16	11/16	16/16
Avg query time (completed)	575s	305s	N/A
Total query time	~8,000s	8,146s	9,633s

Zero queries completed. Every single one of the 16 queries timed out after 10 minutes. The 0.8B model could not perform query-time reasoning against PageIndex's tree structures even for a 7-document, 17-node corpus, so a larger corpus is unlikely to fare better.

For comparison, the vector providers were unaffected:

Provider	Avg Latency	Queries Completed
ollama	1,692ms	16/16
pplx	77ms	16/16
pplx-ctx	249ms	16/16
pageindex (0.8B)	N/A	0/16

Tree Structure

The 7 successfully indexed documents produced 17 nodes, compared to 68 nodes from 18 documents with the 4B model and 85 with the 32B model. The average of 2.4 nodes per document (vs 3.8 with 4B) suggests the 0.8B model generates extremely shallow trees, likely because it struggles to decompose documents into meaningful hierarchical sections. The comparison is confounded, however: the documents that survived ingest were mostly the shortest ones, which would yield few nodes with any model.

Metric	Qwen3-32B	Qwen3.5-4B	Qwen3.5-0.8b
Total nodes	85	68	17
Avg nodes per doc	4.7	3.8	2.4
Storage	0.130 MB	0.130 MB	0.019 MB

The Three-Model Picture

Putting all three experiments together reveals a clear capability cliff:

Dimension	Qwen3-32B (20 GB)	Qwen3.5-4B (3.4 GB)	Qwen3.5-0.8b (1.0 GB)
Ingest success	18/18 (100%)	18/18 (100%)	7/18 (39%)
Ingest time	~3,500s	961s	10,213s*
Query completion	7/16 (44%)	5/16 (31%)	0/16 (0%)
Relevant results	1/7	5/5	N/A
Avg query (completed)	9.6 min	5.1 min	N/A
Total nodes	85	68	17
VRAM	20 GB	3.4 GB	1.0 GB

*Dominated by 11 timeout failures at 15 min each.

The progression from 32B to 4B was "faster and better." The progression from 4B to 0.8B is "non-functional." This is not a gradual degradation – it is a cliff.

Analysis

Structured Output Is the Bottleneck

PageIndex does not need a large model for general intelligence. It needs a model that can reliably produce structured JSON conforming to a specific schema. The 4B model can do this. The 0.8B model cannot. The failure mode is not "worse quality answers" – it is "no parseable output at all."

This makes intuitive sense. Structured output generation (following a schema, matching brackets, maintaining consistent key names) is a learned capability that appears to require a minimum level of model capacity. Sub-1B models are typically strong at simple text generation but weak at constrained output formats, especially JSON with nested structures.
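
To make "conforming to a specific schema" concrete, here is a minimal validation sketch. The node shape (`title`, `summary`, and an optional `nodes` list of children) is an assumption chosen for illustration; PageIndex's real schema may differ.

```python
import json
from typing import Any, List


def schema_errors(node: Any, path: str = "root") -> List[str]:
    """Recursively collect schema violations for a hypothetical tree node."""
    if not isinstance(node, dict):
        return [f"{path}: expected object, got {type(node).__name__}"]
    errors = []
    for key in ("title", "summary"):
        if not isinstance(node.get(key), str):
            errors.append(f"{path}.{key}: missing or not a string")
    children = node.get("nodes", [])
    if not isinstance(children, list):
        errors.append(f"{path}.nodes: expected list")
        return errors
    for i, child in enumerate(children):
        errors.extend(schema_errors(child, f"{path}.nodes[{i}]"))
    return errors


# Valid JSON in both cases; the second drifts on a key name ("Title").
good = json.loads(
    '{"title": "Doc", "summary": "s", "nodes": [{"title": "A", "summary": "a"}]}'
)
drifted = json.loads(
    '{"title": "Doc", "summary": "s", "nodes": [{"Title": "A", "summary": "a"}]}'
)

print(schema_errors(good))
print(schema_errors(drifted))
```

The second example parses as JSON but still fails the check, which is the kind of key-name inconsistency described above.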

The Size/Capability Curve Is Not Linear

If performance scaled linearly with model size, the 0.8B model (one fifth the parameters of the 4B model) would be proportionally less accurate but still functional. Instead, we see a sharp transition:

32B: Works, slow, conservative in relevance judgments
4B: Works, faster, better relevance judgments
0.8B: Does not work

The minimum viable model for PageIndex lies somewhere between 0.8B and 4B parameters. Without testing intermediate sizes (e.g., Qwen3.5-1.5B or Qwen3.5-3B), we cannot pinpoint the exact threshold, but the practical recommendation is clear: do not go below 4B.

Speed Without Function Is Not Speed

The 0.8B model is fast when it works: six of the seven successfully indexed documents completed in under a second. But "fast at 39% of documents and 0% of queries" is not a speed improvement. The total benchmark time was 5.5 hours (10,213s uploads + 9,633s queries), longer than the 4B benchmark despite processing fewer documents, because timeout failures consume maximum time while producing no value.
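
The timing figures can be cross-checked with a few lines of arithmetic. The inputs below are taken from the tables in this article; the variable names are ours.

```python
UPLOAD_TIMEOUT_S = 15 * 60  # 15-minute upload timeout
QUERY_TIMEOUT_S = 10 * 60   # 10-minute query timeout

ingest_total_s = 10_213     # total ingest time, 0.8B run
ingest_timeouts = 11        # documents that hit the upload timeout
query_total_s = 9_633       # total query time, 0.8B run
query_timeouts = 16         # queries that hit the query timeout

ingest_timeout_s = ingest_timeouts * UPLOAD_TIMEOUT_S  # 9,900 s
query_timeout_s = query_timeouts * QUERY_TIMEOUT_S     # 9,600 s

total_s = ingest_total_s + query_total_s
timeout_share = (ingest_timeout_s + query_timeout_s) / total_s

print(f"total: {total_s} s = {total_s / 3600:.1f} h")
print(f"time spent waiting on timeouts: {timeout_share:.0%}")
```

Roughly 98% of the run was spent waiting for requests that never produced a usable result.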

The Firefox Finding Is Lost

The 4B model's most impressive result – connecting "Firefox privacy features" to a WiFi security article about VPN usage and browser HTTPS protections – required two things: (1) successfully indexing the WiFi article, and (2) successfully reasoning about the query at retrieval time. The 0.8B model failed at step 1. The article could not be indexed at all.

This underscores a key point: PageIndex's value proposition depends entirely on the LLM's reasoning capabilities. Strip those away and you have nothing – not even a degraded version of the approach, but a complete absence of results.

The Minimum Model Requirement

Our three experiments establish a practical boundary:

0.8B  ----X---- too small (doesn't work)
       ???
4B    ----✓---- works well (sweet spot)
       ???
32B   ----✓---- works, but slower and less effective

The minimum viable model for PageIndex with Ollama-served Qwen models lies somewhere between 0.8B and 4B parameters. For production use, 4B is the recommended minimum: it provides the best balance of speed, quality, and resource usage we have observed.

Practical Guidance

Do not use sub-1B models for PageIndex. They cannot reliably generate the structured JSON output that PageIndex requires for tree building and query matching.
4B is the sweet spot. It beat 32B on speed and relevance while using about 6x less VRAM, although it completed fewer queries (5/16 vs 7/16). Going smaller does not save meaningful resources (1 GB vs 3.4 GB VRAM) but costs you all functionality.
The failure mode we observed is binary, not gradual. We saw no "degraded quality" regime at 0.8B: the model either produces valid structured output or it does not. Plan accordingly.
If you need to go smaller than 4B, test exhaustively before deploying. The cliff between "works" and "doesn't work" is steep, and you will not get a graceful degradation warning.
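
A pre-deployment check along these lines can be sketched as a pass-rate measurement over representative prompts. This is a minimal sketch, not a PageIndex feature: `generate` is a hypothetical function wrapping the candidate model, and `budget_s` is an assumed per-prompt time budget.

```python
import json
import time
from typing import Callable, Iterable


def json_pass_rate(generate: Callable[[str], str], prompts: Iterable[str],
                   budget_s: float = 60.0) -> float:
    """Fraction of prompts answered with parseable JSON within `budget_s` each.

    `generate` is a hypothetical wrapper around the candidate model.
    The budget is checked after each call returns; it does not interrupt it.
    """
    passed = total = 0
    for prompt in prompts:
        total += 1
        start = time.monotonic()
        text = generate(prompt)
        elapsed = time.monotonic() - start
        try:
            json.loads(text)
        except json.JSONDecodeError:
            continue  # unparseable output counts as a failure
        if elapsed <= budget_s:
            passed += 1
    return passed / total if total else 0.0
```

Running this against prompts built from real documents of varying length, and requiring a rate close to 1.0, would surface the kind of cliff seen here before any production traffic does.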

Conclusion

We set out to answer "how small can you go?" and found a clear answer: not this small. Qwen3.5-0.8b (1.0 GB) cannot perform PageIndex's structured tree-based retrieval. It failed to index 61% of documents, failed 100% of queries, and consumed more total time than the functional 4B model while producing zero usable results.

The key findings:

There is a hard capability floor for structured RAG. Below some threshold between 0.8B and 4B parameters, LLMs cannot reliably generate the JSON structures that PageIndex requires. This is a capability cliff, not a gradual slope.
The failure mode is malformed output, not slow output. The 0.8B model is fast when producing text, but it cannot constrain that text to valid JSON schemas. PageIndex enters a retry loop until timeout.
The 4B model remains the optimal choice. It is the smallest model we tested that actually works, and it outperforms the 32B model on both speed and quality.
Document size correlates with failure. Markdown files under ~2 KB succeeded; all but one above ~3.5 KB timed out (the exception, a 10 KB file, took 312 seconds). This suggests the 0.8B model can handle very short structured prompts but loses coherence with longer contexts.
VRAM savings are negligible compared to the loss. Going from 3.4 GB (4B) to 1.0 GB (0.8B) saves 2.4 GB of VRAM. That saving buys you a system that does not work.

The three-article progression – 32B to 4B to 0.8B – maps the full landscape of model size vs. PageIndex performance. The answer is a satisfying inverted U: too large is slow, too small is broken, and the sweet spot is surprisingly modest at 4B parameters.

Evaluated on: NVIDIA RTX 5090 (32 GB VRAM), Qwen3.5-0.8b via Ollama (1.0 GB), Milvus Standalone (vector providers), 18-document news corpus (7 successfully indexed), 16 queries across 9 topic categories. PageIndex completed 0 of 16 queries within the 10-minute timeout. Total benchmark duration: approximately 5.5 hours.

Disclaimer: Yes, I have used AI to help with the structure and flow of my write-up. I'm an engineer after all! The tests, knowledge and findings are all real-world research and can be verified and reproduced on demand.