Does Model Size Matter for Vectorless RAG? Testing PageIndex with a 4B Parameter Model
A follow-up experiment: swapping Qwen3-32B for Qwen3.5-4B in PageIndex to measure how LLM size affects tree-based retrieval speed and quality.
In our previous evaluation, we found that PageIndex – a vectorless RAG approach that uses LLM reasoning instead of embedding similarity – delivered impressive precision but at a devastating cost: 9.6-minute average query times with Qwen3-32B. The system was accurate when it completed, but 56% of queries timed out before finishing.
The obvious question: was the 32B model overkill? PageIndex uses the LLM for two tasks that do not obviously require a large model: summarizing document sections during indexing, and matching queries against summaries during retrieval. These are structured reasoning tasks with clear instructions, not open-ended generation. A smaller model might handle them just as well – and much faster.
We swapped Qwen3-32B (20 GB) for Qwen3.5-4B (3.4 GB) and re-ran the same benchmark. The results challenge some assumptions about when model size matters.
The Experiment
What Changed
| | Previous Run | This Run |
|---|---|---|
| PageIndex model | Qwen3-32B (20 GB) | Qwen3.5-4B (3.4 GB) |
| Model generation | Qwen3 | Qwen3.5 |
| Vector providers | Same (ollama, pplx, pplx-ctx) | Same |
| Corpus | 18 documents, 9 topics | 18 documents, 9 topics |
| Queries | 16 queries | 16 queries |
| Hardware | RTX 5090 (32 GB VRAM) | Same |
| Timeout | 10 minutes | 10 minutes |
Everything except the PageIndex model stayed the same. Same 16 queries, same hardware, same vector provider data, same timeout thresholds.
The model shift is significant: Qwen3.5-4B has 8x fewer parameters than Qwen3-32B, uses 6x less VRAM (3.4 GB vs 20 GB), and belongs to a newer model generation (3.5 vs 3). The question is whether those 28 billion fewer parameters cost us retrieval quality.
Corpus
The corpus comprised 18 documents (2 per topic) across 9 categories, with a mix of formats:
| Format | Count | Notes |
|---|---|---|
| Markdown | 12 | News articles, opinion pieces, technical content |
| DOCX | 2 | Research summaries, reviews |
| PPTX | 2 | Sports commentary, health supplements |
| PDF | 2 | Magazine article, academic paper |
| Total | 18 | 0.78 MB source data |
Results
Ingest Performance
Both models ingested every document successfully. The dramatic improvement is in how long it took.
| Metric | Qwen3-32B | Qwen3.5-4B | Change |
|---|---|---|---|
| Documents ingested | 18/18 | 18/18 | Same |
| Total ingest time | ~3,500s | 961s | 3.6x faster |
| Average per document | ~194s | 53s | 3.7x faster |
| Fastest document | ~7s | <1s | >7x faster |
| Slowest document | ~459s | 585s | Comparable |
The per-document times tell the story:
| Document | Format | Size | 32B Time | 4B Time | Speedup |
|---|---|---|---|---|---|
| 1KB AI boom article | MD | 0.6 KB | 12s | <1s | >12x |
| Eye health article | MD | 14 KB | 214s | 37s | 5.8x |
| Stunt design Oscar | MD | 10 KB | 69s | 9s | 7.7x |
| WiFi security article | MD | 9 KB | 21s | 21s | 1.0x |
| Cortisol supplements | PPTX | 6 KB | 8s | <1s | >8x |
| American censorship | PDF | 174 KB | 172s | 172s | 1.0x |
The pattern is clear: the 4B model processes text faster, but parsing time dominates for PDFs. For pure text-to-tree operations (Markdown, DOCX), the speedup typically ranges from 5-12x, although the WiFi security article (Markdown) showed no speedup at all. For PDFs, where the bottleneck is Docling extraction and PageIndex's PDF parser rather than LLM inference, both models take similar time.
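The speedup column can be recomputed from the two timing columns. The snippet below is a minimal sketch using the rows from the table above. The `<1s` entries are represented as 1.0 second, an assumption made only so a ratio can be computed, which makes those speedups lower bounds.

```python
# Per-document ingest times in seconds, copied from the table above.
# "<1s" entries are entered as 1.0 (assumption), so their ratios are lower bounds.
ingest_times = {
    "1KB AI boom article": (12, 1.0),
    "Eye health article": (214, 37),
    "Stunt design Oscar": (69, 9),
    "WiFi security article": (21, 21),
    "Cortisol supplements": (8, 1.0),
    "American censorship": (172, 172),
}


def speedup(t_32b: float, t_4b: float) -> float:
    """Return how many times faster the 4B run was than the 32B run."""
    return t_32b / t_4b


for name, (t_32b, t_4b) in ingest_times.items():
    print(f"{name:24s} {speedup(t_32b, t_4b):5.1f}x")
```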
Tree Structure
The smaller model built slightly different tree structures:
| Metric | Qwen3-32B | Qwen3.5-4B |
|---|---|---|
| Total nodes | 85 | 68 |
| Avg nodes per document | 4.7 | 3.8 |
| Storage | 0.130 MB | 0.130 MB |
The 4B model produced 20% fewer nodes. This is expected – a smaller model generates more concise summaries and is less likely to split content into deep hierarchies. Whether this is better or worse depends on the documents; fewer nodes means faster query-time search (fewer things to evaluate per document) but potentially less granular retrieval.
Query Latency
This is the metric that matters most for PageIndex's viability.
| Metric | Qwen3-32B | Qwen3.5-4B | Change |
|---|---|---|---|
| Avg query time (completed) | 575s (9.6 min) | 305s (5.1 min) | 1.9x faster |
| Fastest query | 545s | 174s | 3.1x faster |
| Slowest query | 590s | 529s | 1.1x faster |
| Queries completed | 7/16 | 5/16 | Fewer |
| Queries timed out | 9/16 | 11/16 | More |
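The speed improvement is real but uneven. The fastest query ("Firefox privacy features") took just 174 seconds with the 4B model – 3.1x faster than the 32B model's fastest completed query (545 seconds); the 32B model timed out on the Firefox query itself. But the timeout rate actually increased from 56% to 69%.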
Why? The 4B model is faster per LLM call (~8s vs ~30s), but it occasionally produces responses that require retries or takes longer to converge on structured JSON output. The net effect: faster when things work, but slightly less reliable at staying within the timeout window for all 18 documents.
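To make the retry cost concrete, here is a minimal sketch of a retry-until-valid-JSON loop. It is not PageIndex's actual code: `call_llm` is a hypothetical stand-in for whatever client sends the prompt to the model, and `max_attempts` is an illustrative parameter.

```python
import json
from typing import Callable, Optional


def ask_for_json(call_llm: Callable[[str], str], prompt: str,
                 max_attempts: int = 3) -> Optional[dict]:
    """Call the model until it returns a parseable JSON object, or give up.

    `call_llm` is a hypothetical callable, not a PageIndex or Ollama API.
    """
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: pay for another full LLM call
        if isinstance(parsed, dict):
            return parsed
    return None


# Toy model that fails once before producing valid JSON.
replies = iter(['Sure! Here is the JSON: {', '{"relevant": true, "score": 0.9}'])
result = ask_for_json(lambda _prompt: next(replies), "Is this section relevant?")
print(result)
```

Under this model, every malformed response costs one more full call, so a document that needs two retries takes about as long as three documents that succeed on the first attempt.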
Extrapolating from the successful queries:
| Corpus Size | Qwen3-32B Est. | Qwen3.5-4B Est. |
|---|---|---|
| 18 documents | 9.6 min | 5.1 min |
| 50 documents | 27 min | 14 min |
| 100 documents | 53 min | 28 min |
| 500 documents | 4.4 hours | 2.4 hours |
The smaller model cuts estimated query times roughly in half, but does not change the fundamental O(N) scaling. A 500-document corpus still takes hours, not seconds.
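The extrapolation is plain linear scaling from the 18-document average. A minimal sketch of that arithmetic follows; it assumes query time grows strictly in proportion to document count, which the benchmark did not verify at larger corpus sizes.

```python
def estimate_query_seconds(measured_seconds: float, measured_docs: int,
                           target_docs: int) -> float:
    """Scale a measured query time linearly with corpus size (O(N) assumption)."""
    return measured_seconds / measured_docs * target_docs


# Average completed-query times on the 18-document corpus, from the latency table.
measured = {"Qwen3-32B": 575, "Qwen3.5-4B": 305}

for docs in (18, 50, 100, 500):
    row = ", ".join(
        f"{model}: {estimate_query_seconds(secs, 18, docs) / 60:.1f} min"
        for model, secs in measured.items()
    )
    print(f"{docs:>3} documents -> {row}")
```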
Retrieval Quality
This is the most surprising result: the 4B model was more aggressive at finding relevant content – and found genuinely useful matches that every other provider missed.
| Query | 32B Result | 4B Result |
|---|---|---|
| AI military risks | Science roundup (1.0) | Iran war article (1.0) |
| Silent chip defects | TIMEOUT | AI boom article (1.0) |
| Firefox privacy features | TIMEOUT | WiFi security article (1.0) |
| iPad Air M4 chip | Nothing found | TIMEOUT |
| Clouds in Antarctica | Nothing found | TIMEOUT |
| Waking up early | TIMEOUT | TIMEOUT |
| Celestial hamburgers | TIMEOUT | TIMEOUT |
| Inherited diseases | TIMEOUT | TIMEOUT |
| Ukraine war drones | Nothing found | TIMEOUT |
| Iran threats | Nothing found | Iran war article (1.0) |
| Trump AI executive orders | TIMEOUT | AI boom article (1.0) |
| MLS rule changes | Nothing found | TIMEOUT |
| Warner Bros sale | Nothing found | TIMEOUT |
| Banks data centers | TIMEOUT | TIMEOUT |
| AI safety autonomous weapons | TIMEOUT | TIMEOUT |
| Tech changing media | TIMEOUT | TIMEOUT |
Of the queries that completed, the 4B model returned results for 5 out of 5. The 32B model returned results for 1 out of 7. The smaller model has a lower relevance threshold – it is more willing to identify connections between queries and documents.
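These counts can be checked by tallying the table directly. The sketch below transcribes each row as one of three outcomes; the labels `timeout`, `nothing`, and `hit` are shorthand introduced here, not PageIndex output.

```python
from collections import Counter

TIMEOUT, NOTHING, HIT = "timeout", "nothing", "hit"

# (32B outcome, 4B outcome) per query, transcribed from the table above.
outcomes = {
    "AI military risks": (HIT, HIT),
    "Silent chip defects": (TIMEOUT, HIT),
    "Firefox privacy features": (TIMEOUT, HIT),
    "iPad Air M4 chip": (NOTHING, TIMEOUT),
    "Clouds in Antarctica": (NOTHING, TIMEOUT),
    "Waking up early": (TIMEOUT, TIMEOUT),
    "Celestial hamburgers": (TIMEOUT, TIMEOUT),
    "Inherited diseases": (TIMEOUT, TIMEOUT),
    "Ukraine war drones": (NOTHING, TIMEOUT),
    "Iran threats": (NOTHING, HIT),
    "Trump AI executive orders": (TIMEOUT, HIT),
    "MLS rule changes": (NOTHING, TIMEOUT),
    "Warner Bros sale": (NOTHING, TIMEOUT),
    "Banks data centers": (TIMEOUT, TIMEOUT),
    "AI safety autonomous weapons": (TIMEOUT, TIMEOUT),
    "Tech changing media": (TIMEOUT, TIMEOUT),
}


def summarize(column: int) -> dict:
    """Count completed queries, hits, and timeouts for one model column."""
    counts = Counter(pair[column] for pair in outcomes.values())
    completed = counts[HIT] + counts[NOTHING]
    return {"completed": completed, "hits": counts[HIT], "timeouts": counts[TIMEOUT]}


print("32B:", summarize(0))
print("4B: ", summarize(1))
```

The tally also shows that only two queries ("AI military risks" and "Iran threats") completed on both models, so the direct head-to-head sample is small.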
The Firefox Query
The standout result: for "Firefox privacy features," the 4B model found a WiFi security article discussing VPN services, browser security, and HTTPS protections. This is a genuinely relevant match – the article directly addresses browser privacy and network security, which is exactly what someone searching for "Firefox privacy features" would want to read about.
Every vector provider missed this connection entirely:
- ollama: returned an eye health article (score 0.574)
- pplx: returned a stock market after-hours article (score 0.210)
- pplx-ctx: returned the same stock market article (score 0.271)
The LLM understood that "Firefox privacy features" relates to browser security and VPN usage. No embedding model made that semantic leap.
The 4B model even returned multiple relevant sections from the WiFi article, ranked by specificity:
- "What matters more when it comes to public Wi-Fi security" (score 1.0)
- "The general internet is, too" – discussing VPN connection options (score 0.9)
- "Wi-Fi's built-in protections and HTTPS aren't foolproof" (score 0.8)
The Iran Threats Query
For "Iran threats," the 4B model found "The Iran War Is Jeopardizing the Entire Global Economy" – an article that discusses the geopolitical and economic impact of conflict in the Gulf. The 32B model, running on the same query, returned nothing. Meanwhile, ollama's vector search returned an article about Snoop Dogg performing in a Welsh canteen (score 0.498).
False Positives vs. False Relevance
The 4B model's lower relevance threshold does introduce a risk. For "Silent chip defects," it returned the AI boom article, reasoning that discussions of AI infrastructure and chip demand are related to semiconductor manufacturing. This is more of a stretch – the article discusses the economics of AI, not chip defects per se. A stricter model might (correctly) have returned nothing.
This reveals an interesting trade-off between model sizes: the 32B model was too conservative (returned nothing for most queries), while the 4B model was more liberal in its interpretation of relevance. For RAG applications, the 4B model's behavior is arguably more useful – returning loosely related content is better than returning nothing, as long as the downstream LLM can assess relevance in context.
Vector Provider Comparison
The vector providers performed identically to the previous benchmark (same data, same queries):
| Provider | Avg Score | Avg Latency | Behavior |
|---|---|---|---|
| ollama | 0.532 | 1,240ms | Always returns something, frequent false positives |
| pplx | 0.207 | 64ms | Lower scores, better semantic matching |
| pplx-ctx | 0.236 | 45ms | Similar to pplx |
| pageindex (4B) | 1.000* | 305,072ms | Returns results only when confident, high precision |
*Average of completed queries only.
The latency gap between vector providers and PageIndex narrowed but remains enormous: pplx-ctx returns in 45ms, PageIndex in roughly 305,000ms. That is a difference of about 6,800x.
Analysis
Model Size Is Not the Bottleneck
The 4B model is 1.9x faster per query than the 32B model, but the fundamental problem is unchanged: PageIndex makes one LLM call per document per query. With 18 documents and ~8 seconds per call, the minimum query time is ~144 seconds, so at this per-call latency an 18-document query cannot drop below roughly 2.4 minutes. A smaller model lowers the per-call cost, but the dominant factor is the number of sequential LLM calls, not the speed of each individual call.
To put it differently: even with a hypothetical model that responds instantly, PageIndex would still need to serialize 18 separate API calls. The architecture is O(N) in the number of documents, and reducing the constant factor from 30 seconds (32B) to 8 seconds (4B) helps but does not change the scaling class.
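The floor described here is simply the document count multiplied by the per-call latency. The sketch below computes it for both models and compares it with the measured averages. The per-call figures are the approximate values quoted earlier (~30s and ~8s), not separately instrumented measurements.

```python
def sequential_floor_seconds(n_documents: int, seconds_per_call: float) -> float:
    """Lower bound on query time when each document costs one serial LLM call."""
    return n_documents * seconds_per_call


# Approximate per-call latencies and measured averages reported in this article.
models = {
    "Qwen3-32B": {"per_call": 30, "measured_avg": 575},
    "Qwen3.5-4B": {"per_call": 8, "measured_avg": 305},
}

for name, m in models.items():
    floor = sequential_floor_seconds(18, m["per_call"])
    ratio = m["measured_avg"] / floor
    print(f"{name}: floor {floor:.0f}s, measured {m['measured_avg']}s "
          f"({ratio:.1f}x the floor)")
```

On these numbers the 4B model's measured average sits about twice as far above its floor as the 32B model's does. That would be consistent with the retries mentioned earlier, though the benchmark does not break the time down that way.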
Smaller Models, Better Retrieval?
The counterintuitive finding – that the 4B model returned more relevant results – deserves scrutiny. Two hypotheses:
1. Lower confidence threshold: Smaller models tend to be less calibrated in their uncertainty. The 4B model may simply be more willing to say "yes, this is relevant" when the connection is indirect. For the WiFi/Firefox query, this worked brilliantly. For the chip defects/AI boom query, it was a stretch.
2. Qwen3.5 vs Qwen3 generation improvement: We are not just comparing model sizes – we are comparing model generations. Qwen3.5 (the 4B model) may have better instruction following than Qwen3 (the 32B model), producing more reliable structured JSON output and more consistent relevance judgments. The improvement may come from training data and architecture refinements, not from the parameter count.
We cannot cleanly separate these factors without testing Qwen3.5-32B (not available via Ollama at the time of testing). But the practical result is clear: for PageIndex's use case, the 4B model was faster, used far less VRAM, and returned more relevant results, although it completed fewer queries within the timeout.
The VRAM Dividend
The 4B model uses 3.4 GB of VRAM vs 20 GB for the 32B model. On our RTX 5090 (32 GB), this is the difference between leaving room for other GPU workloads and consuming the majority of available memory. In a production scenario where PageIndex runs alongside embedding models and other GPU-intensive services, the smaller model's footprint is a significant practical advantage.
Recommendations
Use the Smallest Model That Works
Our results suggest that PageIndex's retrieval quality does not scale with model size – at least not in the direction you might expect. The 4B model found more relevant documents, built trees faster, and used a fraction of the VRAM. Unless you have evidence that your specific documents require stronger reasoning capabilities, start small.
The Fundamental Constraint Remains
Even with the 4B model, PageIndex queries take 5+ minutes for 18 documents. The approach remains impractical for corpus-level search. The recommendation from our previous article still holds: use PageIndex for deep analysis of a small, pre-filtered document set, not as a primary search engine.
A Revised Hybrid Architecture
The 4B model's performance makes the hybrid approach more practical:
Stage 1: Vector search (pplx-ctx) → top 5 candidates in 45ms
Stage 2: PageIndex (4B) → deep analysis of 5 documents in ~40 seconds
With only 5 documents to analyze, the 4B model's per-document time of ~8 seconds gives a Stage 2 latency of approximately 40 seconds. This is long for an interactive application but acceptable for a "deep search" feature or asynchronous analysis pipeline.
Compare this to the 32B model, where Stage 2 would take ~150 seconds for the same 5 documents. The 4B model makes the two-stage hybrid architecture 3-4x more responsive.
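The two-stage flow can be sketched as follows. This is an illustration of the control flow only: `vector_search` and `deep_analyze` are hypothetical callables, not real pplx or PageIndex APIs, and `min_score` and the toy data are placeholders invented so the example runs.

```python
from typing import Callable, List, Tuple

# Both callables are hypothetical stand-ins, not real pplx or PageIndex APIs.
VectorSearch = Callable[[str, int], List[str]]  # (query, k) -> candidate doc ids
DeepAnalyze = Callable[[str, str], float]       # (query, doc id) -> relevance in [0, 1]


def hybrid_search(query: str, vector_search: VectorSearch,
                  deep_analyze: DeepAnalyze, k: int = 5,
                  min_score: float = 0.5) -> List[Tuple[str, float]]:
    """Stage 1: cheap vector recall. Stage 2: one LLM call per candidate."""
    candidates = vector_search(query, k)
    scored = [(doc_id, deep_analyze(query, doc_id)) for doc_id in candidates]
    kept = [(doc_id, score) for doc_id, score in scored if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins (made-up ids and scores) so the sketch runs end to end.
fake_index = ["wifi-security", "eye-health", "stock-market", "ai-boom", "iran-war", "mls"]
fake_scores = {"wifi-security": 1.0, "ai-boom": 0.6}

results = hybrid_search(
    "Firefox privacy features",
    vector_search=lambda q, k: fake_index[:k],
    deep_analyze=lambda q, doc_id: fake_scores.get(doc_id, 0.0),
)
print(results)
```

One design consequence is that Stage 2 can only rerank what Stage 1 recalls. If the vector stage does not place the right document in its top k, PageIndex never sees it, so k may need to be larger than 5 for queries where embedding similarity is weak.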
Conclusion
Swapping a 32B model for a 4B model in PageIndex improved nearly everything we measured: ingest speed (3.7x faster per document), query speed (1.9x faster), VRAM usage (6x less), and – counterintuitively – retrieval quality (5/5 completed queries returned relevant results vs 1/7 with the 32B model). The one regression was completion rate: 5 of 16 queries finished within the timeout, down from 7 of 16.
The key findings:
- Model size is not PageIndex's bottleneck. The constraint is the number of sequential LLM calls per query (one per document), not the speed of each call. Shrinking the model helps but does not change the O(N) scaling.
- Smaller models can be more effective at structured retrieval tasks. The 4B model's lower relevance threshold produced more useful results for RAG applications, where returning loosely related content beats returning nothing.
- The WiFi/Firefox result demonstrates genuine semantic reasoning. PageIndex connected "Firefox privacy features" to a WiFi security article – a match that all three vector providers missed. This is the kind of reasoning that justifies the approach despite its speed limitations.
- VRAM savings enable practical deployment. 3.4 GB vs 20 GB of VRAM is the difference between running PageIndex alongside other GPU services and dedicating the entire GPU to it.
- The hybrid architecture becomes more viable. At ~8 seconds per document, a vector-then-PageIndex pipeline can analyze 5 pre-filtered candidates in under a minute – slow for real-time, but workable for deep analysis features.
The question we started with – does model size matter for vectorless RAG? – has a nuanced answer: yes, but in the opposite direction from what you might expect. For PageIndex's structured tree search tasks, a 4B parameter model outperformed a 32B model on both speed and quality. The LLM is not the bottleneck; the architecture is.
Evaluated on: NVIDIA RTX 5090 (32 GB VRAM), Qwen3.5-4B via Ollama (3.4 GB), Milvus Standalone (vector providers), 18-document news corpus, 16 queries across 9 topic categories. PageIndex completed 5 of 16 queries within the 10-minute timeout.
Disclaimer: Yes, I used AI to help with the structure and flow of my write-up - I'm an engineer after all! The tests, knowledge and findings are all real-world research and can be verified and reproduced on demand.