Does Model Size Matter for Vectorless RAG? Testing PageIndex with a 4B Parameter Model

Kirk Ryan

05 Mar 2026 • 9 min read

A follow-up experiment: swapping Qwen3-32B for Qwen3.5-4B in PageIndex to measure how LLM size affects tree-based retrieval speed and quality.

In our previous evaluation, we found that PageIndex – a vectorless RAG approach that uses LLM reasoning instead of embedding similarity – delivered impressive precision but at a devastating cost: 9.6-minute average query times with Qwen3-32B. The system was accurate when it completed, but 56% of queries timed out before finishing.

The obvious question: was the 32B model overkill? PageIndex uses the LLM for two tasks that do not obviously require a large model: summarizing document sections during indexing, and matching queries against summaries during retrieval. These are structured reasoning tasks with clear instructions, not open-ended generation. A smaller model might handle them just as well – and much faster.

We swapped Qwen3-32B (20 GB) for Qwen3.5-4B (3.4 GB) and re-ran the same benchmark. The results challenge some assumptions about when model size matters.

The Experiment

What Changed

Setting	Previous Run	This Run
PageIndex model	Qwen3-32B (20 GB)	Qwen3.5-4B (3.4 GB)
Model generation	Qwen3	Qwen3.5
Vector providers	Same (ollama, pplx, pplx-ctx)	Same
Corpus	18 documents, 9 topics	18 documents, 9 topics
Queries	16 queries	16 queries
Hardware	RTX 5090 (32 GB VRAM)	Same
Timeout	10 minutes	10 minutes

Everything except the PageIndex model stayed the same. Same 16 queries, same hardware, same vector provider data, same timeout thresholds.

The model shift is significant: Qwen3.5-4B has 8x fewer parameters than Qwen3-32B, uses 6x less VRAM (3.4 GB vs 20 GB), and belongs to a newer model generation (3.5 vs 3). The question is whether those 28 billion fewer parameters cost us retrieval quality.

Corpus

The corpus comprised 18 documents (2 per topic) across 9 categories, with a mix of formats:

Format	Count	Notes
Markdown	12	News articles, opinion pieces, technical content
DOCX	2	Research summaries, reviews
PPTX	2	Sports commentary, health supplements
PDF	2	Magazine article, academic paper
Total	18	0.78 MB source data

Results

Ingest Performance

The first and most dramatic improvement is ingest speed: every document uploaded successfully with both models, but the 4B model finished the full corpus 3.6x faster.

Metric	Qwen3-32B	Qwen3.5-4B	Change
Documents ingested	18/18	18/18	Same
Total ingest time	~3,500s	961s	3.6x faster
Average per document	~194s	53s	3.7x faster
Fastest document	~7s	<1s	>7x faster
Slowest document	~459s	585s	1.3x slower

The per-document times tell the story:

Document	Format	Size	32B Time	4B Time	Speedup
1KB AI boom article	MD	0.6 KB	12s	<1s	>12x
Eye health article	MD	14 KB	214s	37s	5.8x
Stunt design Oscar	MD	10 KB	69s	9s	7.7x
WiFi security article	MD	9 KB	21s	21s	1.0x
Cortisol supplements	PPTX	6 KB	8s	<1s	>8x
American censorship	PDF	174 KB	172s	172s	1.0x

The pattern is mostly clear: the 4B model processes text faster, but PDF parsing time dominates for that format. For most of the Markdown and PPTX rows above, the speedup ranges from 5.8x to more than 12x; the WiFi security article is the exception, taking 21s with both models. For PDFs, where the bottleneck is Docling extraction and PageIndex's PDF parser rather than LLM inference, both models take similar time.
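The speedup column can be recomputed from the two timing columns as a sanity check. The sketch below is a minimal illustration, not part of the benchmark harness: the `timings` list and `speedup` helper are hypothetical names, and a "<1s" entry is treated as 1 second, so the ratio for those rows is only a lower bound.

```python
# Timings in seconds, copied from the per-document table.
# None marks a "<1s" entry for the 4B model.
timings = [
    ("1KB AI boom article", "MD", 12, None),
    ("Eye health article", "MD", 214, 37),
    ("Stunt design Oscar", "MD", 69, 9),
    ("WiFi security article", "MD", 21, 21),
    ("Cortisol supplements", "PPTX", 8, None),
    ("American censorship", "PDF", 172, 172),
]


def speedup(t32, t4):
    """Return (ratio, is_lower_bound). A sub-second 4B time is assumed to be 1s."""
    if t4 is None:
        return t32 / 1.0, True
    return t32 / t4, False


for name, fmt, t32, t4 in timings:
    ratio, is_bound = speedup(t32, t4)
    prefix = ">" if is_bound else ""
    print(f"{name:24s} {fmt:5s} {prefix}{ratio:.1f}x")
```

Grouping the rows by format makes it easier to see where LLM inference is the dominant cost and where parsing is.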

Tree Structure

The smaller model built slightly different tree structures:

Metric	Qwen3-32B	Qwen3.5-4B
Total nodes	85	68
Avg nodes per document	4.7	3.8
Storage	0.130 MB	0.130 MB

The 4B model produced 20% fewer nodes. This is expected – a smaller model generates more concise summaries and is less likely to split content into deep hierarchies. Whether this is better or worse depends on the documents; fewer nodes means faster query-time search (fewer things to evaluate per document) but potentially less granular retrieval.

Query Latency

This is the metric that matters most for PageIndex's viability.

Metric	Qwen3-32B	Qwen3.5-4B	Change
Avg query time (completed)	575s (9.6 min)	305s (5.1 min)	1.9x faster
Fastest query	545s	174s	3.1x faster
Slowest query	590s	529s	1.1x faster
Queries completed	7/16	5/16	Fewer
Queries timed out	9/16	11/16	More

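The speed improvement is real but uneven. The fastest query ("Firefox privacy features") took just 174 seconds with the 4B model, 3.1x faster than the 32B model's fastest completed query (545 seconds); the 32B model timed out on the Firefox query itself. But the timeout rate actually increased from 56% to 69%.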

Why? The 4B model is faster per LLM call (~8s vs ~30s), but it occasionally produces responses that require retries or takes longer to converge on structured JSON output. The net effect: faster when things work, but slightly less reliable at staying within the timeout window for all 18 documents.

Extrapolating from the successful queries:

Corpus Size	Qwen3-32B Est.	Qwen3.5-4B Est.
18 documents	9.6 min	5.1 min
50 documents	27 min	14 min
100 documents	53 min	28 min
500 documents	4.4 hours	2.4 hours

The smaller model cuts estimated query times roughly in half, but does not change the fundamental O(N) scaling. A 500-document corpus still takes hours, not seconds.
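The estimates in the table follow from scaling the 18-document average linearly with corpus size. A minimal sketch of that arithmetic, assuming query time is strictly proportional to document count (the function and constant names are illustrative only):

```python
BASELINE_DOCS = 18
# Average completed-query time in minutes at 18 documents, from the latency table.
AVG_MINUTES = {"Qwen3-32B": 9.6, "Qwen3.5-4B": 5.1}


def estimate_minutes(model, n_docs):
    """Linear O(N) extrapolation from the 18-document baseline."""
    return AVG_MINUTES[model] * n_docs / BASELINE_DOCS


for n_docs in (18, 50, 100, 500):
    row = []
    for model in AVG_MINUTES:
        minutes = estimate_minutes(model, n_docs)
        # Switch to hours once the estimate passes two hours, as the table does.
        text = f"{minutes / 60:.1f} h" if minutes >= 120 else f"{minutes:.1f} min"
        row.append(f"{model}: {text}")
    print(f"{n_docs:4d} docs  " + "  ".join(row))
```

Because the extrapolation uses only completed queries, it is likely optimistic: queries that timed out would pull the true average higher.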

Retrieval Quality

This is the most surprising result: the 4B model was more aggressive at finding relevant content – and found genuinely useful matches that every other provider missed.

Query	32B Result	4B Result
AI military risks	Science roundup (1.0)	Iran war article (1.0)
Silent chip defects	TIMEOUT	AI boom article (1.0)
Firefox privacy features	TIMEOUT	WiFi security article (1.0)
iPad Air M4 chip	Nothing found	TIMEOUT
Clouds in Antarctica	Nothing found	TIMEOUT
Waking up early	TIMEOUT	TIMEOUT
Celestial hamburgers	TIMEOUT	TIMEOUT
Inherited diseases	TIMEOUT	TIMEOUT
Ukraine war drones	Nothing found	TIMEOUT
Iran threats	Nothing found	Iran war article (1.0)
Trump AI executive orders	TIMEOUT	AI boom article (1.0)
MLS rule changes	Nothing found	TIMEOUT
Warner Bros sale	Nothing found	TIMEOUT
Banks data centers	TIMEOUT	TIMEOUT
AI safety autonomous weapons	TIMEOUT	TIMEOUT
Tech changing media	TIMEOUT	TIMEOUT

Of the queries that completed, the 4B model returned results for 5 out of 5. The 32B model returned results for 1 out of 7. The smaller model has a lower relevance threshold – it is more willing to identify connections between queries and documents.
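These counts can be tallied directly from the results table. The sketch below encodes each outcome as "hit" (a result was returned), "none" (completed with nothing found), or "timeout"; the encoding and the `tally` helper are illustrative, not part of the benchmark code.

```python
# (32B outcome, 4B outcome) per query, transcribed from the results table.
results = {
    "AI military risks": ("hit", "hit"),
    "Silent chip defects": ("timeout", "hit"),
    "Firefox privacy features": ("timeout", "hit"),
    "iPad Air M4 chip": ("none", "timeout"),
    "Clouds in Antarctica": ("none", "timeout"),
    "Waking up early": ("timeout", "timeout"),
    "Celestial hamburgers": ("timeout", "timeout"),
    "Inherited diseases": ("timeout", "timeout"),
    "Ukraine war drones": ("none", "timeout"),
    "Iran threats": ("none", "hit"),
    "Trump AI executive orders": ("timeout", "hit"),
    "MLS rule changes": ("none", "timeout"),
    "Warner Bros sale": ("none", "timeout"),
    "Banks data centers": ("timeout", "timeout"),
    "AI safety autonomous weapons": ("timeout", "timeout"),
    "Tech changing media": ("timeout", "timeout"),
}


def tally(column):
    """Return (completed, returned_results, timed_out) for one model's column."""
    outcomes = [pair[column] for pair in results.values()]
    timed_out = outcomes.count("timeout")
    hits = outcomes.count("hit")
    return len(outcomes) - timed_out, hits, timed_out


both_completed = [q for q, (a, b) in results.items() if "timeout" not in (a, b)]

print("32B:", tally(0))
print("4B: ", tally(1))
print("Completed on both models:", both_completed)
```

Only two queries completed on both models, so the quality comparison rests largely on different queries for each model, which is worth keeping in mind when reading the figures above.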

The Firefox Query

The standout result: for "Firefox privacy features," the 4B model found a WiFi security article discussing VPN services, browser security, and HTTPS protections. This is a genuinely relevant match – the article directly addresses browser privacy and network security, which is exactly what someone searching for "Firefox privacy features" would want to read about.

Every vector provider missed this connection entirely:

ollama: returned an eye health article (score 0.574)
pplx: returned a stock market after-hours article (score 0.210)
pplx-ctx: returned the same stock market article (score 0.271)

The LLM understood that "Firefox privacy features" relates to browser security and VPN usage. No embedding model made that semantic leap.

The 4B model even returned multiple relevant sections from the WiFi article, ranked by specificity:

"What matters more when it comes to public Wi-Fi security" (score 1.0)
"The general internet is, too" – discussing VPN connection options (score 0.9)
"Wi-Fi's built-in protections and HTTPS aren't foolproof" (score 0.8)

The Iran Threats Query

For "Iran threats," the 4B model found "The Iran War Is Jeopardizing the Entire Global Economy" – an article that discusses the geopolitical and economic impact of conflict in the Gulf. The 32B model, running on the same query, returned nothing. Meanwhile, ollama's vector search returned an article about Snoop Dogg performing in a Welsh canteen (score 0.498).

False Positives vs. False Relevance

The 4B model's lower relevance threshold does introduce a risk. For "Silent chip defects," it returned the AI boom article, reasoning that discussions of AI infrastructure and chip demand are related to semiconductor manufacturing. This is more of a stretch – the article discusses the economics of AI, not chip defects per se. A stricter model might (correctly) have returned nothing.

This reveals an interesting trade-off between model sizes: the 32B model was too conservative (returned nothing for most queries), while the 4B model was more liberal in its interpretation of relevance. For RAG applications, the 4B model's behavior is arguably more useful – returning loosely related content is better than returning nothing, as long as the downstream LLM can assess relevance in context.

Vector Provider Comparison

The vector providers performed identically to the previous benchmark (same data, same queries):

Provider	Avg Score	Avg Latency	Behavior
ollama	0.532	1,240ms	Always returns something, frequent false positives
pplx	0.207	64ms	Lower scores, better semantic matching
pplx-ctx	0.236	45ms	Similar to pplx
pageindex (4B)	1.000*	305,072ms	Returns results only when confident, high precision

*Average of completed queries only.

The latency gap between vector providers and PageIndex narrowed but remains enormous: pplx-ctx returns in 45ms, PageIndex in roughly 305,000ms. That is a difference of about 6,800x (about 4,800x against pplx at 64ms).

Analysis

Model Size Is Not the Bottleneck

The 4B model is 1.9x faster per query than the 32B model, but the fundamental problem is unchanged: PageIndex makes one LLM call per document per query. With 18 documents and ~8 seconds per call, the minimum query time is ~144 seconds (about 2.4 minutes), and the observed average of 305 seconds is roughly double that floor, consistent with the retries described earlier. Shrinking the model further would lower the per-call time, but the total still grows with every document added, because the bottleneck is the number of sequential LLM calls, not the speed of any individual call.

To put it differently: even with a hypothetical model that responds instantly, PageIndex would still need to serialize 18 separate API calls. The architecture is O(N) in the number of documents, and reducing the constant factor from 30 seconds (32B) to 8 seconds (4B) helps but does not change the scaling class.
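The sequential-call floor is simple to model. The sketch below uses the approximate per-call times quoted in this article (~30s for 32B, ~8s for 4B) and compares the resulting floor with the observed averages; the names are illustrative, and treating the gap as "overhead" is an interpretation rather than something measured.

```python
DOCS = 18
PER_CALL_SECONDS = {"Qwen3-32B": 30, "Qwen3.5-4B": 8}  # approximate, from the text
OBSERVED_AVG_SECONDS = {"Qwen3-32B": 575, "Qwen3.5-4B": 305}  # completed queries


def sequential_floor(n_docs, seconds_per_call):
    """Minimum query time when one LLM call per document runs back to back."""
    return n_docs * seconds_per_call


for model, per_call in PER_CALL_SECONDS.items():
    floor = sequential_floor(DOCS, per_call)
    overhead = OBSERVED_AVG_SECONDS[model] - floor
    print(f"{model}: floor {floor}s, observed {OBSERVED_AVG_SECONDS[model]}s, "
          f"gap {overhead}s")
```

Under these figures the 32B model runs close to its floor, while the 4B model spends about as long again on whatever lies beyond the 18 base calls.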

Smaller Models, Better Retrieval?

The counterintuitive finding – that the 4B model returned more relevant results – deserves scrutiny. Two hypotheses:

1. Lower confidence threshold: Smaller models tend to be less calibrated in their uncertainty. The 4B model may simply be more willing to say "yes, this is relevant" when the connection is indirect. For the WiFi/Firefox query, this worked brilliantly. For the chip defects/AI boom query, it was a stretch.

2. Qwen3.5 vs Qwen3 generation improvement: We are not just comparing model sizes – we are comparing model generations. Qwen3.5 (the 4B model) may have better instruction following than Qwen3 (the 32B model), producing more reliable structured JSON output and more consistent relevance judgments. The improvement may come from training data and architecture refinements, not from the parameter count.

We cannot cleanly separate these factors without testing Qwen3.5-32B (not available via Ollama at the time of testing). But the practical result is clear: for PageIndex's use case, the 4B model performed better on speed, memory footprint, and the share of completed queries that returned results, although its timeout rate was higher (69% vs 56%).

The VRAM Dividend

The 4B model uses 3.4 GB of VRAM vs 20 GB for the 32B model. On our RTX 5090 (32 GB), this is the difference between leaving room for other GPU workloads and consuming the majority of available memory. In a production scenario where PageIndex runs alongside embedding models and other GPU-intensive services, the smaller model's footprint is a significant practical advantage.

Recommendations

Use the Smallest Model That Works

Our results suggest that PageIndex's retrieval quality does not scale with model size – at least not in the direction you might expect. The 4B model found more relevant documents, built trees faster, and used a fraction of the VRAM. Unless you have evidence that your specific documents require stronger reasoning capabilities, start small.

The Fundamental Constraint Remains

Even with the 4B model, PageIndex queries take 5+ minutes for 18 documents. The approach remains impractical for corpus-level search. The recommendation from our previous article still holds: use PageIndex for deep analysis of a small, pre-filtered document set, not as a primary search engine.

A Revised Hybrid Architecture

The 4B model's performance makes the hybrid approach more practical:

Stage 1: Vector search (pplx) → top 5 candidates in ~64ms
Stage 2: PageIndex (4B) → deep analysis of 5 documents in ~40 seconds

With only 5 documents to analyze, the 4B model's per-document time of ~8 seconds gives a Stage 2 latency of approximately 40 seconds. This is long for an interactive application but acceptable for a "deep search" feature or asynchronous analysis pipeline.

Compare this to the 32B model, where Stage 2 would take ~150 seconds for the same 5 documents. The 4B model makes the two-stage hybrid architecture 3-4x more responsive.
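A minimal sketch of the two-stage flow, under stated assumptions: `vector_search` and `deep_analyze` are hypothetical callables standing in for the vector provider and for PageIndex's per-document analysis, not real APIs of either system. The stubs exist only so the example runs without a vector store or an LLM.

```python
from typing import Callable, List, Optional, Tuple


def hybrid_search(
    query: str,
    vector_search: Callable[[str, int], List[str]],
    deep_analyze: Callable[[str, str], Optional[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Stage 1 narrows the corpus; Stage 2 makes one call per candidate."""
    candidates = vector_search(query, top_k)
    scored = []
    for doc_id in candidates:
        score = deep_analyze(query, doc_id)  # None means "not relevant"
        if score is not None:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


def stage2_seconds(top_k: int, seconds_per_call: float) -> float:
    """Estimated Stage 2 latency for sequential per-document calls."""
    return top_k * seconds_per_call


# Stand-in stubs (assumptions for illustration only).
def fake_vector_search(query: str, top_k: int) -> List[str]:
    return [f"doc-{i}" for i in range(top_k)]


def fake_deep_analyze(query: str, doc_id: str) -> Optional[float]:
    return {"doc-1": 1.0, "doc-3": 0.8}.get(doc_id)


print(hybrid_search("Firefox privacy features", fake_vector_search, fake_deep_analyze))
print(stage2_seconds(5, 8), stage2_seconds(5, 30))
```

The key design choice is that Stage 2 cost depends on `top_k` rather than on corpus size, so the O(N) term is confined to the fast vector stage.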

Conclusion

Swapping a 32B model for a 4B model in PageIndex improved almost everything we measured: ingest speed (3.7x faster per document), query speed (1.9x faster), VRAM usage (6x less), and – counterintuitively – retrieval quality (5/5 completed queries returned results vs 1/7 with the 32B model). The exception was the timeout rate, which rose from 56% to 69%.

The key findings:

Model size is not PageIndex's bottleneck. The constraint is the number of sequential LLM calls per query (one per document), not the speed of each call. Shrinking the model helps but does not change the O(N) scaling.
Smaller models can be more effective at structured retrieval tasks. The 4B model's lower relevance threshold produced more useful results for RAG applications, where returning loosely related content beats returning nothing.
The WiFi/Firefox result demonstrates genuine semantic reasoning. PageIndex connected "Firefox privacy features" to a WiFi security article – a match that all three vector providers missed. This is the kind of reasoning that justifies the approach despite its speed limitations.
VRAM savings enable practical deployment. 3.4 GB vs 20 GB of VRAM is the difference between running PageIndex alongside other GPU services and dedicating the entire GPU to it.
The hybrid architecture becomes more viable. At ~8 seconds per document, a vector-then-PageIndex pipeline can analyze 5 pre-filtered candidates in under a minute – slow for real-time, but workable for deep analysis features.

The question we started with – does model size matter for vectorless RAG? – has a nuanced answer: yes, but in the opposite direction from what you might expect. For PageIndex's structured tree search tasks, a 4B parameter model outperformed a 32B model on both speed and quality. The LLM is not the bottleneck; the architecture is.

Evaluated on: NVIDIA RTX 5090 (32 GB VRAM), Qwen3.5-4B via Ollama (3.4 GB), Milvus Standalone (vector providers), 18-document news corpus, 16 queries across 9 topic categories. PageIndex completed 5 of 16 queries within the 10-minute timeout.

Disclaimer: Yes, I have used AI to help with the structure and flow of my write-up - I'm an engineer after all! The tests, knowledge and findings are all real-world research and can be verified and reproduced on demand.