Comparing Embedding Models for RAG: A Side-by-Side Evaluation of Ollama/Qwen3 and Perplexity pplx-embed

A real-world evaluation of three embedding providers on a diverse 72-document corpus, running locally on an NVIDIA RTX 5090.

When building a Retrieval-Augmented Generation (RAG) system, the embedding model is one of the most consequential decisions you will make. It determines how your documents are represented in vector space, and ultimately whether a user's query finds the right information or something tangentially related. Yet most embedding benchmarks evaluate models on curated academic datasets – not on the messy, heterogeneous documents that real systems have to handle.

We wanted to know: does your choice of embedding model actually change the results your RAG system returns? And if so, how?

To find out, we built a RAG pipeline that supports hot-swapping between three embedding providers, kept everything else identical, and ran the same 16 queries against the same 72-document corpus. This article presents what we found.

Note: All results in this article come from a real evaluation run against real documents, not synthetic benchmarks or cherry-picked examples.

The Setup

Architecture

Our RAG system follows a straightforward pipeline:

Document Upload --> Docling Extraction --> HybridChunker (512 tokens) --> Embedding --> Milvus
Query           --> Embedding --> Milvus COSINE Search --> Ranked Results

The key design decision: extraction and chunking are identical across all providers. Every document goes through the same Docling extractor and the same HybridChunker (using the Qwen/Qwen3-Embedding-0.6B tokenizer, 512-token max, merge_peers=True). The only variable is the embedding step. This gives us a clean comparison – any difference in retrieval quality is attributable to the embedding model alone.

Three Providers

We evaluated three embedding configurations, all producing 1024-dimensional float32 vectors:

1. ollama – Qwen3-Embedding-0.6B via Ollama

The Qwen3 embedding model served through Ollama's HTTP API. This is the simplest deployment option: pull the model, start Ollama, and call the /api/embed endpoint.

# Ollama embedding via the Python client
import ollama

client = ollama.Client(host="http://localhost:11434")
# texts is the list of chunk strings to embed
response = client.embed(model="qwen3-embedding:0.6b", input=texts)
embeddings = response["embeddings"]

2. pplx – perplexity-ai/pplx-embed-v1-0.6B via SentenceTransformers

Perplexity's standard embedding model loaded directly into GPU memory via SentenceTransformers. No intermediate HTTP API – the model runs in the same Python process as the application.

# pplx embedding via SentenceTransformers
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("perplexity-ai/pplx-embed-v1-0.6B", trust_remote_code=True)
# Encode in-process, then convert to plain float32 lists for Milvus insertion
embeddings = model.encode(texts, convert_to_numpy=True).astype(np.float32).tolist()

3. pplx-ctx – perplexity-ai/pplx-embed-context-v1-0.6B (contextualized)

Perplexity's contextualized embedding model, which applies bidirectional attention across all chunks during ingestion. The idea is that each chunk's embedding becomes aware of its surrounding context – a technique sometimes called "late chunking." Crucially, at query time, both pplx and pplx-ctx use the standard pplx model for the query embedding. The contextualized model is only used during document ingestion.

def embed_query(query: str, provider: str = "ollama") -> list[float]:
    if provider in ("pplx", "pplx-ctx"):
        # Both use the standard pplx model for queries
        return _pplx_embed_query(query)
    # (Ollama branch not shown in this excerpt)

Infrastructure

All three providers run locally on the same machine:

  • GPU: NVIDIA RTX 5090 (32 GB VRAM)
  • Vector DB: Milvus (Standalone, via Docker Compose)
  • Index: IVF_FLAT with nlist=128, COSINE metric, nprobe=16 at search time
  • Framework: FastAPI with per-provider Milvus collections (rag_ollama, rag_pplx, rag_pplx_ctx)

Each provider gets its own Milvus collection with an identical schema:

from pymilvus import DataType, FieldSchema

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1024),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="document_id", dtype=DataType.VARCHAR, max_length=256),
    FieldSchema(name="filename", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="metadata", dtype=DataType.VARCHAR, max_length=65535),
]
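
The index settings listed above (IVF_FLAT, nlist=128, COSINE, nprobe=16) can be written as parameter dictionaries in the shape pymilvus expects. This is a minimal sketch rather than the project's actual code: the constant names and the collection_for helper are illustrative assumptions, and only the values and collection names come from the article.

```python
# Index and search parameters matching the settings listed above.
# Names (INDEX_PARAMS, SEARCH_PARAMS, collection_for) are hypothetical.
INDEX_PARAMS = {
    "index_type": "IVF_FLAT",
    "metric_type": "COSINE",
    "params": {"nlist": 128},   # number of coarse clusters built at index time
}
SEARCH_PARAMS = {
    "metric_type": "COSINE",
    "params": {"nprobe": 16},   # clusters scanned per query
}

# One collection per provider, as described in the Infrastructure list.
COLLECTIONS = {
    "ollama": "rag_ollama",
    "pplx": "rag_pplx",
    "pplx-ctx": "rag_pplx_ctx",
}


def collection_for(provider: str) -> str:
    """Map a provider name to its Milvus collection name."""
    try:
        return COLLECTIONS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None


# Fraction of clusters scanned per query with these settings.
probe_fraction = SEARCH_PARAMS["params"]["nprobe"] / INDEX_PARAMS["params"]["nlist"]
print(collection_for("pplx-ctx"), probe_fraction)
```

With pymilvus these dictionaries would typically be passed to Collection.create_index(field_name="embedding", index_params=INDEX_PARAMS) and to Collection.search(..., param=SEARCH_PARAMS, ...). Scanning 16 of 128 clusters means each query examines roughly an eighth of the index, so IVF_FLAT results are approximate rather than exhaustive.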

Test Methodology

Corpus Composition

We sampled 72 documents from a news corpus of 73,070 files (2 GB total), deliberately balancing across file types and topics:

Dimension | Breakdown
--- | ---
File types | 18 PDF, 18 DOCX, 18 PPTX, 18 Markdown
Topics | 9 categories: world, technology, science, business, sports, entertainment, politics, opinion, health
Documents per topic | 2 per file type per topic (8 documents per topic)
Source data size | 15.67 MB total (10.03 MB DOCX, 3.50 MB PDF, 2.01 MB PPTX, 0.13 MB MD)

This gives us diversity in both content and format. PPTX files are particularly interesting because they tend to produce very short documents (1-2 chunks with minimal running text), while DOCX and PDF files produce longer, richer chunks. This is realistic: real-world RAG systems ingest messy data, not pristine article text.
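
The article does not describe how the sample was drawn, so the following is only one plausible way to get a balanced sample of 2 documents per file type per topic. The function name, the tuple layout, and the synthetic stand-in corpus are all assumptions made for illustration.

```python
import random
from collections import defaultdict

FILE_TYPES = ["pdf", "docx", "pptx", "md"]
TOPICS = ["world", "technology", "science", "business", "sports",
          "entertainment", "politics", "opinion", "health"]


def stratified_sample(files, per_cell=2, seed=0):
    """Pick `per_cell` files from every (file_type, topic) combination.

    `files` is an iterable of (path, file_type, topic) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    cells = defaultdict(list)
    for path, file_type, topic in files:
        cells[(file_type, topic)].append(path)

    sample = []
    for key in sorted(cells):
        paths = sorted(cells[key])
        if len(paths) < per_cell:
            raise ValueError(f"not enough files for {key}")
        sample.extend(rng.sample(paths, per_cell))
    return sample


# Synthetic stand-in corpus: 5 fake files per (topic, file type) cell.
corpus = [(f"{topic}/{i}.{ft}", ft, topic)
          for topic in TOPICS for ft in FILE_TYPES for i in range(5)]
picked = stratified_sample(corpus)
print(len(picked))  # 4 file types x 9 topics x 2 = 72
```

Fixing the seed matters more than the sampling method itself: it lets a later run ingest exactly the same 72 files.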

Chunking

All 72 documents produced 470 chunks per provider (1,410 total across all three), using Docling's HybridChunker with the Qwen3 tokenizer:

from docling.chunking import HybridChunker

# Same chunker configuration for every provider
chunker = HybridChunker(
    tokenizer="Qwen/Qwen3-Embedding-0.6B",
    max_tokens=512,
    merge_peers=True,
)

Upload Performance

Ingesting all 72 files into all three providers took 755 seconds (~12.6 minutes). This includes Docling extraction (which uses OCR for PDFs and parses PPTX/DOCX structures), chunking, embedding with each provider, and Milvus insertion. Extraction and chunking run once per document; only the embedding step is repeated per provider.

Query Design

We designed 16 queries spanning all 9 topic categories, ranging from specific factual queries ("MLS rule changes") to more abstract ones ("celestial hamburgers", "tech changing media"). Some queries have a clear correct document in the corpus; others are intentionally ambiguous to test how models handle edge cases.

Results

Storage Comparison

All three providers used identical storage: 2.343 MB per collection, or approximately 5,227 bytes per chunk.

Collection | Chunks | Storage | Per Chunk
--- | --- | --- | ---
rag_ollama | 470 | 2.343 MB | 5,227 bytes
rag_pplx | 470 | 2.343 MB | 5,227 bytes
rag_pplx_ctx | 470 | 2.343 MB | 5,227 bytes
Total | 1,410 | 7.028 MB | --

This makes sense: the vectors are the same dimensionality (1024 x 4 bytes = 4,096 bytes per vector), and the text and metadata fields are identical across providers. Storage is dominated by the float32 vectors plus the stored text, not by any property of the embedding model itself.
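
The arithmetic behind these figures can be checked in a few lines. One assumption is needed: the reported "MB" reproduces the 5,227-byte figure only if it is read as MiB (2**20 bytes), so the sketch uses that interpretation.

```python
DIM = 1024                 # embedding dimensionality
BYTES_PER_FLOAT32 = 4
CHUNKS = 470               # chunks per collection
COLLECTION_MIB = 2.343     # reported size; assumed to mean MiB (2**20 bytes)

vector_bytes = DIM * BYTES_PER_FLOAT32           # bytes used by one vector
per_chunk = COLLECTION_MIB * 2**20 / CHUNKS      # average bytes per stored chunk
non_vector = per_chunk - vector_bytes            # text, metadata, and overhead
vector_share = vector_bytes / per_chunk          # fraction taken by the vector

print(vector_bytes, round(per_chunk), round(non_vector), round(vector_share, 2))
```

Under that reading, roughly 4,096 of every 5,227 bytes are the vector itself, and the remaining ~1,131 bytes cover the stored text, metadata, and any per-row overhead.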

For reference, the 7.028 MB of Milvus storage (three copies of the chunked corpus, one per provider) represents 15.67 MB of source data – roughly a 2.2:1 ratio overall, or about 6.7:1 for a single 2.343 MB collection. This is not really "compression" in the traditional sense; the Milvus collections store chunked text plus vectors, while the source files include images, formatting, and other non-text content.

Takeaway: At the same dimensionality, your choice of embedding model has zero impact on storage costs.

Retrieval Quality

Here is where things get interesting. The table below shows the top-1 result for each query across all three providers.

Important: Absolute scores are NOT directly comparable across models. Each model produces embeddings in a different vector space, so a score of 0.55 from Ollama and 0.55 from pplx do not imply the same degree of semantic similarity. What IS comparable is which documents are surfaced and the relative ranking quality.

Query | Ollama | pplx | pplx-ctx | Correct Doc Found?
--- | --- | --- | --- | ---
AI military risks | 0.618 | 0.513 | 0.466 | All three found Anthropic/Pentagon articles
Silent chip defects | 0.547 | 0.202 | 0.224 | None correct -- Ollama returned Premier League, pplx/pplx-ctx also off-topic
Firefox privacy features | 0.470 | 0.228 | 0.279 | None at #1 -- Ollama got HackerNoon, pplx got Apple Watch, pplx-ctx got HackerNoon
iPad Air M4 chip | 0.538 | 0.353 | 0.328 | All found Apple Watch Ultra (iPad Air not in corpus; Apple Watch was closest)
Clouds in Antarctica | 0.405 | 0.166 | 0.223 | Ollama wrong (OpenClaw cloud VM!), pplx/pplx-ctx found Earth habitability
Waking up early | 0.629 | 0.689 | 0.597 | All correct -- pplx had highest score
Celestial hamburgers | 0.460 | 0.190 | 0.282 | Ollama wrong (OpenAI defense deal), pplx/pplx-ctx found closest space article
Inherited diseases | 0.516 | 0.507 | 0.465 | All correct
Ukraine war drones | 0.485 | 0.312 | 0.295 | Ollama wrong (OpenClaw), pplx/pplx-ctx found Anthropic/Pentagon articles
Iran threats | 0.530 | 0.520 | 0.473 | All correct
Trump AI executive orders | 0.573 | 0.517 | 0.471 | All found Anthropic articles
MLS rule changes | 0.775 | 0.606 | 0.567 | All correct
Warner Bros sale | 0.585 | 0.528 | 0.505 | Ollama found Netflix article; pplx/pplx-ctx found actual Warner Bros article
Banks data centers | 0.662 | 0.684 | 0.640 | All correct -- pplx had highest score
AI safety autonomous weapons | 0.565 | 0.410 | 0.369 | All found Anthropic articles
Tech changing media | 0.535 | 0.285 | 0.345 | Ollama/pplx-ctx found Warner Bros; pplx found journalism AI article

Average Scores

Provider | Avg Top-1 Score
--- | ---
ollama | 0.556
pplx | 0.419
pplx-ctx | 0.408
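
These averages can be reproduced directly from the top-1 scores in the retrieval table. The sketch below simply transcribes those 16 rows in table order and takes the mean per provider; the variable names are illustrative.

```python
# Top-1 scores per provider, in the same query order as the retrieval table.
top1 = {
    "ollama":   [0.618, 0.547, 0.470, 0.538, 0.405, 0.629, 0.460, 0.516,
                 0.485, 0.530, 0.573, 0.775, 0.585, 0.662, 0.565, 0.535],
    "pplx":     [0.513, 0.202, 0.228, 0.353, 0.166, 0.689, 0.190, 0.507,
                 0.312, 0.520, 0.517, 0.606, 0.528, 0.684, 0.410, 0.285],
    "pplx-ctx": [0.466, 0.224, 0.279, 0.328, 0.223, 0.597, 0.282, 0.465,
                 0.295, 0.473, 0.471, 0.567, 0.505, 0.640, 0.369, 0.345],
}

# Mean top-1 score per provider, rounded to three decimals as in the table.
averages = {name: round(sum(scores) / len(scores), 3)
            for name, scores in top1.items()}
print(averages)
```

Because the three score scales are not comparable, these means describe each model's typical score level, not its retrieval quality.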

Query Latency

Provider | Avg (ms) | Min (ms) | Max (ms)
--- | --- | --- | ---
ollama | 56 | 48 | 72
pplx | 33 | 21 | 60
pplx-ctx | 32 | 20 | 102

Note that pplx and pplx-ctx use the same standard model at query time, hence their near-identical latency. The gap to Ollama is most likely HTTP API overhead: Ollama requires a network round-trip even when running locally, while SentenceTransformers encodes directly in-process. Differences between the two inference runtimes may also contribute.
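
The article does not show how latency was measured. A minimal timing harness along the following lines would produce avg/min/max figures of the same shape; time_calls and fake_embed are hypothetical names, and the stub merely stands in for a real embed-and-search call.

```python
import statistics
import time


def time_calls(fn, inputs, warmup=1):
    """Call fn once per input and report latency statistics in milliseconds."""
    # Warm-up calls are excluded so one-off costs (model load, connection
    # setup) do not distort the measurements.
    for item in inputs[:warmup]:
        fn(item)

    samples_ms = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "avg_ms": statistics.mean(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }


def fake_embed(query):
    """Stand-in for a real query embedding call."""
    time.sleep(0.002)
    return [0.0] * 1024


stats = time_calls(fake_embed, ["q1", "q2", "q3"])
print(stats)
```

Timing the embedding call and the Milvus search separately would show how much of the Ollama gap comes from the HTTP round-trip rather than from the model itself.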

Analysis

The False Confidence Problem

The most striking finding is not about average scores – it is about failure modes.

Ollama/Qwen3 produced the highest average score (0.556 vs 0.419 for pplx), but it also produced the most false positives. In 4 out of 16 queries, Ollama's top result was from a completely unrelated document:

  • "Clouds in Antarctica" matched an article about OpenClaw's cloud VMs (keyword overlap: "cloud")
  • "Celestial hamburgers" matched an OpenAI defense deal article (unclear why)
  • "Ukraine war drones" matched OpenClaw again (possibly "drone" or military keyword overlap)
  • "Silent chip defects" matched a Premier League article (possibly metadata overlap)

In each case, Ollama returned these wrong results with relatively high confidence scores (0.40-0.55). This is the worst kind of failure in a RAG system: the model is not just wrong, it is confidently wrong. A downstream LLM will happily synthesize an answer from this irrelevant context, producing a fluent but incorrect response.
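
The results table makes it possible to check whether a score threshold alone could filter out these failures. The sketch below groups Ollama's top-1 scores by the table's own verdicts ("All correct" rows versus the four unrelated hits) and applies the lowest threshold that rejects every unrelated hit; the helper name is illustrative.

```python
# Ollama top-1 scores from the results table, grouped by the table's verdicts.
ollama_correct = {
    "Waking up early": 0.629,
    "Inherited diseases": 0.516,
    "Iran threats": 0.530,
    "MLS rule changes": 0.775,
    "Banks data centers": 0.662,
}
ollama_unrelated = {
    "Silent chip defects": 0.547,
    "Clouds in Antarctica": 0.405,
    "Celestial hamburgers": 0.460,
    "Ukraine war drones": 0.485,
}


def rejected_by(threshold, scores):
    """Return the queries whose top-1 score falls below the threshold."""
    return sorted(q for q, s in scores.items() if s < threshold)


# Smallest cut-off (to three decimals) that rejects every unrelated hit.
threshold = max(ollama_unrelated.values()) + 0.001
lost = rejected_by(threshold, ollama_correct)
print(lost)  # ['Inherited diseases', 'Iran threats']
```

On this data, any cut-off strict enough to drop all four unrelated hits also drops two correct answers, so a similarity threshold cannot cleanly separate good results from confident misses.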

The pplx models, while returning lower absolute scores, tended to surface more semantically relevant content even when they did not find the exact right document. When the query was about "clouds in Antarctica," pplx models returned articles about Earth's habitability – wrong, but at least in the right domain (science/environment vs. cloud computing).

Keyword Sensitivity vs. Semantic Understanding

The pattern suggests that Qwen3 may be more sensitive to lexical overlap than pplx. The word "cloud" in a query about Antarctic weather matches "cloud" in a cloud computing article. The "chip" query is harder to explain: its top hit was a Premier League article, possibly through overlap with metadata rather than body text. Leaning on token-level features rather than compositional semantics is a well-known failure mode of embedding models.

The pplx models appear to perform more genuine semantic matching. On the two queries where pplx's top-1 score exceeded Ollama's ("waking up early" and "banks data centers"), the query closely matched the body content of the target document rather than its title or metadata keywords. Because scores are not directly comparable across models, this is suggestive rather than conclusive, but it is consistent with pplx matching meaning rather than surface-level word patterns.

Similarly, for the "Warner Bros sale" query, Ollama's top result was a Netflix staffing article (entertainment industry keyword overlap), while both pplx models found the actual Warner Bros article.

The Contextualized Model Underperformed

We expected pplx-ctx to outperform standard pplx – after all, it uses bidirectional attention across chunks during ingestion, so each chunk's embedding should be enriched with document-level context. In practice, pplx-ctx performed roughly on par with or slightly below standard pplx (average score 0.408 vs 0.419).

Why? Our best hypothesis: the chunks were already fairly self-contained. With 512-token HybridChunker windows and merge_peers=True, each chunk carries enough internal context that the surrounding-chunk signal does not add much. The contextualized model may show greater benefits with:

  • Shorter chunks (128-256 tokens) where individual chunks lack self-sufficiency
  • Highly structured documents where meaning bleeds across section boundaries
  • Documents with extensive cross-references or coreference chains

With our relatively generous chunk size and a chunker that already merges adjacent small segments, the contextualization had little room to improve.

Provider Strengths and Weaknesses

Ollama / Qwen3-Embedding-0.6B

Strengths:

  • Highest absolute similarity scores (convenient for threshold-based filtering, though off-topic hits still scored 0.40-0.55)
  • Strong performance on queries with direct keyword match to content
  • Simplest deployment: ollama pull qwen3-embedding:0.6b and go
  • Broad model ecosystem – easy to swap for other Ollama-hosted models

Weaknesses:

  • More false positives due to keyword sensitivity
  • Higher query latency (56ms avg vs 32-33ms for pplx)
  • HTTP API overhead compared to in-process inference
  • Confidently wrong answers are dangerous in production RAG

pplx-embed-v1-0.6B

Strengths:

  • Better semantic matching for content-level queries
  • Fewer false positives – lower scores when unsure (more calibrated)
  • Faster query embedding (~40% faster than Ollama)
  • In-process inference, no network overhead

Weaknesses:

  • Lower absolute scores (may require adjusting similarity thresholds)
  • Requires SentenceTransformers and trust_remote_code=True
  • Model loaded into GPU memory alongside the application

pplx-embed-context-v1-0.6B

Strengths:

  • Document-aware embeddings via late chunking (theoretically richer representations)
  • Same query-time model as standard pplx (no additional latency at search time)

Weaknesses:

  • Marginal or no improvement over standard pplx in our test
  • Requires loading a second model for ingestion (additional GPU memory)
  • Benefits likely depend on chunk size and document structure

Opportunities for Improvement

This evaluation reveals several avenues worth exploring:

1. Matryoshka Dimensionality Reduction

Both pplx models support Matryoshka representations, meaning you can truncate the 1024-dim vectors to lower dimensions (e.g., 256 or 512) with graceful quality degradation. Our system already supports this:

# In config
pplx_truncate_dim: int | None = None  # Set to 256 or 512 for reduced storage

# In embedding
if settings.pplx_truncate_dim is not None:
    kwargs["truncate_dim"] = settings.pplx_truncate_dim

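This could cut vector storage by 2-4x with minimal retrieval quality loss – a worthwhile trade-off for large-scale deployments. The stored text and metadata are unaffected, so the saving per collection is smaller than the saving per vector.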
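
To make the trade-off concrete, here is a minimal pure-Python sketch of Matryoshka-style truncation together with a rough per-chunk storage estimate. It is not the project's implementation: the function names are hypothetical, and the 1,131-byte non-vector figure is an assumption derived from the reported 5,227 bytes per chunk minus the 4,096-byte vector.

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


def per_chunk_bytes(dim, non_vector_bytes=1131):
    """Estimated bytes per stored chunk: float32 vector plus text/metadata.

    non_vector_bytes is an assumption (5,227 reported - 4,096 vector bytes).
    """
    return dim * 4 + non_vector_bytes


for dim in (1024, 512, 256):
    ratio = per_chunk_bytes(1024) / per_chunk_bytes(dim)
    print(dim, per_chunk_bytes(dim), round(ratio, 2))
```

Under these assumptions, truncating to 256 dimensions shrinks the vector 4x but the whole row only about 2.4x, because the stored text does not shrink. Renormalizing is harmless under the COSINE metric and becomes necessary if the index is switched to inner product.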

2. Hybrid Search (Dense + Sparse)

None of these models handle keyword-specific queries well (e.g., "Firefox privacy features" was missed by all three). Combining dense vector search with sparse retrieval (BM25 or SPLADE) could capture both semantic similarity and exact keyword matches. Milvus supports hybrid search natively.
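
One common way to combine a dense ranking with a sparse one is reciprocal rank fusion (RRF), which merges result lists by rank position and so sidesteps the problem of incomparable score scales. The sketch below is a generic illustration, not something the evaluated system implements; the document IDs are made up and k=60 is the conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) for every list it appears in,
    with rank starting at 1; higher totals rank first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["doc_a", "doc_b", "doc_c"]    # e.g. from vector search
sparse = ["doc_c", "doc_a", "doc_d"]   # e.g. from BM25
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists rise to the top, which is the behaviour wanted for queries like "Firefox privacy features" where an exact keyword match should reinforce a weak dense signal.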

3. Re-ranking

A cross-encoder re-ranker applied to the top-k results from the initial vector search could significantly improve precision. Retrieve 20-50 candidates, then re-rank with a more expensive model that sees query and document together.
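
The retrieve-then-re-rank flow can be sketched independently of any particular model. In the example below the scoring function is a deliberately crude token-overlap stand-in so the code runs without a model download; in practice it would be replaced by a cross-encoder that scores each (query, passage) pair. The function names and candidate texts are invented for illustration.

```python
from typing import Callable, Sequence


def rerank(query: str,
           candidates: Sequence[str],
           score_fn: Callable[[str, str], float],
           top_k: int = 5) -> list[str]:
    """Re-order candidate passages by a pairwise relevance score."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


def token_overlap(query: str, text: str) -> float:
    """Toy scorer: fraction of query tokens that also appear in the text."""
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0


# Made-up candidates standing in for the top results of a vector search.
candidates = [
    "OpenClaw launches cloud VMs",
    "Clouds over Antarctica are thinning",
    "MLS announces rule changes",
]
best = rerank("clouds in Antarctica", candidates, token_overlap, top_k=1)
print(best)  # ['Clouds over Antarctica are thinning']
```

The design point is that the re-ranker sees query and passage together, so it can reject a candidate that merely sits nearby in embedding space.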

4. Chunk Size Tuning

Our 512-token chunks may be too large for some document types (especially PPTX files that produce 1-2 chunks of minimal text) and too small for others. Adaptive chunking based on document structure could improve results.

5. Larger Test Corpus and Evaluation Set

72 documents and 16 queries is enough to spot trends, but not enough for statistically rigorous conclusions. A larger evaluation with hundreds of queries and ground-truth relevance labels would give higher-confidence results.
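
With ground-truth labels in place, standard metrics such as recall@k and mean reciprocal rank (MRR) replace the manual "correct doc found?" judgement. The following is a minimal sketch with made-up document IDs, not results from this evaluation.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Illustrative data: three queries with hypothetical IDs and labels.
runs = [
    (["d1", "d2", "d3"], {"d1"}),   # relevant doc at rank 1
    (["d4", "d5", "d6"], {"d5"}),   # relevant doc at rank 2
    (["d7", "d8", "d9"], {"d0"}),   # relevant doc not retrieved
]
print(mean_reciprocal_rank(runs))                         # 0.5
print(recall_at_k(["d1", "d2", "d3"], {"d2", "d9"}, 2))   # 0.5
```

Both metrics depend only on ranks, so they can be compared across providers even though the raw similarity scores cannot.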

6. Query Prefixes and Instruction Tuning

Some embedding models respond to query prefixes (e.g., "search_query: ..." vs "search_document: ..."). We did not use instruction prefixes in this evaluation; adding them might shift the results, particularly for Qwen3 which supports instruction-tuned embedding.
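
As an illustration of what an instruction prefix looks like, the sketch below wraps a query in the "Instruct: ... Query:..." format used in the Qwen3-Embedding model card, while leaving documents unprefixed. Treat the exact format and the task wording as assumptions to verify against the model card; whether Ollama applies any such template on its own was not checked here, and the function names are hypothetical.

```python
# Assumed default task description; adjust to the retrieval task at hand.
DEFAULT_TASK = ("Given a web search query, retrieve relevant passages "
                "that answer the query")


def format_query(query: str, task: str = DEFAULT_TASK) -> str:
    """Wrap a query in an instruction prefix (assumed Qwen3-Embedding format)."""
    return f"Instruct: {task}\nQuery:{query}"


def format_document(text: str) -> str:
    """Documents are embedded as-is, without an instruction prefix."""
    return text


print(format_query("MLS rule changes"))
```

If prefixes are introduced, they only need to be applied on the query side, so the existing collections would not have to be re-ingested to test them.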

Recommendations

For prototyping and simple deployments: Start with Ollama/Qwen3. It is the easiest to set up, produces high scores that are convenient to threshold (bearing in mind that off-topic hits also scored 0.40-0.55), and works well for queries with strong keyword overlap with target documents.

For production systems where precision matters: Consider pplx-embed. Its lower false-positive rate and better semantic calibration make it more reliable when wrong answers have real consequences. The lower absolute scores are not a problem – they just mean your similarity thresholds need adjustment.

For the contextualized model: Hold off unless your use case specifically involves short chunks or highly cross-referential documents. At 512-token chunk sizes with merge_peers enabled, the standard pplx model performs just as well without the additional model loading overhead.

For best results: Do not rely on any single embedding model. Implement re-ranking, hybrid search, or a compare-and-select strategy. Our system supports multi-provider queries out of the box:

def compare_query(
    query_text: str,
    top_k: int = 5,
    providers: list[str] | None = None,
) -> CompareResponse:
    """Search the same query across multiple providers."""
    # Default to all three providers; iterating over None would raise a TypeError
    if providers is None:
        providers = ["ollama", "pplx", "pplx-ctx"]
    results: dict[str, SearchResponse] = {}
    for provider in providers:
        results[provider] = query(query_text, top_k=top_k, provider=provider)
    return CompareResponse(query=query_text, results=results)

This lets you run the same query against all providers and choose the best result – useful during evaluation and potentially in production for high-stakes queries.

Conclusion

Choosing an embedding model for RAG is not just about benchmark leaderboard scores. Our evaluation reveals that the model with the highest similarity scores (Ollama/Qwen3) also produced the most false positives, while the model with lower scores (pplx-embed) delivered more semantically accurate retrieval.

The key findings:

  1. Higher scores do not mean better retrieval. Ollama averaged 0.556 vs pplx's 0.419, but Ollama returned wrong documents with high confidence in 25% of queries.
  2. Keyword sensitivity is a real risk. Models that lean on lexical overlap will match "cloud computing" when you ask about "clouds in Antarctica." In a RAG system, this means your LLM receives irrelevant context and produces confidently wrong answers.
  3. Contextualized embeddings need the right conditions. Late chunking via pplx-ctx showed no measurable benefit with 512-token chunks. The technique likely requires shorter chunks or more context-dependent documents to demonstrate its value.
  4. Storage is a non-factor. At the same dimensionality, all models use identical storage. Choose based on retrieval quality, not storage cost.
  5. Latency favors in-process inference. SentenceTransformers (32-33 ms average) was roughly 40% faster than Ollama's HTTP API (56 ms average) for single-query embedding.

The embedding model is just one component of a RAG pipeline, but it is the component that determines what your system retrieves. Choose carefully, evaluate on your own data, and do not trust synthetic benchmarks over real-world testing.


Evaluated on: NVIDIA RTX 5090 (32 GB VRAM), Milvus Standalone, 72-document news corpus (PDF/DOCX/PPTX/MD), 470 chunks per provider, 16 queries across 9 topic categories.

Disclaimer: Yes, I have used AI to help with the structure and flow of my write-up - I'm an engineer after all! The tests, knowledge and findings are all real-world research and can be verified and reproduced on demand.