Enterprise Search Types
When building or evaluating search for enterprise file and object stores, there are several distinct paradigms to understand:
Lexical Search (Keyword / Full-Text)
Matches documents based on exact or near-exact term occurrence. Uses inverted indexes (e.g., Elasticsearch, Solr, Lucene).
- How it works: Tokenizes text, applies stemming/stop-word removal, scores via BM25 or TF-IDF
- Strengths: Fast, deterministic, great for exact identifiers (contract numbers, SKUs, file names)
- Weaknesses: Misses synonyms, paraphrases, and intent unless synonym expansion is configured: "car" won't match "automobile"
- Best for: Log search, compliance search, known-term lookup
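The scoring step above can be sketched in a few lines. This is a minimal illustration of BM25 (using the IDF form popularized by Lucene), not how any particular engine implements it: the whitespace tokenizer stands in for a real analyzer, and the sample documents, the contract number, and the `k1`/`b` values are assumptions chosen for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; real analyzers also stem and drop stop words."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for terms in tokenized:
        tf = Counter(terms)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical sample corpus.
docs = [
    "contract 2024-118 renewal terms",
    "car maintenance schedule",
    "automobile insurance policy",
]
scores = bm25_scores("car", docs)
```

Only the second document receives a non-zero score for the query "car"; the "automobile" document scores zero, which is the synonym weakness noted above.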
Semantic Search (Vector / Embedding-based)
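Converts text into high-dimensional vector embeddings and retrieves the documents whose vectors lie closest to the query vector. Closeness is measured with a metric such as cosine similarity, and the lookup is usually served by an approximate nearest neighbor (ANN) index rather than an exhaustive scan.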
- How it works: Encoder model (e.g., BGE, E5, OpenAI Ada) embeds queries and documents; results ranked by vector proximity in a store like Milvus, Qdrant, or pgvector
- Strengths: Captures intent and meaning, handles synonyms and paraphrases, and supports multilingual queries when the encoder is multilingual
- Weaknesses: Can surface semantically close but contextually irrelevant results; less precise on exact terms
- Best for: Natural language queries, knowledge discovery, "find docs like this"
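To make the ranking step concrete, here is a small sketch of exact (brute-force) cosine-similarity search. The three-dimensional vectors and document ids are invented for illustration; a real encoder would produce hundreds or thousands of dimensions, and a vector store would replace the linear scan with an ANN index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Exact nearest neighbours by cosine similarity; ANN indexes approximate this."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical embeddings: the first two documents point in a similar direction.
doc_vecs = {
    "car_manual": [0.9, 0.1, 0.0],
    "automobile_policy": [0.8, 0.2, 0.1],
    "hr_handbook": [0.0, 0.1, 0.9],
}
query_vec = [0.85, 0.15, 0.05]
results = top_k(query_vec, doc_vecs)
```

Both vehicle-related documents are returned even though they share no keyword, while the unrelated handbook is left out.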
Hybrid Search
Combines lexical and semantic results, typically by fusing the two rankings with reciprocal rank fusion (RRF) or by blending normalized scores with weights.
- How it works: Run both pipelines in parallel, merge ranked result lists
- Strengths: Handles both exact terms and conceptual intent in a single result list
- Weaknesses: More infrastructure complexity; tuning the blend ratio requires experimentation
- Best for: General-purpose enterprise search where query patterns are unpredictable
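The merge step can be sketched with reciprocal rank fusion, which needs only each document's rank in each list and so avoids normalizing scores from two different systems. The constant `k=60` is a commonly used default, and the document ids and input rankings below are made up for the example.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists: each document earns 1 / (k + rank) per list."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of the two pipelines, best result first.
lexical = ["doc_a", "doc_b", "doc_c"]
semantic = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([lexical, semantic])
# fused == ["doc_a", "doc_c", "doc_b", "doc_d"]
```

Documents that appear in both lists rise to the top, while documents found by only one pipeline are still kept. A weighted blend would instead normalize the raw scores and combine them with a tunable ratio.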
Faceted / Structured Search
Filters results using metadata attributes rather than content — think taxonomy-driven navigation.
- How it works: Pre-indexed metadata fields (owner, date, file type, department, classification label) applied as filter constraints
- Strengths: Highly precise, deterministic, respects data governance boundaries
- Weaknesses: Requires rich, consistent metadata; doesn't help with content discovery
- Best for: Document management systems, DAMs, compliance portals
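As a rough sketch of how facet constraints behave, the example below filters in-memory records by exact metadata match and counts values for one facet. The `FileRecord` type, its fields, and the sample files are hypothetical; a production system would push these filters down to the index.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileRecord:
    name: str
    owner: str
    department: str
    file_type: str

def facet_filter(records, **facets):
    """Return records whose metadata equals every requested facet value."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in facets.items())
    ]

def facet_counts(records, field):
    """Count records per value of one facet, as a navigation sidebar would show."""
    counts = {}
    for r in records:
        value = getattr(r, field)
        counts[value] = counts.get(value, 0) + 1
    return counts

# Hypothetical metadata records.
records = [
    FileRecord("msa.pdf", "alice", "legal", "pdf"),
    FileRecord("nda.docx", "bob", "legal", "docx"),
    FileRecord("q3_report.pdf", "carol", "finance", "pdf"),
]
legal_pdfs = facet_filter(records, department="legal", file_type="pdf")
by_department = facet_counts(records, "department")
```

The filter is deterministic but only as good as the metadata: a file tagged with the wrong department simply never appears.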
Graph / Relational Search
Traverses relationships between entities — files linked to projects, authors, or topics.
- How it works: Knowledge graph or property graph (Neo4j, Neptune) stores entity relationships; queries traverse edges
- Strengths: Surfaces non-obvious connections ("all files touched by this contractor related to Project X")
- Weaknesses: Expensive to build and maintain; requires entity extraction pipeline
- Best for: Legal discovery, M&A due diligence, knowledge graph-augmented RAG
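The contractor example above amounts to a two-hop traversal. The sketch below models the graph as a plain list of (source, relation, target) triples; the entity names and the `TOUCHED` / `PART_OF` relation labels are assumptions for illustration, and a graph database would express the same pattern in its own query language.

```python
# Hypothetical edges: (source, relation, target).
edges = [
    ("contractor_1", "TOUCHED", "spec.docx"),
    ("contractor_1", "TOUCHED", "budget.xlsx"),
    ("spec.docx", "PART_OF", "project_x"),
    ("budget.xlsx", "PART_OF", "project_y"),
    ("employee_2", "TOUCHED", "plan.pdf"),
    ("plan.pdf", "PART_OF", "project_x"),
]

def neighbors(edges, source, relation):
    """Targets reachable from `source` over one edge with the given relation."""
    return {dst for src, rel, dst in edges if src == source and rel == relation}

def files_touched_in_project(edges, person, project):
    """Two-hop query: person -TOUCHED-> file -PART_OF-> project."""
    return sorted(
        f for f in neighbors(edges, person, "TOUCHED")
        if project in neighbors(edges, f, "PART_OF")
    )

hits = files_touched_in_project(edges, "contractor_1", "project_x")
# hits == ["spec.docx"]
```

Neither a keyword nor a vector query would find this file unless the contractor and project names happened to appear in its text.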
ACL-Aware / Permission-Filtered Search
Not a search modality per se, but a critical constraint layer in enterprise contexts.
- How it works: Each candidate result is checked against the querying user's permissions (group memberships, file ACLs, sensitivity labels) before it is returned
- Why it matters: Without this, semantic or lexical search can leak sensitive documents across trust boundaries
- Implementation: Can be enforced at index time (separate indexes per group, or permission fields stored on each document) or at query time (a filter built from the user's identity, applied during or after retrieval)
- Best for: Any multi-tenant or role-segmented environment — which is essentially all enterprise deployments
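A query-time post-filter can be sketched as a set intersection between the user's groups and each document's allowed groups. The ACL mapping, group names, and file names here are hypothetical, and the sketch assumes a deny-by-default policy for documents with no ACL entry.

```python
def filter_by_acl(results, user_groups, acl):
    """Keep ranked results the user may read; documents without an ACL are denied."""
    allowed = []
    for doc_id in results:
        permitted_groups = acl.get(doc_id, set())
        if permitted_groups & user_groups:  # at least one group in common
            allowed.append(doc_id)
    return allowed

# Hypothetical ACLs: document id -> groups allowed to read it.
acl = {
    "salary_bands.xlsx": {"hr"},
    "handbook.pdf": {"hr", "all_staff"},
    "roadmap.pptx": {"product"},
}
ranked_results = ["salary_bands.xlsx", "handbook.pdf", "roadmap.pptx", "unlabeled.txt"]
visible = filter_by_acl(ranked_results, {"all_staff", "product"}, acl)
# visible == ["handbook.pdf", "roadmap.pptx"]
```

The filter preserves the original ranking order. Because it can shrink the result list, a post-filter is often paired with fetching more candidates than the page size.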
Summary Table
| Type | Signal Used | Precision | Recall | Infrastructure |
|---|---|---|---|---|
| Lexical | Exact terms | High | Low | Elasticsearch / Solr |
| Semantic | Meaning / vectors | Medium | High | Milvus / pgvector |
| Hybrid | Both | High | High | Combined stack |
| Faceted | Metadata | Very High | Low | Any indexed store |
| Graph | Relationships | Contextual | Variable | Neo4j / Neptune |
| ACL-Aware | Identity + permissions | — | — | IAM + any of above |