How to Choose, Architect, and Operate the Semantic Search Infrastructure Behind RAG Pipelines, AI Agents, and Recommendation Engines
The vector database market grew from $1.6 billion in 2023 to a projected $10.6 billion by 2032 — a 23.5% CAGR driven almost entirely by the explosion of Retrieval-Augmented Generation (RAG) applications and enterprise AI deployments. Yet most engineering teams making vector database decisions in 2025 are choosing based on marketing materials rather than production evidence.
The core architectural insight:
The debate is not "vector database vs. relational database." Every production AI system uses both — relational stores for governance, transactions, and metadata; vector stores for semantic retrieval. The decision is where to draw the boundary between them, and for most companies, that boundary is closer to the relational side than vendor marketing suggests.
Before choosing a vector database, it is worth understanding exactly what problem the technology solves — and equally importantly, what it does not solve. Vector databases excel at one specific operation: given a query vector, find the N most similar vectors in the index. This "approximate nearest neighbor" search is the foundation of semantic retrieval.
Traditional databases can answer "find all documents where category = 'finance' and date > '2024-01-01'." They cannot answer "find the 10 documents most semantically similar to this query." Similarity search requires comparing a query vector against every indexed vector and returning the closest matches — an operation that does not map to SQL's equality and range predicates.
| Query Type | SQL Database | Vector Database |
|---|---|---|
| "Find user with id = 12345" | ✅ Primary key lookup — O(1) | ❌ Not designed for this |
| "Find all active users from last 30 days" | ✅ Index scan — efficient | ❌ No native temporal support |
| "Find the 10 most similar documents to this query" | ❌ Requires full table scan | ✅ HNSW search — O(log n) |
| "Find similar documents that are also from this tenant" | ❌ No vector distance | ✅ Pre-filtered ANN search |
| "Find documents matching both keywords AND semantic intent" | ⚠ Full-text only | ✅ Hybrid BM25 + vector |
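The "full table scan" row above can be made concrete with a minimal sketch: brute-force cosine top-k over a NumPy array is exactly the operation a vector database accelerates with an index. Synthetic data, NumPy only; this is an illustration, not a production search path.

```python
import numpy as np

def top_k_similar(query: np.ndarray, vectors: np.ndarray, k: int = 10) -> np.ndarray:
    """Brute-force cosine top-k: what a vector DB computes, minus the index.

    Every stored vector is scored against the query, O(n) per query,
    which is the full-table-scan cost the table above refers to.
    """
    # Normalize so the dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    # Indices of the k highest-scoring vectors, best first
    return np.argsort(-scores)[:k]

# Toy example: 1,000 random 64-dim "embeddings"
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))
hits = top_k_similar(docs[42], docs, k=5)
# A vector is always its own nearest neighbor
assert hits[0] == 42
```

An HNSW index replaces the O(n) scoring loop with a graph traversal, which is why the table above lists O(log n) for the vector database column.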
RAG Applications
The dominant 2025 use case. Retrieve relevant context from a knowledge base before passing it to an LLM. Quality of retrieval determines quality of LLM response. Powers enterprise chatbots, documentation search, and support systems.
Semantic Search
Search by meaning, not keywords. "Running shoes for wide feet" finds "broad-fit athletic footwear" even with zero keyword overlap. Essential for product discovery, knowledge management, and code search.
Agent Memory
AI agents need episodic memory — the ability to recall relevant past interactions or knowledge. Vector search retrieves semantically relevant memories in sub-100ms, enabling agents to maintain coherent context across long sessions.
Recommendation Systems
Match users to items by embedding proximity. Unlike collaborative filtering on explicit ratings, behavioral embeddings capture preferences users never articulated. Powers Netflix-style "because you watched..." at scale.
Anomaly Detection
Flag transactions or events far from the expected cluster in embedding space. Vector distance identifies outliers more naturally than threshold-based rules for fraud detection and security monitoring.
Deduplication
Detect near-duplicate content at scale. Find documents that are semantically equivalent even when phrased differently — critical for data cleaning pipelines and content moderation systems.
An embedding is a dense numerical representation of a piece of content — text, image, audio, code — as a vector of floating-point numbers. The key property: semantically similar content produces geometrically nearby vectors. "The capital of France" and "Paris is the largest city in France" will have similar embeddings even though they share few words.
At query time, the user's question goes through the same embedding model to produce a query vector. The vector database then finds the stored vectors with the highest similarity to the query vector (cosine similarity or dot product) or the smallest distance (Euclidean). This retrieval takes 1–100ms even across millions of chunks.
| Metric | Formula | When to Use | Notes |
|---|---|---|---|
| Cosine Similarity | cos(θ) = A·B / (‖A‖‖B‖) | Text embeddings, semantic search, RAG | Measures angle between vectors; robust to magnitude variation. Best default choice for text. |
| Dot Product | A·B = Σ(aᵢ × bᵢ) | Recommendation systems with pre-normalized vectors | Faster than cosine (no normalization step); requires unit-normalized vectors for meaningful comparison. |
| Euclidean (L2) | ‖A-B‖ = √Σ(aᵢ-bᵢ)² | Image embeddings, clustering tasks | Measures absolute distance in vector space. More sensitive to magnitude than cosine. Good for clustering. |
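The three formulas above map directly to a few lines of NumPy. A minimal sketch with illustrative toy vectors, showing why cosine ignores magnitude while Euclidean does not:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(θ) = A·B / (‖A‖‖B‖)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a, b):
    # A·B = Σ(aᵢ × bᵢ); equals cosine similarity only when both vectors are unit-normalized
    return float(np.dot(a, b))

def euclidean(a, b):
    # ‖A-B‖ = √Σ(aᵢ-bᵢ)²
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])  # same direction, twice the magnitude
assert cosine_similarity(a, b) == 1.0  # angle-based: identical direction
assert euclidean(a, b) == 1.0          # magnitude-sensitive: nonzero distance
```

For unit-normalized vectors the three metrics rank results identically, which is why dot product on pre-normalized vectors is a common performance optimization.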
Hierarchical Navigable Small World (HNSW) is the index algorithm used by Pinecone, Qdrant, Weaviate, pgvector, and most other production vector databases. Understanding HNSW is essential for tuning index parameters for your workload.
HNSW builds a multi-layer graph where each layer is a navigable small world network. The top layers are sparse (few nodes, long-range connections for fast traversal) and the bottom layer is dense (all nodes, short-range connections for precise search). At query time, the algorithm starts at the top layer and greedily navigates toward the query vector, refining the search in each successive denser layer.
Key HNSW Parameters

| Parameter | Meaning | Trade-off |
|---|---|---|
| m | Maximum number of graph connections per node | Higher m improves recall at the cost of memory and build time |
| ef (search) | Size of the candidate list explored at query time | Higher ef improves recall at the cost of query latency |
| ef_construction | Candidate list size during index construction | Higher values build a better graph, more slowly |

Recall vs. Performance Trade-offs
| m | ef | Recall@10 | QPS |
|---|---|---|---|
| 8 | 64 | 95% | ~800 |
| 16 | 64 | 97% | ~500 |
| 32 | 128 | 99% | ~200 |
| 64 | 256 | 99.5% | ~80 |
Approximate values for 1M vectors, 1536 dimensions. Production benchmarks vary significantly by hardware and dataset.
Critical insight on recall benchmarks:
Performance benchmarks only mean something with a recall number attached. Comparing "10ms at 90% recall" to "50ms at 99% recall" is meaningless — they solve different problems. A RAG system at 95% recall misses 1 in 20 relevant documents. At 99%, it misses 1 in 100. That difference determines whether your AI application regularly provides incomplete context or almost never does.
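Recall@k itself is simple to measure: compare an approximate result list against exact brute-force ground truth on a held-out query set. A minimal sketch on synthetic data, using truncated dimensions as a deliberately lossy stand-in for a real ANN index:

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def top_k(q, v, k):
    # Exact brute-force scoring (dot product for simplicity)
    return np.argsort(-(v @ q))[:k]

rng = np.random.default_rng(1)
docs = rng.normal(size=(5000, 128))
query = rng.normal(size=128)

exact = top_k(query, docs, 10)
# Lossy stand-in for an ANN index: score on only the first 32 of 128 dimensions
approx = top_k(query[:32], docs[:, :32], 10)
print(recall_at_k(approx, exact))  # a number in [0, 1]; low here, by design
```

The same harness works against any real index: swap the lossy search for your database's ANN query and sweep its parameters (e.g. ef) to produce a recall/QPS curve like the table above.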
The vector database landscape includes over 20 options in 2025. Based on market adoption and production maturity, six databases cover approximately 80% of enterprise production deployments. Here is an honest comparison, free of vendor marketing.
May 2025 Benchmark (Timescale):
pgvectorscale with DiskANN + Statistical Binary Quantization: 471 QPS at 99% recall on 50M vectors — 11.4x better than Qdrant (41 QPS) at the same recall level. p95 latency 28x lower than Pinecone s1 at 99% recall. Cost savings: ~75% vs managed vector databases at comparable workloads.
After reviewing production AI deployments, four architecture patterns account for the vast majority of enterprise vector database implementations in 2025. Each pattern reflects a specific set of scale, complexity, and operational constraints.
The hybrid approach combines vector similarity (captures semantic intent) with keyword matching (captures exact terms, product names, error codes). Reciprocal Rank Fusion (RRF) merges the two result sets. This outperforms pure vector search in enterprise applications where queries often include specific technical terms, names, or identifiers.
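RRF itself is only a few lines. A minimal sketch, with toy result lists standing in for real ANN and BM25 outputs (k=60 is the constant from the original RRF formulation; ranks are 1-based):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked result lists with RRF: score(d) = Σᵢ 1 / (k + rankᵢ(d)).

    Documents ranked highly by multiple retrievers accumulate the most score,
    without requiring the retrievers' raw scores to be comparable.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # from ANN search
keyword_hits = ["doc_b", "doc_d", "doc_a"]  # from BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# doc_b (ranks 2 and 1) edges out doc_a (ranks 1 and 3)
assert fused[0] == "doc_b"
```

Because RRF operates on ranks rather than scores, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.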
For companies under 50M vectors already running PostgreSQL, pgvector provides roughly 80% of dedicated vector database performance with none of the additional operational cost. The 2025 Timescale benchmarks change the recommendation for early-stage companies: do not add a dedicated vector database until you have empirical evidence that pgvector cannot meet your SLAs.
| Component | Technology |
|---|---|
| Relational + vector | PostgreSQL + pgvector + pgvectorscale |
| Object storage | S3/GCS for source documents and model artifacts |
| Cache | Redis for embedding cache (avoids re-embedding identical text) |
When pgvector hits throughput limits — typically at 100M+ vectors or high concurrent query load — the recommended progression is to separate the vector workload into a dedicated service while keeping the relational core in PostgreSQL.
| Component | Technology | Responsibility |
|---|---|---|
| Primary database | PostgreSQL | User data, permissions, audit logs, metadata |
| Vector search | Pinecone / Weaviate / Qdrant | ANN search over embedding index |
| Cache layer | Redis | Recent embeddings, query result cache, rate limiting |
| Object store | S3 / R2 | Source documents, model checkpoints, batch exports |
| Message queue | SQS / Pub/Sub | Embedding pipeline jobs, document ingestion queue |
A RAG pipeline has two phases: indexing (offline) and retrieval (online). Most production failures come from under-engineering the indexing phase, not the retrieval phase.
A simple improvement that consistently improves RAG retrieval quality: before embedding the user's query, use an LLM to generate 3-5 alternative phrasings. Embed all variations and merge the result sets before RRF. This compensates for the mismatch between how users phrase questions and how documents are written.
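A sketch of that expansion step, assuming a hypothetical `llm_paraphrase` call and an `embed_and_search` stand-in for your embedding model and vector store (both stubbed so the sketch is self-contained and runnable):

```python
def llm_paraphrase(query: str, n: int = 4) -> list[str]:
    # In production: one LLM call asking for n alternative phrasings.
    # Stubbed here with synthetic variants for illustration.
    return [f"{query} (variant {i})" for i in range(n)]

def embed_and_search(query: str, k: int = 10) -> list[str]:
    # Stand-in for: embed(query) -> ANN search -> ranked doc ids
    return [f"doc_{hash(query + str(r)) % 50}" for r in range(k)]

def multi_query_retrieve(query: str, k: int = 10) -> list[list[str]]:
    """Embed the original query plus paraphrases; return one ranked list each.

    Merge the returned lists with Reciprocal Rank Fusion before reranking.
    """
    variants = [query] + llm_paraphrase(query)
    return [embed_and_search(v, k) for v in variants]

result_lists = multi_query_retrieve("how do I reset my password?")
```

The extra cost is one LLM call and a few embedding calls per query; whether that trade is worth it depends on how far your users' phrasing drifts from your documents' phrasing.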
The embedding model is the single most impactful choice in a RAG system — more impactful than which vector database you use. A better embedding model will improve retrieval quality regardless of which database stores the vectors. A poor embedding model cannot be rescued by any database optimization.
| Model | Dimensions | Best For | Cost | Notes |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | General English text, RAG, semantic search | $0.02/1M tokens | Best price/performance ratio for English. Supports Matryoshka (reducible to 512 dims with minimal loss). |
| OpenAI text-embedding-3-large | 3072 | High-stakes retrieval where quality > cost | $0.13/1M tokens | Best OpenAI quality. 6.5x more expensive than small — verify improvement before upgrading. |
| Cohere embed-v3 | 1024 | Multi-language, enterprise, high throughput | ~$0.10/1M tokens | Strong multilingual support (100+ languages). Separate models for search vs. classification. |
| BGE-m3 (BAAI, open-source) | 1024 | Self-hosted deployments, cost control | Infrastructure only | State-of-art open-source. Supports dense, sparse, and multi-vector retrieval simultaneously. |
| Jina Embeddings v3 | 1024 | Long document chunks (up to 8K tokens) | ~$0.02/1M tokens | Context window up to 8192 tokens per chunk — reduces chunking complexity. |
You must use the exact same embedding model for indexing and querying. Mixing models (even different versions of the same model) produces vectors in incompatible spaces, causing retrieval to completely fail. Always store the model name + version in your embedding schema and re-index when upgrading models.
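One way to enforce this is to store the model identifier alongside the index and fail fast on any mismatch at query time. A sketch with illustrative names (the record shape and model string are assumptions, not a specific library's API):

```python
# Pinned model identifier, stored once per index/collection alongside the vectors
INDEX_MODEL = "text-embedding-3-small@v1"

def make_record(doc_id: str, vector: list[float], model: str) -> dict:
    # Persist the model id with every embedding so re-index audits are possible
    return {"id": doc_id, "vector": vector, "model": model}

def query_index(query_vector: list[float], query_model: str, index_model: str):
    if query_model != index_model:
        raise ValueError(
            f"Embedding model mismatch: index={index_model!r}, query={query_model!r}. "
            "Re-index or switch the query model; mixed spaces fail silently."
        )
    ...  # proceed with ANN search

record = make_record("doc_1", [0.1, 0.2], INDEX_MODEL)
query_index([0.1, 0.2], "text-embedding-3-small@v1", INDEX_MODEL)  # passes the guard
```

Failing loudly here is the point: a model mismatch does not raise errors on its own — it just returns irrelevant results.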
Chunking — how you split source documents before embedding — has a larger impact on RAG quality than most teams realize. The goal is to produce chunks that are semantically self-contained and small enough to fit in the embedding model's context window, while large enough to carry meaningful context.
| Strategy | Chunk Size | Best For | Trade-off |
|---|---|---|---|
| Fixed-size with overlap | 256–512 tokens, 10–15% overlap | General purpose, homogeneous document collections | Simple to implement; may split sentences/paragraphs mid-thought |
| Sentence-level | 1–5 sentences per chunk | FAQ databases, customer support documents | Preserves semantic boundaries; very small chunks may lack context |
| Semantic chunking | Variable (follows topic boundaries) | Long-form articles, research papers, documentation | Best quality; requires embedding-based boundary detection; higher indexing cost |
| Document hierarchy (parent-child) | Child: 128 tokens; Parent: 512 tokens | Documents with sections (APIs docs, legal texts) | Retrieve small chunks for precision, return parent for context; requires two index layers |
For most enterprise RAG applications: 512 tokens per chunk, 64-token overlap, sentence boundary preservation (do not split mid-sentence). This configuration works well across diverse document types and provides a stable baseline for A/B testing chunk size improvements.
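The recommended configuration can be sketched as a greedy, sentence-preserving chunker. Two stated simplifications: a naive regex sentence splitter and word count as a rough token proxy — production code should use a real tokenizer and sentence segmenter.

```python
import re

def chunk_text(text: str, max_tokens: int = 512, overlap_tokens: int = 64) -> list[str]:
    """Pack whole sentences into chunks of at most max_tokens, carrying
    roughly overlap_tokens of trailing sentences into the next chunk.

    Caveat: a single sentence longer than max_tokens becomes its own
    oversized chunk; handle that case explicitly in production.
    """
    # Naive sentence split on terminal punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # word count as a rough token proxy
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Retain trailing sentences as overlap for the next chunk
            kept, kept_len = [], 0
            for prev in reversed(current):
                kept_len += len(prev.split())
                kept.insert(0, prev)
                if kept_len >= overlap_tokens:
                    break
            current, current_len = kept, kept_len
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "First sentence here. " * 200
chunks = chunk_text(doc, max_tokens=50, overlap_tokens=10)
```

Keeping the chunker a pure function of (text, parameters) makes chunk-size A/B tests a matter of re-indexing with different arguments.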
Multi-tenant SaaS applications face unique challenges with vector databases: tenant data isolation, performance fairness (one large tenant should not slow others), and scaling economics (per-tenant index vs. shared index). The choice of isolation strategy has major architectural implications.
| Strategy | Implementation | Isolation Level | Scalability | Best For |
|---|---|---|---|---|
| Shared index with tenant_id filter | Pre-filter on tenant_id before ANN search | Software (database enforces) | Good for <1,000 tenants, even with uneven sizes | Most SaaS apps; simplest to operate |
| Namespace per tenant | Pinecone namespaces, Qdrant collections | Logical separation in shared infrastructure | Good; check namespace limits (Pinecone: 20 indexes) | Mid-market SaaS; moderate tenant count |
| Database per tenant | Turso / separate pgvector per tenant | Complete isolation | Excellent for regulated industries | Healthcare, BFSI, government where hard isolation required |
With naive post-filtering, ANN algorithms like HNSW traverse a graph built over all vectors in the index and apply the tenant filter only after candidates are scored. Without a pre-filter enforced at the database layer (like PostgreSQL RLS or Pinecone's namespace isolation), an ANN query can visit and score vectors from other tenants before the tenant filter is applied. Always enforce tenant isolation at the database layer, not just the application layer.
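The pre- vs. post-filter difference can be demonstrated with a brute-force toy (NumPy only; the tenant assignment is synthetic). Post-filtering scores every tenant's vectors and can return fewer than k results; pre-filtering restricts the candidate set before any scoring happens:

```python
import numpy as np

rng = np.random.default_rng(2)
vectors = rng.normal(size=(1000, 32))
tenant_ids = rng.choice(["tenant_a", "tenant_b"], size=1000)
query = rng.normal(size=32)

def post_filtered(q, k=10):
    # Anti-pattern: score ALL tenants' vectors, filter afterwards.
    # Other tenants' data is visited, and the final list may come up short.
    top = np.argsort(-(vectors @ q))[:k]
    return [int(i) for i in top if tenant_ids[i] == "tenant_a"]

def pre_filtered(q, k=10):
    # Correct: restrict the candidate set to the tenant BEFORE searching.
    idx = np.flatnonzero(tenant_ids == "tenant_a")
    top = np.argsort(-(vectors[idx] @ q))[:k]
    return [int(idx[i]) for i in top]

assert len(pre_filtered(query)) == 10
assert len(post_filtered(query)) <= 10  # typically fewer than k survive the filter
```

In a real HNSW index the same logic applies at graph-traversal level, which is why the filter must be enforced where the index lives, not in application code afterwards.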
Vector quantization reduces memory footprint by representing each dimension with fewer bits. Modern quantization methods maintain high recall while dramatically reducing cost:
| Quantization Type | Memory Reduction | Recall Impact | Notes |
|---|---|---|---|
| None (float32) | Baseline | 100% | Full precision; highest memory cost |
| int8 (scalar quantization) | 75% reduction | >99% | Strong recall; Redis reports 99.99% retention |
| Binary quantization | 96% reduction | ~95% | Extreme compression; requires rescoring with full vectors |
| Product quantization (PQ) | ~90% reduction | 95–98% | FAISS standard; good balance; used in Pinecone serverless |
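A minimal sketch of scalar (int8) quantization in NumPy, showing the 75% memory reduction from the table. Real engines add per-segment calibration and rescoring with full-precision vectors on top of this basic scheme:

```python
import numpy as np

def scalar_quantize(vectors: np.ndarray):
    """Per-dimension int8 quantization: 4 bytes/dim -> 1 byte/dim."""
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    scale = (hi - lo) / 255.0                 # map each dimension's range onto 0..255
    codes = np.round((vectors - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(3)
vecs = rng.normal(size=(1000, 128)).astype(np.float32)
codes, lo, scale = scalar_quantize(vecs)
restored = dequantize(codes, lo, scale)

assert codes.nbytes == vecs.nbytes // 4       # 75% memory reduction
assert np.abs(vecs - restored).max() < scale.max()  # bounded per-dimension error
```

The reconstruction error per dimension is bounded by half the quantization step, which is why recall loss stays small when the quantized scores are used only to shortlist candidates for full-precision rescoring.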
Embedding API calls are the primary variable cost in RAG systems. Caching embeddings for recently seen queries typically reduces API costs by 30–60% in production systems where queries exhibit power-law distribution (a small fraction of queries account for most traffic).
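A content-addressed cache makes this concrete. A minimal sketch with an in-process dict standing in for Redis and a stubbed `embed_fn` in place of a real embedding API client:

```python
import hashlib

class EmbeddingCache:
    """Content-addressed cache: identical text never hits the embedding API twice."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._cache = {}   # in production: Redis, keyed the same way
        self.hits = 0
        self.misses = 0

    def get(self, text: str):
        # Hash the exact text so the key is stable and fixed-size
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self.embed_fn(text)
        return self._cache[key]

cache = EmbeddingCache(embed_fn=lambda t: [float(len(t))])  # stub embedder
cache.get("reset my password")
cache.get("reset my password")   # served from cache, no second API call
assert (cache.hits, cache.misses) == (1, 1)
```

With a power-law query distribution, even a modest cache with a TTL captures the head of the traffic, which is where the 30–60% savings quoted above come from.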
Use this decision framework to select the right vector database architecture for your specific situation. The most common mistake is over-engineering: teams choose dedicated vector databases before they have validated that PostgreSQL + pgvector cannot meet their requirements.
| Your Situation | Recommended Architecture | Why |
|---|---|---|
| Existing PostgreSQL, <50M vectors | pgvector + pgvectorscale | 471 QPS at 99% recall is sufficient; zero additional infra; lowest cost |
| Existing PostgreSQL, 50–100M vectors | pgvector + pgvectorscale (evaluate Pinecone if SLA not met) | Test with your actual workload before adding complexity |
| Greenfield, under 10M vectors, small team | Qdrant Cloud or Weaviate | Best developer experience; best free tier (Qdrant); minimal ops burden |
| Need hybrid search (vectors + keywords) | Weaviate | Native BM25 + vector fusion; best hybrid search implementation |
| 10–100M vectors, want managed + reliable | Pinecone | Zero ops; proven SLAs; best support; 7ms p99 |
| 100M+ vectors, cost-sensitive, ops expertise | Milvus self-hosted / Zilliz Cloud | 70%+ cost savings vs managed; scales to billions |
| Regulated industry, hard tenant isolation required | PostgreSQL RLS + pgvector or per-tenant databases | Database-native enforcement satisfies SOC 2 / HIPAA auditors |
Vector databases are not magic. They are specialized index structures optimized for one operation — approximate nearest neighbor search in high-dimensional space. Understanding this constraint is the key to making good architectural decisions: use vector databases where their specific capability is required, and avoid adding them to systems where relational databases can meet the requirement.
The 2025 benchmark landscape has shifted significantly. pgvectorscale's performance at 50M vectors has narrowed the gap between PostgreSQL extensions and dedicated vector databases to the point where the "start simple" advice is now backed by hard performance numbers. The threshold for adding a dedicated vector database has moved from 10M vectors to 100M vectors for most workloads.
The architecture decision tree in plain language: start with PostgreSQL + pgvector and benchmark against your actual workload; add a dedicated vector database (Pinecone, Qdrant, or Weaviate) only when you have empirical evidence that pgvector cannot meet your SLAs; move to self-hosted Milvus or Zilliz Cloud past 100M vectors when cost sensitivity and ops expertise justify it; and in regulated industries, enforce tenant isolation at the database layer from day one.
Isaac Shi writes about AI, software, and entrepreneurship at isaacshi.com. These essays provide the strategic and philosophical context behind this thesis.