THOR THUNDERSCAN · THESIS AI Infrastructure

Vector Databases and Embedding Architecture for Enterprise AI

How to Choose, Architect, and Operate the Semantic Search Infrastructure Behind RAG Pipelines, AI Agents, and Recommendation Engines

Isaac Shi, Co-Founder & GP of Golden Section VC
October 9, 2025

Executive Summary

The vector database market grew from $1.6 billion in 2023 to a projected $10.6 billion by 2032 — a 23.5% CAGR driven almost entirely by the explosion of Retrieval-Augmented Generation (RAG) applications and enterprise AI deployments. Yet most engineering teams making vector database decisions in 2025 are choosing based on marketing materials rather than production evidence.

The core architectural insight:

The debate is not "vector database vs. relational database." Every production AI system uses both — relational stores for governance, transactions, and metadata; vector stores for semantic retrieval. The decision is where to draw the boundary between them, and for most companies, that boundary is closer to the relational side than vendor marketing suggests.

$10.6B
projected vector database market by 2032 — up from $1.6B in 2023 (SNS Insider, 23.5% CAGR)
471 QPS
pgvectorscale at 99% recall on 50M vectors (Timescale, May 2025)
Hybrid
Every production AI system uses both vector and relational stores — the debate is only where to draw the boundary

What Vector Databases Actually Solve

Before choosing a vector database, it is worth understanding exactly what problem the technology solves — and equally importantly, what it does not solve. Vector databases excel at one specific operation: given a query vector, find the N most similar vectors in the index. This "approximate nearest neighbor" search is the foundation of semantic retrieval.

Where SQL Cannot Go

Traditional databases can answer "find all documents where category = 'finance' and date > '2024-01-01'." They cannot answer "find the 10 documents most semantically similar to this query." Similarity search requires comparing a query vector against every indexed vector and returning the closest matches — an operation that does not map to SQL's equality and range predicates.

Query Type | SQL Database | Vector Database
"Find user with id = 12345" | ✅ Primary key lookup — O(1) | ❌ Not designed for this
"Find all active users from last 30 days" | ✅ Index scan — efficient | ❌ No native temporal support
"Find the 10 most similar documents to this query" | ❌ Requires full table scan | ✅ HNSW search — O(log n)
"Find similar documents that are also from this tenant" | ❌ No vector distance | ✅ Pre-filtered ANN search
"Find documents matching both keywords AND semantic intent" | ⚠ Full-text only | ✅ Hybrid BM25 + vector

The Six Production Use Cases

RAG Applications

The dominant 2025 use case. Retrieve relevant context from a knowledge base before passing it to an LLM. Quality of retrieval determines quality of LLM response. Powers enterprise chatbots, documentation search, and support systems.

Semantic Search

Search by meaning, not keywords. "Running shoes for wide feet" finds "broad-fit athletic footwear" even with zero keyword overlap. Essential for product discovery, knowledge management, and code search.

Agent Memory

AI agents need episodic memory — the ability to recall relevant past interactions or knowledge. Vector search retrieves semantically relevant memories in sub-100ms, enabling agents to maintain coherent context across long sessions.

Recommendation Systems

Match users to items by embedding proximity. Unlike collaborative filtering on explicit ratings, behavioral embeddings capture preferences users never articulated. Powers Netflix-style "because you watched..." at scale.

Anomaly Detection

Flag transactions or events far from the expected cluster in embedding space. Vector distance identifies outliers more naturally than threshold-based rules for fraud detection and security monitoring.

Deduplication

Detect near-duplicate content at scale. Find documents that are semantically equivalent even when phrased differently — critical for data cleaning pipelines and content moderation systems.

[Image: Milky Way over a mountain range — infinite points of light mapping an unseen geometry]

Every star in the Milky Way has coordinates. Every word in a sentence has a vector. Both systems let you measure proximity in spaces too vast to navigate by hand.

How Embeddings Work: The Foundation

An embedding is a dense numerical representation of a piece of content — text, image, audio, code — as a vector of floating-point numbers. The key property: semantically similar content produces geometrically nearby vectors. "The capital of France" and "Paris is the largest city in France" will have similar embeddings even though they share few words.

From Text to Vectors: The Pipeline

Source Text
"Database schema affects AI model quality"
Chunking
512-token chunks with 64-token overlap
Embedding Model
text-embedding-3-small (1536 dims)
Vector
[0.023, -0.891, 0.445, ...] × 1536
HNSW Index
Navigable graph for O(log n) search

At query time, the user's question goes through the same embedding model to produce a query vector. The vector database then finds the stored vectors closest to the query vector under the chosen metric (highest cosine similarity or dot product, or smallest Euclidean distance). This retrieval takes 1–100ms even across millions of chunks.

Distance Metrics: Choosing the Right One

Metric | Formula | When to Use | Notes
Cosine Similarity | cos(θ) = A·B / (‖A‖‖B‖) | Text embeddings, semantic search, RAG | Measures angle between vectors; robust to magnitude variation. Best default choice for text.
Dot Product | A·B = Σ(aᵢ × bᵢ) | Recommendation systems with pre-normalized vectors | Faster than cosine (no normalization step); requires unit-normalized vectors for meaningful comparison.
Euclidean (L2) | ‖A−B‖ = √Σ(aᵢ−bᵢ)² | Image embeddings, clustering tasks | Measures absolute distance in vector space. More sensitive to magnitude than cosine. Good for clustering.
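The three metrics reduce to a few lines of code. The toy TypeScript sketch below is illustrative only (not an optimized kernel); it also shows why dot product is the cheap drop-in for unit-normalized vectors.

```typescript
// Toy implementations of the three common metrics (illustrative sketch).

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

function norm(a: number[]): number {
  return Math.sqrt(dot(a, a));
}

// Cosine similarity: measures the angle between vectors, ignores magnitude.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (norm(a) * norm(b));
}

// Euclidean (L2) distance: absolute distance in the vector space.
function euclidean(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

// For unit-normalized vectors, dot product equals cosine similarity,
// which is why recommendation systems pre-normalize and skip the division.
const a = [3, 4];                    // norm 5
const unitA = a.map(x => x / 5);
console.log(cosineSimilarity(a, unitA)); // 1 (same direction)
console.log(dot(unitA, unitA));          // 1 (unit vector dotted with itself)
```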

HNSW: The Algorithm Powering Modern Vector Search

Hierarchical Navigable Small World (HNSW) is the index algorithm used by Pinecone, Qdrant, Weaviate, pgvector, and most other production vector databases. Understanding HNSW is essential for tuning index parameters for your workload.

How HNSW Works

HNSW builds a multi-layer graph where each layer is a navigable small world network. The top layers are sparse (few nodes, long-range connections for fast traversal) and the bottom layer is dense (all nodes, short-range connections for precise search). At query time, the algorithm starts at the top layer and greedily navigates toward the query vector, refining the search in each successive denser layer.
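The greedy step at the heart of that navigation can be sketched in a few lines. The toy graph and `greedySearch` helper below are illustrative inventions — a single layer with no ef-sized candidate beam — but they show how repeatedly hopping to the closest neighbor converges on the nearest node.

```typescript
// Minimal sketch of HNSW's greedy step on a single layer (hypothetical
// toy graph; real HNSW adds multiple layers and an ef-sized beam).

type Graph = Map<number, number[]>;   // node id -> neighbor ids

function greedySearch(
  vectors: number[][],                // node id -> vector
  graph: Graph,
  query: number[],
  entry: number
): number {
  const dist = (id: number) =>
    Math.hypot(...vectors[id].map((v, i) => v - query[i]));
  let current = entry;
  let best = dist(current);
  // Hop to whichever neighbor is closest to the query until no neighbor
  // improves on the current node -- a local minimum, which in a navigable
  // small world graph approximates the true nearest neighbor.
  for (;;) {
    let next = current;
    for (const n of graph.get(current) ?? []) {
      const d = dist(n);
      if (d < best) { best = d; next = n; }
    }
    if (next === current) return current;
    current = next;
  }
}

// Four points on a line: 0 -- 1 -- 2 -- 3, each linked to its neighbors.
const vectors = [[0], [1], [2], [3]];
const graph: Graph = new Map([[0, [1]], [1, [0, 2]], [2, [1, 3]], [3, [2]]]);
console.log(greedySearch(vectors, graph, [2.9], 0)); // 3 (navigates 0 -> 1 -> 2 -> 3)
```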

Key HNSW Parameters

```sql
CREATE INDEX idx_hnsw ON chunks
USING hnsw (embedding vector_cosine_ops)  -- pgvector requires an operator class
WITH (
    m = 16,               -- connections per node
    ef_construction = 64  -- build quality
);
```
  • m (default 16): edges per node. Higher = better recall, larger index
  • ef_construction (default 64): build quality. Higher = slower build, better recall
  • ef_search: query quality. Higher = slower query, better recall

Recall vs. Performance Trade-offs

m | ef | Recall@10 | QPS
8 | 64 | 95% | ~800
16 | 64 | 97% | ~500
32 | 128 | 99% | ~200
64 | 256 | 99.5% | ~80

Approximate values for 1M vectors, 1536 dimensions. Production benchmarks vary significantly by hardware and dataset.

Critical insight on recall benchmarks:

Performance benchmarks only mean something with a recall number attached. Comparing "10ms at 90% recall" to "50ms at 99% recall" is meaningless — they solve different problems. A RAG system at 95% recall misses 1 in 20 relevant documents. At 99%, it misses 1 in 100. That difference determines whether your AI application regularly provides incomplete context or almost never does.
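Recall itself is simple to compute once you have exact ground truth, typically from a brute-force scan over a sample of queries. A minimal sketch:

```typescript
// Recall@k: the fraction of the true k nearest neighbors that the ANN
// index actually returned. Ground truth comes from an exact scan.

function recallAtK(retrieved: number[], groundTruth: number[]): number {
  const truth = new Set(groundTruth);
  const hits = retrieved.filter(id => truth.has(id)).length;
  return hits / groundTruth.length;
}

// The ANN index returned 9 of the 10 exact nearest neighbors: 90% recall@10.
const exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
const ann   = [1, 2, 3, 4, 5, 6, 7, 8, 9, 42];
console.log(recallAtK(ann, exact)); // 0.9
```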

Vector Database Comparison: The Six That Matter

The vector database landscape has over 20 options in 2025. Based on market adoption, production maturity, and coverage of enterprise use cases, six databases account for approximately 80% of production deployments. Here is an honest comparison — without vendor marketing.

Pinecone
Best managed solution · Zero operational overhead · Production-proven at billions of vectors
Type: Fully managed, serverless
Performance: ~5–10ms p99 at small scale; higher at production vector counts
Pricing: $0.33/GB storage + ops; free tier

✅ Choose When

  • Building commercial AI products, need zero ops
  • Team lacks database operational expertise
  • Need SLA guarantees and enterprise support
  • Time-to-market matters more than cost

⚠ Avoid When

  • Tight budget (>10M vectors gets expensive fast)
  • Need full infrastructure control
  • Already have PostgreSQL with pgvector capability
  • Vendor lock-in is a concern
Milvus / Zilliz Cloud
Best open-source option · 40K+ GitHub stars · Proven at billions of vectors
Type: Open-source (Apache 2.0)
Performance: Single-digit ms, sub-30ms p95
Pricing: Free (infra costs); Zilliz managed from $99/mo

✅ Choose When

  • Billion-scale vector needs
  • Strong data engineering capacity on the team
  • Cost-sensitive with large datasets (saves 70%+ vs managed)
  • Need maximum infrastructure control

⚠ Avoid When

  • Small team or early-stage startup
  • Under 10M vectors (over-engineered)
  • No Kubernetes operational experience
  • Need fastest time to production
Weaviate
Best hybrid search · Exceptional documentation · Native BM25 + vector fusion
Type: Open-source + managed cloud
Performance: Sub-100ms for RAG at <50M vectors
Pricing: OSS free; Cloud $25/mo after 14-day trial

✅ Choose When

  • Need hybrid search (vectors + keywords + filters)
  • Building RAG systems under 50M vectors
  • Value excellent documentation for fast POC
  • Want modular architecture (swap embedding models)

⚠ Avoid When

  • Need absolute maximum throughput
  • Scale above 100M vectors
  • Very tight budget (14-day trial limit)
  • Prefer REST API over GraphQL
Qdrant
Best free tier · Rust-native efficiency · Excellent filtering at moderate scale
Type: Open-source + managed
Performance: 1ms p99 (small), 626 QPS at 1M vectors
Pricing: 1GB free forever; $25/mo paid

✅ Choose When

  • Budget-conscious (best free tier in market)
  • Need complex metadata filtering
  • Under 50M vectors
  • Edge or on-device deployment needed

⚠ Avoid When

  • Above 50M vectors (performance degrades)
  • High concurrent write workloads
  • Need largest ecosystem / community support
pgvector + pgvectorscale
Best for PostgreSQL shops · 471 QPS at 99% recall on 50M vectors · Zero additional infrastructure
Type: PostgreSQL extensions
Performance: 471 QPS at 99% recall, 50M vectors
Pricing: Free (existing PostgreSQL infra)

May 2025 Benchmark (Timescale):

pgvectorscale with DiskANN + Statistical Binary Quantization: 471 QPS at 99% recall on 50M vectors — 11.4x better than Qdrant (41 QPS) at the same recall level. p95 latency 28x lower than Pinecone s1 at 99% recall. Cost savings: ~75% vs managed vector databases at comparable workloads.

✅ Choose When

  • Already running PostgreSQL (most B2B SaaS)
  • Need vectors alongside relational data in same queries
  • Under 100M vectors
  • Strong cost-efficiency requirement
  • Want to reduce system complexity

⚠ Avoid When

  • Above 100M vectors (architectural limits)
  • Pure vector workload at very high throughput
  • No PostgreSQL expertise on team
  • ORM doesn't support pgvector (check Prisma gaps)
Elasticsearch
Best for existing Elastic users · Battle-tested reliability · Unified search + vectors
Type: Search engine + vector
Performance: ~260ms exact kNN, sub-50ms with ANN+quantization
Pricing: Elastic Cloud / self-hosted

✅ Choose When

  • Already running Elasticsearch for search/logging
  • Need traditional search + semantic in one system
  • Value decade of production operational maturity

⚠ Avoid When

  • Pure vector workload (specialized DBs win)
  • Greenfield project (too much overhead)
  • Cost-sensitive (Elastic Cloud is expensive)

Enterprise Architecture Patterns

After reviewing production AI deployments, four architecture patterns account for the vast majority of enterprise vector database implementations in 2025. Each pattern reflects a specific set of scale, complexity, and operational constraints.

Pattern 1: Hybrid RAG (The Standard — 80% of Enterprise RAG)

┌─────────────────────────────────────────────────────┐
│               HYBRID RAG ARCHITECTURE               │
└─────────────────────────────────────────────────────┘

User Query
    │
    ▼
Embedding API (text-embedding-3-small)
    │
    ├──▶ pgvector / Pinecone      ← ANN search (semantic)
    │    Returns top-K chunk IDs + scores
    │
    ├──▶ PostgreSQL (BM25)        ← Full-text search (keyword)
    │    Returns top-K document IDs + BM25 scores
    │
    ▼
Reciprocal Rank Fusion            ← Combines semantic + keyword scores
    │
    ▼
Top-K Chunks Retrieved from PostgreSQL (full text)
    │
    ▼
LLM API (with retrieved context injected into prompt)
    │
    ▼
Response to User

The hybrid approach combines vector similarity (captures semantic intent) with keyword matching (captures exact terms, product names, error codes). Reciprocal Rank Fusion (RRF) merges the two result sets. This outperforms pure vector search in enterprise applications where queries often include specific technical terms, names, or identifiers.
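RRF itself is only a few lines: each ranked list contributes 1/(c + rank) per document, with c = 60 as the constant from the original RRF paper and a common default. A self-contained sketch (function and document names are illustrative):

```typescript
// Reciprocal Rank Fusion (sketch). Each result list votes for its
// documents with weight 1 / (c + rank); documents that rank well in
// several lists accumulate the highest fused score.

function reciprocalRankFusion(
  resultSets: string[][],   // each inner array is ranked best-first
  k = 10,
  c = 60                    // constant from the original RRF paper
): string[] {
  const scores = new Map<string, number>();
  for (const results of resultSets) {
    results.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (c + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((x, y) => y[1] - x[1])
    .slice(0, k)
    .map(([id]) => id);
}

// "docB" ranks first semantically and second by keyword, so it tops the
// fused list even though the keyword list prefers "docD".
const semantic = ["docB", "docA", "docC"];
const keyword  = ["docD", "docB", "docA"];
console.log(reciprocalRankFusion([semantic, keyword], 2)); // ["docB", "docA"]
```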

Pattern 2: Startup-Optimal (PostgreSQL + pgvector)

For companies under 50M vectors running PostgreSQL, pgvector provides 80% of dedicated vector database performance at 0% of the additional operational cost. The 2025 Timescale benchmarks change the recommendation for early-stage companies: do not add a dedicated vector database until you have empirical evidence that pgvector cannot meet your SLAs.

✓ THE STARTUP STACK (RECOMMENDED FOR <50M VECTORS)

Relational + Vector

PostgreSQL + pgvector + pgvectorscale

Object Storage

S3/GCS for source documents and model artifacts

Cache

Redis for embedding cache (avoid re-embedding identical text)

Pattern 3: Growth-Scale Architecture (10M–500M Vectors)

When pgvector hits throughput limits — typically at 100M+ vectors or high concurrent query load — the recommended progression is to separate the vector workload into a dedicated service while keeping the relational core in PostgreSQL.

ComponentTechnologyResponsibility
Primary databasePostgreSQLUser data, permissions, audit logs, metadata
Vector searchPinecone / Weaviate / QdrantANN search over embedding index
Cache layerRedisRecent embeddings, query result cache, rate limiting
Object storeS3 / R2Source documents, model checkpoints, batch exports
Message queueSQS / Pub/SubEmbedding pipeline jobs, document ingestion queue
[Image: Ocean waves breaking on shore — constant retrieval, constant flow]

A RAG pipeline is like the tide — it retreats into vast storage and returns only what the moment calls for, reliably, every time.

Production RAG Pipeline Design

A RAG pipeline has two phases: indexing (offline) and retrieval (online). Most production failures come from under-engineering the indexing phase, not the retrieval phase.

Indexing Phase Architecture

INDEXING PIPELINE (runs on document ingestion and updates)

Source Document (PDF, Markdown, HTML, database record)
    │
    ▼
1. EXTRACT    Parse to plain text; strip HTML/PDF artifacts
    │
    ▼
2. CHUNK      Split into 512-token chunks, 64-token overlap
    │         Store chunk_index + parent_document_id
    ▼
3. ENRICH     Add metadata: source, date, author, section, tenant_id
    │         Compute content_hash for staleness detection
    ▼
4. EMBED      Call embedding API (batch for cost efficiency)
    │         Store embedding_model version for migration tracking
    ▼
5. INDEX      Upsert into vector store with all metadata
    │         Update last_embedded_at in source record
    ▼
6. VALIDATE   Spot-check retrieval quality on test queries
              Alert if embedding quality degraded

Retrieval Phase: The Query Expansion Pattern

A simple technique that consistently improves RAG retrieval quality: before embedding the user's query, use an LLM to generate 3–5 alternative phrasings. Embed all variations and merge the result sets via RRF. This compensates for the mismatch between how users phrase questions and how documents are written.

```typescript
// Query expansion pattern (TypeScript)
async function expandedRetrieval(userQuery: string, tenantId: string) {
  // Step 1: Generate query variants
  const variants = await llm.generateVariants(userQuery, 3);
  const allQueries = [userQuery, ...variants];

  // Step 2: Embed all variants in parallel
  const embeddings = await Promise.all(
    allQueries.map(q => embedText(q))
  );

  // Step 3: Search for each variant with tenant filter
  const resultSets = await Promise.all(
    embeddings.map(emb => vectorSearch(emb, { tenantId, k: 20 }))
  );

  // Step 4: Merge via Reciprocal Rank Fusion, keeping the top 10
  return reciprocalRankFusion(resultSets, 10);
}
```

Choosing Embedding Models

The embedding model is the single most impactful choice in a RAG system — more impactful than which vector database you use. A better embedding model will improve retrieval quality regardless of which database stores the vectors. A poor embedding model cannot be rescued by any database optimization.

Model | Dimensions | Best For | Cost | Notes
OpenAI text-embedding-3-small | 1536 | General English text, RAG, semantic search | $0.02/1M tokens | Best price/performance ratio for English. Supports Matryoshka (reducible to 512 dims with minimal loss).
OpenAI text-embedding-3-large | 3072 | High-stakes retrieval where quality > cost | $0.13/1M tokens | Best OpenAI quality. 6.5x more expensive than small — verify improvement before upgrading.
Cohere embed-v3 | 1024 | Multi-language, enterprise, high throughput | ~$0.10/1M tokens | Strong multilingual support (100+ languages). Separate models for search vs. classification.
BGE-m3 (BAAI, open-source) | 1024 | Self-hosted deployments, cost control | Infrastructure only | State-of-the-art open source. Supports dense, sparse, and multi-vector retrieval simultaneously.
Jina Embeddings v3 | 1024 | Long document embeddings (>8K tokens) | ~$0.02/1M tokens | Supports context windows up to 8192 tokens per chunk — reduces chunking complexity.
⚠ CRITICAL: Embedding Model Consistency

You must use the exact same embedding model for indexing and querying. Mixing models (even different versions of the same model) produces vectors in incompatible spaces, causing retrieval to completely fail. Always store the model name + version in your embedding schema and re-index when upgrading models.
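One cheap safeguard is a runtime guard that compares the model recorded at index time against the model the query path is about to use. The interface and function names below are illustrative, not a library API:

```typescript
// Guard sketch: refuse to query an index built with a different
// embedding model or version than the one the query path uses.
// All names here are illustrative assumptions.

interface IndexMetadata {
  embeddingModel: string;   // e.g. "text-embedding-3-small"
  modelVersion: string;     // provider's version tag
  dimensions: number;
}

function assertCompatible(
  index: IndexMetadata,
  queryModel: string,
  queryVersion: string
): void {
  if (index.embeddingModel !== queryModel ||
      index.modelVersion !== queryVersion) {
    throw new Error(
      `Index built with ${index.embeddingModel}@${index.modelVersion}, ` +
      `query uses ${queryModel}@${queryVersion}: re-index before querying.`
    );
  }
}

// Mismatched model versions fail loudly instead of silently returning
// garbage results from incompatible vector spaces.
const indexMeta: IndexMetadata = {
  embeddingModel: "text-embedding-3-small",
  modelVersion: "1",
  dimensions: 1536,
};
assertCompatible(indexMeta, "text-embedding-3-small", "1"); // ok, no throw
```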

Chunking Strategy: The Underestimated Variable

Chunking — how you split source documents before embedding — has a larger impact on RAG quality than most teams realize. The goal is to produce chunks that are semantically self-contained and small enough to fit in the embedding model's context window, while large enough to carry meaningful context.

Strategy | Chunk Size | Best For | Trade-off
Fixed-size with overlap | 256–512 tokens, 10–15% overlap | General purpose, homogeneous document collections | Simple to implement; may split sentences/paragraphs mid-thought
Sentence-level | 1–5 sentences per chunk | FAQ databases, customer support documents | Preserves semantic boundaries; very small chunks may lack context
Semantic chunking | Variable (follows topic boundaries) | Long-form articles, research papers, documentation | Best quality; requires embedding-based boundary detection; higher indexing cost
Document hierarchy (parent-child) | Child: 128 tokens; Parent: 512 tokens | Documents with sections (API docs, legal texts) | Retrieve small chunks for precision, return parent for context; requires two index layers
✓ RECOMMENDED STARTING CONFIGURATION

For most enterprise RAG applications: 512 tokens per chunk, 64-token overlap, sentence boundary preservation (do not split mid-sentence). This configuration works well across diverse document types and provides a stable baseline for A/B testing chunk size improvements.
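Fixed-size chunking with overlap is only a few lines. The sketch below approximates tokens with whitespace-split words; a production pipeline would use the embedding model's actual tokenizer:

```typescript
// Fixed-size chunking with overlap (sketch). "Tokens" here are
// whitespace-split words, a simplifying assumption for illustration.

function chunkText(text: string, chunkSize = 512, overlap = 64): string[] {
  const tokens = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = chunkSize - overlap;   // each chunk starts 448 tokens later
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= tokens.length) break;  // last chunk reached the end
  }
  return chunks;
}

// 1000 words, 512-word chunks, 64-word overlap -> chunks start at 0, 448, 896.
const words = Array.from({ length: 1000 }, (_, i) => `w${i}`).join(" ");
console.log(chunkText(words).length); // 3
```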

Multi-Tenant Vector Architecture

Multi-tenant SaaS applications face unique challenges with vector databases: tenant data isolation, performance fairness (one large tenant should not slow others), and scaling economics (per-tenant index vs. shared index). The choice of isolation strategy has major architectural implications.

Strategy | Implementation | Isolation Level | Scalability | Best For
Shared index with tenant_id filter | Pre-filter on tenant_id before ANN search | Software (database enforces) | Best for <1000 tenants, uneven sizes | Most SaaS apps; simplest to operate
Namespace per tenant | Pinecone namespaces, Qdrant collections | Logical separation in shared infrastructure | Good; check namespace limits (Pinecone: 20 indexes) | Mid-market SaaS; moderate tenant count
Database per tenant | Turso / separate pgvector per tenant | Complete isolation | Excellent for regulated industries | Healthcare, BFSI, government where hard isolation is required
⚠ APPROXIMATE NEAREST NEIGHBOR AND TENANT ISOLATION

ANN algorithms like HNSW traverse a graph of ALL vectors in the index, then apply post-filters. Without a pre-filter enforced at the database layer (like PostgreSQL RLS or Pinecone's namespace isolation), an ANN query can theoretically visit and score vectors from other tenants before the tenant filter is applied. Always enforce tenant isolation at the database layer, not just the application layer.
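The difference is easy to see in a brute-force sketch: pre-filtering restricts the candidate set to the tenant before any vector is scored, so cross-tenant data is never touched. The names and data below are illustrative:

```typescript
// Pre-filter sketch (in-memory brute force, illustrative only).
// The key property: rows from other tenants are excluded BEFORE
// distance scoring, not filtered out of the results afterwards.

interface Row { id: string; tenantId: string; embedding: number[]; }

function tenantSearch(
  rows: Row[], tenantId: string, query: number[], k: number
): string[] {
  const dist = (v: number[]) =>
    Math.hypot(...v.map((x, i) => x - query[i]));
  return rows
    .filter(r => r.tenantId === tenantId)   // pre-filter: isolation first
    .sort((a, b) => dist(a.embedding) - dist(b.embedding))
    .slice(0, k)
    .map(r => r.id);
}

const rows: Row[] = [
  { id: "a1", tenantId: "acme",   embedding: [0.1] },
  { id: "g1", tenantId: "globex", embedding: [0.0] },  // closest overall
  { id: "a2", tenantId: "acme",   embedding: [0.5] },
];
console.log(tenantSearch(rows, "acme", [0], 2)); // ["a1", "a2"] -- g1 never scored
```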

Performance Optimization: Beyond Index Tuning

Quantization: Shrink Without Sacrifice

Vector quantization reduces memory footprint by representing each dimension with fewer bits. Modern quantization methods maintain high recall while dramatically reducing cost:

Quantization Type | Memory Reduction | Recall Impact | Notes
None (float32) | Baseline | 100% | Full precision; highest memory cost
int8 (scalar quantization) | 75% reduction | >99% | Strong recall; Redis reports 99.99% retention
Binary quantization | 96% reduction | ~95% | Extreme compression; requires rescoring with full vectors
Product quantization (PQ) | ~90% reduction | 95–98% | FAISS standard; good balance; used in Pinecone serverless
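Scalar (int8) quantization is the simplest of these to picture: map each float32 dimension onto 256 levels between the observed minimum and maximum, cutting memory 4x. The sketch below is illustrative; production systems handle min/max calibration per dataset and often rescore with full-precision vectors:

```typescript
// Scalar (int8) quantization sketch: 256 levels between min and max.

function quantize(v: number[], min: number, max: number): Int8Array {
  const scale = (max - min) / 255;
  return Int8Array.from(v, x =>
    Math.round((x - min) / scale) - 128   // shift into int8 range [-128, 127]
  );
}

function dequantize(q: Int8Array, min: number, max: number): number[] {
  const scale = (max - min) / 255;
  return Array.from(q, x => (x + 128) * scale + min);
}

const v = [-1, -0.5, 0, 0.5, 1];
const q = quantize(v, -1, 1);            // 1 byte per dim instead of 4
const restored = dequantize(q, -1, 1);

// Round-trip error stays within one quantization step (~0.008 here),
// which is why recall loss is small.
const maxErr = Math.max(...restored.map((x, i) => Math.abs(x - v[i])));
console.log(maxErr < 2 / 255); // true
```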

Embedding Cache: Eliminate Redundant API Calls

Embedding API calls are the primary variable cost in RAG systems. Caching embeddings for recently seen queries typically reduces API costs by 30–60% in production systems where queries exhibit power-law distribution (a small fraction of queries account for most traffic).

```typescript
// Embedding cache pattern (Redis + TTL)
async function cachedEmbed(text: string): Promise<number[]> {
  const cacheKey = `embed:${sha256(text)}`;

  // Check cache first (Redis GET)
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // Cache miss: call embedding API
  const embedding = await embeddingAPI.embed(text);

  // Cache for 24 hours (embeddings are deterministic for the same model)
  await redis.setex(cacheKey, 86400, JSON.stringify(embedding));
  return embedding;
}
```

Decision Framework: Choosing the Right Architecture

Use this decision framework to select the right vector database architecture for your specific situation. The most common mistake is over-engineering: teams choose dedicated vector databases before they have validated that PostgreSQL + pgvector cannot meet their requirements.

Your Situation | Recommended Architecture | Why
Existing PostgreSQL, <50M vectors | pgvector + pgvectorscale | 471 QPS at 99% recall is sufficient; zero additional infra; lowest cost
Existing PostgreSQL, 50–100M vectors | pgvector + pgvectorscale (evaluate Pinecone if SLA not met) | Test with your actual workload before adding complexity
Greenfield, under 10M vectors, small team | Qdrant Cloud or Weaviate | Best developer experience; best free tier (Qdrant); minimal ops burden
Need hybrid search (vectors + keywords) | Weaviate | Native BM25 + vector fusion; best hybrid search implementation
10–100M vectors, want managed + reliable | Pinecone | Zero ops; proven SLAs; best support; 7ms p99
100M+ vectors, cost-sensitive, ops expertise | Milvus self-hosted / Zilliz Cloud | 70%+ cost savings vs managed; scales to billions
Regulated industry, hard tenant isolation required | PostgreSQL RLS + pgvector or per-tenant databases | Database-native enforcement satisfies SOC 2 / HIPAA auditors

Conclusion

Vector databases are not magic. They are specialized index structures optimized for one operation — approximate nearest neighbor search in high-dimensional space. Understanding this constraint is the key to making good architectural decisions: use vector databases where their specific capability is required, and avoid adding them to systems where relational databases can meet the requirement.

The 2025 benchmark landscape has shifted significantly. pgvectorscale's performance at 50M vectors has narrowed the gap between PostgreSQL extensions and dedicated vector databases to the point where the "start simple" advice is now backed by hard performance numbers. The threshold for adding a dedicated vector database has moved from 10M vectors to 100M vectors for most workloads.

The architecture decision tree in plain language:

  1. If you already run PostgreSQL and have under 100M vectors: start with pgvector. Measure. Upgrade if needed.
  2. If you need hybrid search: add Weaviate or use pgvector's combined GIN + HNSW approach.
  3. If you need managed and reliable above 100M vectors: Pinecone for zero-ops, Milvus for cost control.
  4. The embedding model matters more than the database. Invest in evaluating and upgrading your embedding model before tuning index parameters.
  5. Chunking strategy matters more than people expect. 512 tokens with 64-token overlap is the right default for most applications. Measure retrieval quality, then optimize.

Sources & Further Reading

  1. FireCrawl. (2025). Best Vector Databases of 2025: A Comparative Analysis.
  2. Timescale. (2025). pgvectorscale Benchmarks: PostgreSQL vs. Pinecone for Vector Data.
  3. NexAI. (2025). Vector vs. Relational Databases: Designing for AI.
  4. World Journal of Advanced Engineering Technology and Sciences. (2025). Data Modeling for AI Systems.
  5. Pinecone. (2025). Vector Database: What It Is & How It Works.

Further Reading from the Author

Isaac Shi writes about AI, software, and entrepreneurship at isaacshi.com. These essays provide the strategic and philosophical context behind this thesis.

Essay · Isaac Shi
Montezuma's Revenge
Curiosity as humanity's secret algorithm — why semantic search outperforms keyword lookup, and why exploring meaning beats chasing keywords.
Essay · Isaac Shi
Silicon Cambrian Explosion
Why the explosion of AI intelligence types — including vector-native systems — is the most historically significant moment since the biological Cambrian Explosion.