How to Choose, Architect, and Operate the Semantic Search Infrastructure Behind RAG Pipelines, AI Agents, and Recommendation Engines
The vector database market grew from $1.6 billion in 2023 to a projected $10.6 billion by 2032 — a 23.5% CAGR driven almost entirely by the explosion of Retrieval-Augmented Generation (RAG) applications and enterprise AI deployments. Yet most engineering teams making vector database decisions in 2025 are choosing based on marketing materials rather than production evidence.
The core architectural insight:
The debate is not "vector database vs. relational database." Every production AI system uses both — relational stores for governance, transactions, and metadata; vector stores for semantic retrieval. The decision is where to draw the boundary between them, and for most companies, that boundary is closer to the relational side than vendor marketing suggests.
Before choosing a vector database, it is worth understanding exactly what problem the technology solves — and equally importantly, what it does not solve. Vector databases excel at one specific operation: given a query vector, find the N most similar vectors in the index. This "approximate nearest neighbor" search is the foundation of semantic retrieval.
Traditional databases can answer "find all documents where category = 'finance' and date > '2024-01-01'." They cannot answer "find the 10 documents most semantically similar to this query." Similarity search requires comparing a query vector against every indexed vector and returning the closest matches — an operation that does not map to SQL's equality and range predicates.
| Query Type | SQL Database | Vector Database |
|---|---|---|
| "Find user with id = 12345" | ✅ Primary key lookup — O(1) | ❌ Not designed for this |
| "Find all active users from last 30 days" | ✅ Index scan — efficient | ❌ No native temporal support |
| "Find the 10 most similar documents to this query" | ❌ Requires full table scan | ✅ HNSW search — O(log n) |
| "Find similar documents that are also from this tenant" | ❌ No vector distance | ✅ Pre-filtered ANN search |
| "Find documents matching both keywords AND semantic intent" | ⚠ Full-text only | ✅ Hybrid BM25 + vector |
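The "full table scan" row above can be made concrete with a minimal sketch: brute-force cosine top-k over a NumPy array is exactly the operation a vector database accelerates with an index. Synthetic data, NumPy only; this is an illustration, not a production search path.

```python
import numpy as np

def top_k_similar(query: np.ndarray, vectors: np.ndarray, k: int = 10) -> np.ndarray:
    """Brute-force cosine top-k: what a vector DB computes, minus the index.

    Every stored vector is scored against the query, O(n) per query,
    which is the full-table-scan cost the table above refers to.
    """
    # Normalize so the dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    # Indices of the k highest-scoring vectors, best first
    return np.argsort(-scores)[:k]

# Toy example: 1,000 random 64-dim "embeddings"
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))
hits = top_k_similar(docs[42], docs, k=5)
# A vector is always its own nearest neighbor
assert hits[0] == 42
```

An HNSW index replaces the O(n) scoring loop with a graph traversal, which is why the table above lists O(log n) for the vector database column.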
RAG Applications
The dominant 2025 use case. Retrieve relevant context from a knowledge base before passing it to an LLM. Quality of retrieval determines quality of LLM response. Powers enterprise chatbots, documentation search, and support systems.
Semantic Search
Search by meaning, not keywords. "Running shoes for wide feet" finds "broad-fit athletic footwear" even with zero keyword overlap. Essential for product discovery, knowledge management, and code search.
Agent Memory
AI agents need episodic memory — the ability to recall relevant past interactions or knowledge. Vector search retrieves semantically relevant memories in sub-100ms, enabling agents to maintain coherent context across long sessions.
Recommendation Systems
Match users to items by embedding proximity. Unlike collaborative filtering on explicit ratings, behavioral embeddings capture preferences users never articulated. Powers Netflix-style "because you watched..." at scale.
Anomaly Detection
Flag transactions or events far from the expected cluster in embedding space. Vector distance identifies outliers more naturally than threshold-based rules for fraud detection and security monitoring.
Deduplication
Detect near-duplicate content at scale. Find documents that are semantically equivalent even when phrased differently — critical for data cleaning pipelines and content moderation systems.
An embedding is a dense numerical representation of a piece of content — text, image, audio, code — as a vector of floating-point numbers. The key property: semantically similar content produces geometrically nearby vectors. "The capital of France" and "Paris is the largest city in France" will have similar embeddings even though they share few words.
At query time, the user's question goes through the same embedding model to produce a query vector. The vector database then finds the stored vectors with the highest similarity to the query vector (cosine similarity or dot product) or the smallest distance (Euclidean). This retrieval takes 1–100ms even across millions of chunks.
| Metric | Formula | When to Use | Notes |
|---|---|---|---|
| Cosine Similarity | cos(θ) = A·B / (‖A‖‖B‖) | Text embeddings, semantic search, RAG | Measures angle between vectors; robust to magnitude variation. Best default choice for text. |
| Dot Product | A·B = Σ(aᵢ × bᵢ) | Recommendation systems with pre-normalized vectors | Faster than cosine (no normalization step); requires unit-normalized vectors for meaningful comparison. |
| Euclidean (L2) | ‖A-B‖ = √Σ(aᵢ-bᵢ)² | Image embeddings, clustering tasks | Measures absolute distance in vector space. More sensitive to magnitude than cosine. Good for clustering. |
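The three formulas above map directly to a few lines of NumPy. A minimal sketch with illustrative toy vectors, showing why cosine ignores magnitude while Euclidean does not:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(θ) = A·B / (‖A‖‖B‖)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a, b):
    # A·B = Σ(aᵢ × bᵢ); equals cosine similarity only when both vectors are unit-normalized
    return float(np.dot(a, b))

def euclidean(a, b):
    # ‖A-B‖ = √Σ(aᵢ-bᵢ)²
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])  # same direction, twice the magnitude
assert cosine_similarity(a, b) == 1.0  # angle-based: identical direction
assert euclidean(a, b) == 1.0          # magnitude-sensitive: nonzero distance
```

For unit-normalized vectors the three metrics rank results identically, which is why dot product on pre-normalized vectors is a common performance optimization.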
Hierarchical Navigable Small World (HNSW) is the index algorithm used by Pinecone, Qdrant, Weaviate, pgvector, and most other production vector databases. Understanding HNSW is essential for tuning index parameters for your workload.
HNSW builds a multi-layer graph where each layer is a navigable small world network. The top layers are sparse (few nodes, long-range connections for fast traversal) and the bottom layer is dense (all nodes, short-range connections for precise search). At query time, the algorithm starts at the top layer and greedily navigates toward the query vector, refining the search in each successive denser layer.
Key HNSW Parameters

| Parameter | Meaning | Trade-off |
|---|---|---|
| m | Maximum number of graph connections per node | Higher m improves recall at the cost of memory and build time |
| ef (search) | Size of the candidate list explored at query time | Higher ef improves recall at the cost of query latency |
| ef_construction | Candidate list size during index construction | Higher values build a better graph, more slowly |

Recall vs. Performance Trade-offs
| m | ef | Recall@10 | QPS |
|---|---|---|---|
| 8 | 64 | 95% | ~800 |
| 16 | 64 | 97% | ~500 |
| 32 | 128 | 99% | ~200 |
| 64 | 256 | 99.5% | ~80 |
Approximate values for 1M vectors, 1536 dimensions. Production benchmarks vary significantly by hardware and dataset.
Critical insight on recall benchmarks:
Performance benchmarks only mean something with a recall number attached. Comparing "10ms at 90% recall" to "50ms at 99% recall" is meaningless — they solve different problems. A RAG system at 95% recall misses 1 in 20 relevant documents. At 99%, it misses 1 in 100. That difference determines whether your AI application regularly provides incomplete context or almost never does.
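Recall@k itself is simple to measure: compare an approximate result list against exact brute-force ground truth on a held-out query set. A minimal sketch on synthetic data, using truncated dimensions as a deliberately lossy stand-in for a real ANN index:

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def top_k(q, v, k):
    # Exact brute-force scoring (dot product for simplicity)
    return np.argsort(-(v @ q))[:k]

rng = np.random.default_rng(1)
docs = rng.normal(size=(5000, 128))
query = rng.normal(size=128)

exact = top_k(query, docs, 10)
# Lossy stand-in for an ANN index: score on only the first 32 of 128 dimensions
approx = top_k(query[:32], docs[:, :32], 10)
print(recall_at_k(approx, exact))  # a number in [0, 1]; low here, by design
```

The same harness works against any real index: swap the lossy search for your database's ANN query and sweep its parameters (e.g. ef) to produce a recall/QPS curve like the table above.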
The vector database landscape includes over 20 options in 2025. Based on market adoption and production maturity, six databases cover approximately 80% of enterprise production deployments. Here is an honest comparison, free of vendor marketing.
May 2025 Benchmark (Timescale):
pgvectorscale with DiskANN + Statistical Binary Quantization: 471 QPS at 99% recall on 50M vectors — 11.4x better than Qdrant (41 QPS) at the same recall level. p95 latency 28x lower than Pinecone s1 at 99% recall. Cost savings: ~75% vs managed vector databases at comparable workloads.
After reviewing production AI deployments, four architecture patterns account for the vast majority of enterprise vector database implementations in 2025. Each pattern reflects a specific set of scale, complexity, and operational constraints.
The hybrid approach combines vector similarity (captures semantic intent) with keyword matching (captures exact terms, product names, error codes). Reciprocal Rank Fusion (RRF) merges the two result sets. This outperforms pure vector search in enterprise applications where queries often include specific technical terms, names, or identifiers.
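RRF itself is only a few lines. A minimal sketch, with toy result lists standing in for real ANN and BM25 outputs (k=60 is the constant from the original RRF formulation; ranks are 1-based):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked result lists with RRF: score(d) = Σᵢ 1 / (k + rankᵢ(d)).

    Documents ranked highly by multiple retrievers accumulate the most score,
    without requiring the retrievers' raw scores to be comparable.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # from ANN search
keyword_hits = ["doc_b", "doc_d", "doc_a"]  # from BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# doc_b (ranks 2 and 1) edges out doc_a (ranks 1 and 3)
assert fused[0] == "doc_b"
```

Because RRF operates on ranks rather than scores, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.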
For companies under 50M vectors already running PostgreSQL, pgvector provides roughly 80% of dedicated vector database performance with none of the additional operational cost. The 2025 Timescale benchmarks change the recommendation for early-stage companies: do not add a dedicated vector database until you have empirical evidence that pgvector cannot meet your SLAs.
| Component | Technology |
|---|---|
| Relational + vector | PostgreSQL + pgvector + pgvectorscale |
| Object storage | S3/GCS for source documents and model artifacts |
| Cache | Redis for embedding cache (avoids re-embedding identical text) |
When pgvector hits throughput limits — typically at 100M+ vectors or high concurrent query load — the recommended progression is to separate the vector workload into a dedicated service while keeping the relational core in PostgreSQL.
| Component | Technology | Responsibility |
|---|---|---|
| Primary database | PostgreSQL | User data, permissions, audit logs, metadata |
| Vector search | Pinecone / Weaviate / Qdrant | ANN search over embedding index |
| Cache layer | Redis | Recent embeddings, query result cache, rate limiting |
| Object store | S3 / R2 | Source documents, model checkpoints, batch exports |
| Message queue | SQS / Pub/Sub | Embedding pipeline jobs, document ingestion queue |
A RAG pipeline has two phases: indexing (offline) and retrieval (online). Most production failures come from under-engineering the indexing phase, not the retrieval phase.
A simple improvement that consistently improves RAG retrieval quality: before embedding the user's query, use an LLM to generate 3-5 alternative phrasings. Embed all variations and merge the result sets before RRF. This compensates for the mismatch between how users phrase questions and how documents are written.
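A sketch of that expansion step, assuming a hypothetical `llm_paraphrase` call and an `embed_and_search` stand-in for your embedding model and vector store (both stubbed so the sketch is self-contained and runnable):

```python
def llm_paraphrase(query: str, n: int = 4) -> list[str]:
    # In production: one LLM call asking for n alternative phrasings.
    # Stubbed here with synthetic variants for illustration.
    return [f"{query} (variant {i})" for i in range(n)]

def embed_and_search(query: str, k: int = 10) -> list[str]:
    # Stand-in for: embed(query) -> ANN search -> ranked doc ids
    return [f"doc_{hash(query + str(r)) % 50}" for r in range(k)]

def multi_query_retrieve(query: str, k: int = 10) -> list[list[str]]:
    """Embed the original query plus paraphrases; return one ranked list each.

    Merge the returned lists with Reciprocal Rank Fusion before reranking.
    """
    variants = [query] + llm_paraphrase(query)
    return [embed_and_search(v, k) for v in variants]

result_lists = multi_query_retrieve("how do I reset my password?")
```

The extra cost is one LLM call and a few embedding calls per query; whether that trade is worth it depends on how far your users' phrasing drifts from your documents' phrasing.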
The embedding model is the single most impactful choice in a RAG system — more impactful than which vector database you use. A better embedding model will improve retrieval quality regardless of which database stores the vectors. A poor embedding model cannot be rescued by any database optimization.
| Model | Dimensions | Best For | Cost | Notes |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | General English text, RAG, semantic search | $0.02/1M tokens | Best price/performance ratio for English. Supports Matryoshka (reducible to 512 dims with minimal loss). |
| OpenAI text-embedding-3-large | 3072 | High-stakes retrieval where quality > cost | $0.13/1M tokens | Best OpenAI quality. 6.5x more expensive than small — verify improvement before upgrading. |
| Cohere embed-v3 | 1024 | Multi-language, enterprise, high throughput | ~$0.10/1M tokens | Strong multilingual support (100+ languages). Separate models for search vs. classification. |
| BGE-m3 (BAAI, open-source) | 1024 | Self-hosted deployments, cost control | Infrastructure only | State-of-art open-source. Supports dense, sparse, and multi-vector retrieval simultaneously. |
| Jina Embeddings v3 | 1024 | Long document chunks (up to 8K tokens) | ~$0.02/1M tokens | Context window up to 8192 tokens per chunk — reduces chunking complexity. |
You must use the exact same embedding model for indexing and querying. Mixing models (even different versions of the same model) produces vectors in incompatible spaces, causing retrieval to completely fail. Always store the model name + version in your embedding schema and re-index when upgrading models.
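One way to enforce this is to store the model identifier alongside the index and fail fast on any mismatch at query time. A sketch with illustrative names (the record shape and model string are assumptions, not a specific library's API):

```python
# Pinned model identifier, stored once per index/collection alongside the vectors
INDEX_MODEL = "text-embedding-3-small@v1"

def make_record(doc_id: str, vector: list[float], model: str) -> dict:
    # Persist the model id with every embedding so re-index audits are possible
    return {"id": doc_id, "vector": vector, "model": model}

def query_index(query_vector: list[float], query_model: str, index_model: str):
    if query_model != index_model:
        raise ValueError(
            f"Embedding model mismatch: index={index_model!r}, query={query_model!r}. "
            "Re-index or switch the query model; mixed spaces fail silently."
        )
    ...  # proceed with ANN search

record = make_record("doc_1", [0.1, 0.2], INDEX_MODEL)
query_index([0.1, 0.2], "text-embedding-3-small@v1", INDEX_MODEL)  # passes the guard
```

Failing loudly here is the point: a model mismatch does not raise errors on its own — it just returns irrelevant results.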
Chunking — how you split source documents before embedding — has a larger impact on RAG quality than most teams realize. The goal is to produce chunks that are semantically self-contained and small enough to fit in the embedding model's context window, while large enough to carry meaningful context.
| Strategy | Chunk Size | Best For | Trade-off |
|---|---|---|---|
| Fixed-size with overlap | 256–512 tokens, 10–15% overlap | General purpose, homogeneous document collections | Simple to implement; may split sentences/paragraphs mid-thought |
| Sentence-level | 1–5 sentences per chunk | FAQ databases, customer support documents | Preserves semantic boundaries; very small chunks may lack context |
| Semantic chunking | Variable (follows topic boundaries) | Long-form articles, research papers, documentation | Best quality; requires embedding-based boundary detection; higher indexing cost |
| Document hierarchy (parent-child) | Child: 128 tokens; Parent: 512 tokens | Documents with sections (APIs docs, legal texts) | Retrieve small chunks for precision, return parent for context; requires two index layers |
For most enterprise RAG applications: 512 tokens per chunk, 64-token overlap, sentence boundary preservation (do not split mid-sentence). This configuration works well across diverse document types and provides a stable baseline for A/B testing chunk size improvements.
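The recommended configuration can be sketched as a greedy, sentence-preserving chunker. Two stated simplifications: a naive regex sentence splitter and word count as a rough token proxy — production code should use a real tokenizer and sentence segmenter.

```python
import re

def chunk_text(text: str, max_tokens: int = 512, overlap_tokens: int = 64) -> list[str]:
    """Pack whole sentences into chunks of at most max_tokens, carrying
    roughly overlap_tokens of trailing sentences into the next chunk.

    Caveat: a single sentence longer than max_tokens becomes its own
    oversized chunk; handle that case explicitly in production.
    """
    # Naive sentence split on terminal punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # word count as a rough token proxy
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Retain trailing sentences as overlap for the next chunk
            kept, kept_len = [], 0
            for prev in reversed(current):
                kept_len += len(prev.split())
                kept.insert(0, prev)
                if kept_len >= overlap_tokens:
                    break
            current, current_len = kept, kept_len
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "First sentence here. " * 200
chunks = chunk_text(doc, max_tokens=50, overlap_tokens=10)
```

Keeping the chunker a pure function of (text, parameters) makes chunk-size A/B tests a matter of re-indexing with different arguments.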
Multi-tenant SaaS applications face unique challenges with vector databases: tenant data isolation, performance fairness (one large tenant should not slow others), and scaling economics (per-tenant index vs. shared index). The choice of isolation strategy has major architectural implications.
| Strategy | Implementation | Isolation Level | Scalability | Best For |
|---|---|---|---|---|
| Shared index with tenant_id filter | Pre-filter on tenant_id before ANN search | Software (database enforces) | Good for <1,000 tenants, even with uneven sizes | Most SaaS apps; simplest to operate |
| Namespace per tenant | Pinecone namespaces, Qdrant collections | Logical separation in shared infrastructure | Good; check namespace limits (Pinecone: 20 indexes) | Mid-market SaaS; moderate tenant count |
| Database per tenant | Turso / separate pgvector per tenant | Complete isolation | Excellent for regulated industries | Healthcare, BFSI, government where hard isolation required |
With naive post-filtering, ANN algorithms like HNSW traverse a graph built over all vectors in the index and apply the tenant filter only after candidates are scored. Without a pre-filter enforced at the database layer (like PostgreSQL RLS or Pinecone's namespace isolation), an ANN query can visit and score vectors from other tenants before the tenant filter is applied. Always enforce tenant isolation at the database layer, not just the application layer.
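The pre- vs. post-filter difference can be demonstrated with a brute-force toy (NumPy only; the tenant assignment is synthetic). Post-filtering scores every tenant's vectors and can return fewer than k results; pre-filtering restricts the candidate set before any scoring happens:

```python
import numpy as np

rng = np.random.default_rng(2)
vectors = rng.normal(size=(1000, 32))
tenant_ids = rng.choice(["tenant_a", "tenant_b"], size=1000)
query = rng.normal(size=32)

def post_filtered(q, k=10):
    # Anti-pattern: score ALL tenants' vectors, filter afterwards.
    # Other tenants' data is visited, and the final list may come up short.
    top = np.argsort(-(vectors @ q))[:k]
    return [int(i) for i in top if tenant_ids[i] == "tenant_a"]

def pre_filtered(q, k=10):
    # Correct: restrict the candidate set to the tenant BEFORE searching.
    idx = np.flatnonzero(tenant_ids == "tenant_a")
    top = np.argsort(-(vectors[idx] @ q))[:k]
    return [int(idx[i]) for i in top]

assert len(pre_filtered(query)) == 10
assert len(post_filtered(query)) <= 10  # typically fewer than k survive the filter
```

In a real HNSW index the same logic applies at graph-traversal level, which is why the filter must be enforced where the index lives, not in application code afterwards.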
Vector quantization reduces memory footprint by representing each dimension with fewer bits. Modern quantization methods maintain high recall while dramatically reducing cost:
| Quantization Type | Memory Reduction | Recall Impact | Notes |
|---|---|---|---|
| None (float32) | Baseline | 100% | Full precision; highest memory cost |
| int8 (scalar quantization) | 75% reduction | >99% | Strong recall; Redis reports 99.99% retention |
| Binary quantization | 96% reduction | ~95% | Extreme compression; requires rescoring with full vectors |
| Product quantization (PQ) | ~90% reduction | 95–98% | FAISS standard; good balance; used in Pinecone serverless |
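A minimal sketch of scalar (int8) quantization in NumPy, showing the 75% memory reduction from the table. Real engines add per-segment calibration and rescoring with full-precision vectors on top of this basic scheme:

```python
import numpy as np

def scalar_quantize(vectors: np.ndarray):
    """Per-dimension int8 quantization: 4 bytes/dim -> 1 byte/dim."""
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    scale = (hi - lo) / 255.0                 # map each dimension's range onto 0..255
    codes = np.round((vectors - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(3)
vecs = rng.normal(size=(1000, 128)).astype(np.float32)
codes, lo, scale = scalar_quantize(vecs)
restored = dequantize(codes, lo, scale)

assert codes.nbytes == vecs.nbytes // 4       # 75% memory reduction
assert np.abs(vecs - restored).max() < scale.max()  # bounded per-dimension error
```

The reconstruction error per dimension is bounded by half the quantization step, which is why recall loss stays small when the quantized scores are used only to shortlist candidates for full-precision rescoring.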
Embedding API calls are the primary variable cost in RAG systems. Caching embeddings for recently seen queries typically reduces API costs by 30–60% in production systems where queries exhibit power-law distribution (a small fraction of queries account for most traffic).
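A content-addressed cache makes this concrete. A minimal sketch with an in-process dict standing in for Redis and a stubbed `embed_fn` in place of a real embedding API client:

```python
import hashlib

class EmbeddingCache:
    """Content-addressed cache: identical text never hits the embedding API twice."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._cache = {}   # in production: Redis, keyed the same way
        self.hits = 0
        self.misses = 0

    def get(self, text: str):
        # Hash the exact text so the key is stable and fixed-size
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self.embed_fn(text)
        return self._cache[key]

cache = EmbeddingCache(embed_fn=lambda t: [float(len(t))])  # stub embedder
cache.get("reset my password")
cache.get("reset my password")   # served from cache, no second API call
assert (cache.hits, cache.misses) == (1, 1)
```

With a power-law query distribution, even a modest cache with a TTL captures the head of the traffic, which is where the 30–60% savings quoted above come from.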
Use this decision framework to select the right vector database architecture for your specific situation. The most common mistake is over-engineering: teams choose dedicated vector databases before they have validated that PostgreSQL + pgvector cannot meet their requirements.
| Your Situation | Recommended Architecture | Why |
|---|---|---|
| Existing PostgreSQL, <50M vectors | pgvector + pgvectorscale | 471 QPS at 99% recall is sufficient; zero additional infra; lowest cost |
| Existing PostgreSQL, 50–100M vectors | pgvector + pgvectorscale (evaluate Pinecone if SLA not met) | Test with your actual workload before adding complexity |
| Greenfield, under 10M vectors, small team | Qdrant Cloud or Weaviate | Best developer experience; best free tier (Qdrant); minimal ops burden |
| Need hybrid search (vectors + keywords) | Weaviate | Native BM25 + vector fusion; best hybrid search implementation |
| 10–100M vectors, want managed + reliable | Pinecone | Zero ops; proven SLAs; best support; 7ms p99 |
| 100M+ vectors, cost-sensitive, ops expertise | Milvus self-hosted / Zilliz Cloud | 70%+ cost savings vs managed; scales to billions |
| Regulated industry, hard tenant isolation required | PostgreSQL RLS + pgvector or per-tenant databases | Database-native enforcement satisfies SOC 2 / HIPAA auditors |
Vector databases are not magic. They are specialized index structures optimized for one operation — approximate nearest neighbor search in high-dimensional space. Understanding this constraint is the key to making good architectural decisions: use vector databases where their specific capability is required, and avoid adding them to systems where relational databases can meet the requirement.
The 2025 benchmark landscape has shifted significantly. pgvectorscale's performance at 50M vectors has narrowed the gap between PostgreSQL extensions and dedicated vector databases to the point where the "start simple" advice is now backed by hard performance numbers. The threshold for adding a dedicated vector database has moved from 10M vectors to 100M vectors for most workloads.
The architecture decision tree in plain language: start with PostgreSQL + pgvector and benchmark against your actual workload; add a dedicated vector database (Pinecone, Qdrant, or Weaviate) only when you have empirical evidence that pgvector cannot meet your SLAs; move to self-hosted Milvus or Zilliz Cloud past 100M vectors when cost sensitivity and ops expertise justify it; and in regulated industries, enforce tenant isolation at the database layer from day one.
Isaac Shi writes about AI, software, and entrepreneurship at isaacshi.com. These essays provide the strategic and philosophical context behind this thesis.