THOR THUNDERSCAN · THESIS Data Architecture

Database Normalization & Schema Design

Why a Properly Normalized Schema Is the Load-Bearing Wall of Every Successful Software Application

Isaac Shi, Co-Founder & GP of Golden Section VC
April 7, 2025

Executive Summary

Every software application — from a two-table side project to a billion-row enterprise platform — is only as solid as the relational schema beneath it. Poor normalization decisions made in week one routinely surface as catastrophic data inconsistencies, cascading bugs, and multi-year refactoring debts in year three. IBM research estimated that U.S. businesses alone lose $3.1 trillion annually to poor data quality — and the majority of those losses trace back not to hardware or network failures, but to structural design decisions that were never enforced at the schema level. A 2025 IBM IBV report confirms the problem is worsening: over a quarter of organizations now estimate losses exceeding $5 million annually due to data quality issues.

The core argument of this article:

Normalization is not a theoretical exercise from a 1970s textbook. It is the engineering discipline that determines whether your application can grow, whether your data can be trusted, and whether your AI initiatives have a foundation worth building on. The teams that treat schema design as a first-class architectural concern ship faster, debug less, and scale further.

  • $3.1T — lost annually to poor data quality in the U.S. alone (IBM/Redman, 2016)
  • 68% — of critical business applications require multi-step transactional integrity to prevent data corruption (Percona, 2024)
  • 4–10× — more expensive to fix tech debt post-launch than at design time (McKinsey Digital, 2022)

Why Normalization Is Architecture

Edgar F. Codd introduced relational normalization in his landmark 1970 paper "A Relational Model of Data for Large Shared Data Banks". His insight was deceptively simple: by organizing data into relations (tables) that follow precise mathematical rules, you eliminate entire classes of anomalies — insert anomalies, update anomalies, and delete anomalies — at the structural level, before a single line of application code is written.

Most engineers understand normalization superficially: "don't repeat data." But the deeper principle is about dependency management. Every column in a table should depend on the key, the whole key, and nothing but the key. When that rule breaks down, your database becomes an inconsistency engine — a system that allows true and false to coexist silently.

The Three Fundamental Anomalies Normalization Prevents

INSERT ANOMALY

You cannot record a fact without recording an unrelated fact. Example: you cannot add a new course to your catalog until at least one student enrolls — because the course lives inside the student-enrollment table.

UPDATE ANOMALY

A single real-world change requires updating multiple rows. If a customer changes their address and it is stored in every order row, you must update dozens of records in perfect unison — or the database becomes inconsistent, with the customer simultaneously having two addresses.

DELETE ANOMALY

Deleting one entity accidentally destroys knowledge about another. Example: deleting the last student enrolled in a course also deletes all metadata about that course — its instructor, credits, and description vanish with the record.

These are not edge cases. They are the everyday reality of applications built on under-normalized schemas. ThunderScan detects all three anomaly patterns through static schema analysis — and finds at least one in over 80% of the production databases it scans.
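The update anomaly is the easiest of the three to reproduce. The sketch below uses Python's built-in sqlite3 and a hypothetical denormalized orders table (illustrative names, standing in for any RDBMS) to show how a partial update leaves the database asserting two contradictory facts at once:

```python
import sqlite3

# A denormalized orders table that repeats the customer's address on every row.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, address TEXT)")
con.executemany("INSERT INTO orders (customer, address) VALUES (?, ?)",
                [("Acme", "1 Old Rd"), ("Acme", "1 Old Rd"), ("Acme", "1 Old Rd")])

# The update anomaly: a change that should touch one fact touches only one of
# three copies, and the schema is powerless to notice.
con.execute("UPDATE orders SET address = '2 New St' WHERE id = 1")

addresses = {row[0] for row in con.execute(
    "SELECT DISTINCT address FROM orders WHERE customer = 'Acme'")}
print(addresses)  # two conflicting "truths" for a single customer
```

In a normalized schema the address lives in exactly one row of a customers table, so a partial update of this kind is structurally impossible.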

First Principles Framing

Think of your database schema the way a structural engineer thinks about a building's load-bearing walls. You can renovate freely as long as you never cut a load-bearing wall. In databases, the load-bearing walls are your normalization structure and referential constraints. Violate them once, and every subsequent floor you add increases structural risk.

The Normal Forms, Explained for Practitioners

Normal forms are progressive levels of schema discipline. Each higher form adds a stricter rule about how columns can depend on each other. In practice, achieving Third Normal Form (3NF) is the industry standard for transactional systems, while Boyce-Codd Normal Form (BCNF) covers edge cases in tables with multiple candidate keys.

1NF
First Normal Form — Atomic Values

Every column contains only atomic (indivisible) values. No repeating groups, no comma-separated lists, no JSON blobs masquerading as columns. Each row is uniquely identifiable by a primary key.

VIOLATION EXAMPLE
users.roles = "admin,editor,viewer" — storing multiple values in one column breaks 1NF and makes every query that filters by role either wrong or catastrophically slow.
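A minimal sketch of the 1NF-compliant alternative, using Python's built-in sqlite3 and a hypothetical user_roles junction table: one row per (user, role), so each value is atomic and filtering by role becomes an indexed equality match instead of string parsing:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# 1NF-compliant alternative to users.roles = "admin,editor,viewer":
# one atomic value per row, uniquely keyed by (user_id, role).
con.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE NOT NULL);
CREATE TABLE user_roles (
    user_id INTEGER NOT NULL REFERENCES users(id),
    role    TEXT    NOT NULL,
    PRIMARY KEY (user_id, role)
);
""")
con.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")
con.executemany("INSERT INTO user_roles VALUES (1, ?)",
                [("admin",), ("editor",), ("viewer",)])

# Filtering by role is now an equality match on a keyed column,
# not a LIKE pattern over a comma-separated blob.
admins = [r[0] for r in con.execute(
    "SELECT u.email FROM users u JOIN user_roles ur ON ur.user_id = u.id "
    "WHERE ur.role = 'admin'")]
print(admins)
```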
2NF
Second Normal Form — No Partial Dependencies

Applies to tables with composite primary keys. Every non-key column must depend on the entire composite key, not just part of it. A table with a (order_id, product_id) composite key should not store the product's description — that depends only on product_id, making it a partial dependency.

-- ✗ VIOLATES 2NF: product_name depends only on product_id, not the full key
CREATE TABLE order_items (
    order_id     INT,
    product_id   INT,
    product_name VARCHAR(255),  -- partial dependency!
    quantity     INT,
    PRIMARY KEY (order_id, product_id)
);

-- ✓ 2NF-compliant: separate the product entity
CREATE TABLE products (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(255) NOT NULL
);

CREATE TABLE order_items (
    order_id   INT,
    product_id INT REFERENCES products(product_id),
    quantity   INT NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
3NF
Third Normal Form — No Transitive Dependencies

Every non-key column must depend directly on the primary key, not on another non-key column. If column C depends on column B, which depends on the key A — that transitive dependency (A → B → C) must be broken by moving B and C into their own table.

-- ✗ VIOLATES 3NF: city depends on zip_code, not on user_id directly
CREATE TABLE users (
    user_id  INT PRIMARY KEY,
    zip_code CHAR(5),
    city     VARCHAR(100)  -- transitive: user → zip → city
);

-- ✓ 3NF-compliant
CREATE TABLE zip_codes (
    zip_code CHAR(5) PRIMARY KEY,
    city     VARCHAR(100) NOT NULL
);

CREATE TABLE users (
    user_id  INT PRIMARY KEY,
    zip_code CHAR(5) REFERENCES zip_codes(zip_code)
);
BCNF
Boyce-Codd Normal Form — Every Determinant is a Candidate Key

A stricter version of 3NF that handles edge cases with multiple overlapping candidate keys. For every non-trivial functional dependency X → Y, X must be a superkey. In practice, most well-designed 3NF schemas are already BCNF-compliant unless the table has two or more composite candidate keys that share columns.

Practitioner Rule of Thumb: Aim for 3NF as your baseline. Denormalize deliberately — only for specific, measured performance needs — and document every deviation. Accidental denormalization is technical debt; intentional denormalization with compensating controls is a pragmatic engineering decision.

Normalization is the bedrock trail of software architecture — unglamorous, disciplined, and the reason nothing collapses under load.

OLTP vs. OLAP: Two Databases, Two Design Philosophies

One of the most consequential and most frequently misunderstood decisions in data architecture is whether you are building a transactional system or an analytical system — because the schema design rules are fundamentally different, and conflating the two is a direct path to a system that does neither job well.

OLTP
Online Transaction Processing

Purpose: Power the live application. Handle every user action — sign-ups, orders, payments, updates — with low latency and ACID guarantees.

  • Highly normalized (3NF or BCNF)
  • Optimized for concurrent writes
  • Short, targeted transactions
  • Row-by-row access patterns
  • Indexes on PKs, FKs, lookup columns
  • Strong referential integrity (FK constraints)
  • Low data volume per query
Examples: PostgreSQL, MySQL, Oracle, SQL Server — powering e-commerce checkouts, banking transactions, CRM operations, SaaS feature workflows.
OLAP
Online Analytical Processing

Purpose: Power business intelligence, reporting, ML feature engineering, and data science. Aggregate across billions of rows to answer strategic questions.

  • Intentionally denormalized (star/snowflake schemas)
  • Optimized for column-scan reads
  • Long-running analytical queries
  • Aggregations over millions of rows
  • Columnar storage (Parquet, ORC)
  • Relaxed FK enforcement for performance
  • High data volume per query
Examples: Snowflake, BigQuery, Redshift, ClickHouse — powering executive dashboards, revenue attribution, churn prediction, cohort analysis.

The OLTP vs. OLAP Design Matrix

Design Dimension | OLTP | OLAP
Schema style | Normalized (3NF/BCNF) | Denormalized (star/snowflake)
Primary operations | INSERT, UPDATE, DELETE | SELECT, GROUP BY, window functions
Latency target | < 10 ms per query | Seconds to minutes acceptable
Concurrency model | Thousands of concurrent users | Tens of concurrent analysts
Data freshness | Real-time, millisecond precision | Hourly / daily ETL is standard
Storage format | Row-oriented (heap files) | Column-oriented (Parquet, ORC)
Index strategy | B-tree on PKs, FKs, filters | Bitmap, zone maps, bloom filters
Join style | Many normalized joins | Pre-joined wide tables preferred
Data volume | GB to low-TB range | TB to PB range
Integrity enforcement | Strict FK + CHECK constraints | Application-layer or ETL-layer
History tracking | Latest state (mutable rows) | Immutable append-only history
Failure impact | User-facing — immediate, visible | Reporting lag — delayed, reported

The Star Schema: OLAP's Answer to Normalization

Where OLTP uses 3NF to eliminate redundancy, OLAP uses the Star Schema to deliberately re-introduce controlled redundancy in exchange for query performance. A star schema has one central fact table (containing measurable business events: sales amounts, page views, transactions) surrounded by dimension tables (describing the who, what, where, when of those events).

-- OLAP Star Schema: a sales data warehouse
CREATE TABLE dim_customers (
    customer_key INT PRIMARY KEY,  -- surrogate key
    customer_id  VARCHAR(36),      -- natural key from OLTP
    full_name    VARCHAR(200),
    country      VARCHAR(100),
    segment      VARCHAR(50)       -- denormalized for fast GROUP BY
);

CREATE TABLE dim_products (
    product_key  INT PRIMARY KEY,
    product_id   VARCHAR(36),
    product_name VARCHAR(255),
    category     VARCHAR(100),     -- denormalized from category table
    brand        VARCHAR(100)
);

CREATE TABLE dim_dates (
    date_key     INT PRIMARY KEY,
    full_date    DATE,
    year         INT,
    quarter      INT,
    month        INT,
    week_of_year INT,
    is_holiday   BOOLEAN
);

-- The fact table: one row per business event
CREATE TABLE fact_sales (
    sale_key     BIGINT PRIMARY KEY,
    customer_key INT REFERENCES dim_customers,
    product_key  INT REFERENCES dim_products,
    date_key     INT REFERENCES dim_dates,
    amount       NUMERIC(12,2),
    quantity     INT,
    discount_pct NUMERIC(5,2)
);
WHY THIS WORKS FOR OLAP

Analysts can query SELECT country, SUM(amount) FROM fact_sales JOIN dim_customers USING (customer_key) GROUP BY country in seconds because the country is pre-joined into the dimension, avoiding runtime joins across normalized geography tables. The denormalization cost (storage) is worth the analytical gain.

THE CRITICAL MISTAKE: MIXING OLTP AND OLAP WORKLOADS

Running analytical queries directly against your production OLTP database is one of the most common and most expensive operational mistakes in B2B SaaS. A single SELECT COUNT(*) FROM events GROUP BY user_id that scans 50M rows can lock pages, spike CPU to 100%, and cause timeouts for every active user — all from one analyst's ad-hoc report. The correct architecture is a separate data warehouse (Snowflake, BigQuery, Redshift) fed by a CDC pipeline from OLTP, keeping both systems optimized for their intended workloads.

Referential Integrity & Foreign Key Dependencies

Referential integrity is the guarantee that every foreign key value in one table corresponds to an existing row in the referenced table. It is the database's equivalent of a contract: if an order references customer_id = 42, then a customer with id = 42 must exist. Without this contract, your data is a legal fiction — a set of claims that may or may not correspond to reality.

Shockingly, a large fraction of production systems — particularly those built rapidly in early startup stages or migrated from NoSQL — operate without foreign key constraints. Developers disable them for "performance" or "flexibility," not realizing they are trading away the foundation of data reliability for a marginal write throughput gain.

WITH FK CONSTRAINTS
  • ✓ Database rejects invalid references at insert time
  • ✓ Cascading deletes/nulls handled automatically
  • ✓ Query optimiser can exploit FK for join planning
  • ✓ Self-documenting schema relationships
  • ✓ AI tools can auto-infer entity relationships
WITHOUT FK CONSTRAINTS
  • ✗ Silent orphan row accumulation over months
  • ✗ Application logic must manually enforce integrity
  • ✗ Bugs in one service corrupt another service's view
  • ✗ Impossible to reason about data correctness
  • ✗ AI models trained on corrupted relationship data
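The first guarantee in the left-hand column, rejection of invalid references at insert time, can be demonstrated in a few lines with Python's built-in sqlite3 (note that SQLite enforces foreign keys only after PRAGMA foreign_keys = ON on each connection; the table names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in per connection
con.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id)
);
""")
con.execute("INSERT INTO customers (id, name) VALUES (42, 'Ada')")
con.execute("INSERT INTO orders (customer_id) VALUES (42)")  # valid reference: accepted

rejected = False
try:
    con.execute("INSERT INTO orders (customer_id) VALUES (99)")  # no such customer
except sqlite3.IntegrityError as e:
    rejected = True
    print("rejected:", e)
```

With the constraint in place, the orphan row never enters the database; without it, the same insert succeeds silently and the cleanup becomes someone else's problem months later.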

Foreign Key Cascade Strategies

When you delete or update a parent row, the database must decide what to do with child rows that reference it. PostgreSQL (and most RDBMS) offer four strategies, each with distinct semantics:

Strategy | ON DELETE / ON UPDATE Behavior | Best Used When | Risk
CASCADE | Deletes / updates children automatically | Children have no independent existence (order_items → orders) | Accidental mass deletion if a parent is deleted without warning
SET NULL | Sets FK column to NULL in children | Children may exist independently (posts → deleted users) | Requires FK column to be nullable; must handle NULLs in queries
RESTRICT | Prevents deletion if children exist | Hard business rule — a company with employees cannot be deleted | Can frustrate UX if user-facing; must present clear error
SET DEFAULT | Sets FK to column's default value | Reassigning orphans to a "system" or "unknown" entity | Default must reference a valid parent; easy to get wrong
-- Well-designed referential integrity with explicit cascade strategy
CREATE TABLE organizations (
    id         UUID DEFAULT gen_random_uuid() PRIMARY KEY,
    name       TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE users (
    id              UUID DEFAULT gen_random_uuid() PRIMARY KEY,
    organization_id UUID NOT NULL REFERENCES organizations(id) ON DELETE RESTRICT,
    email           TEXT UNIQUE NOT NULL,
    created_at      TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE audit_logs (
    id         BIGSERIAL PRIMARY KEY,
    user_id    UUID REFERENCES users(id) ON DELETE SET NULL,  -- preserve logs even after user deletion
    action     TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE orders (
    id         UUID DEFAULT gen_random_uuid() PRIMARY KEY,
    user_id    UUID NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE order_items (
    id         BIGSERIAL PRIMARY KEY,
    order_id   UUID NOT NULL REFERENCES orders(id) ON DELETE CASCADE,  -- items die with the order
    product_id UUID NOT NULL REFERENCES products(id) ON DELETE RESTRICT,
    quantity   INT NOT NULL CHECK (quantity > 0)
);

The "FK Kills Performance" Myth

The most common objection to foreign keys is that they slow down writes. In modern PostgreSQL (v14+) and MySQL (v8.0+), a well-indexed foreign key costs approximately 1–3% write overhead. The integrity guarantee it provides is worth orders of magnitude more in avoided debugging time, data recovery costs, and audit compliance. The only valid reason to defer FK enforcement is during bulk-load operations — and even then, constraints should be re-enabled and validated immediately after the load completes.

Orphan Data & Cascading Failures

An orphan row is a record whose foreign key references a parent that no longer exists. It is a ghost: present in the database, but referring to nothing real. Orphan data is the single most common data corruption pattern ThunderScan detects — and it is almost always invisible until it causes a customer-facing failure or an audit exception.

How Orphans Are Born

Orphan rows accumulate through predictable failure modes:

Root Cause | Mechanism | Typical Scale of Orphans
No FK constraint | Application deletes parent but forgets children; or two services act on the same data independently | Often millions of rows over months
Async microservices | Service A deletes a user; Service B still creates records for that user_id before receiving the event | Dozens to thousands per incident
Migration scripts | Data migrations that copy parent records without copying children, or vice versa | Thousands to millions in one event
Soft delete inconsistency | "Deleted" parent still has children that queries exclude, but reports still count | Entire child tables affected
Multi-tenancy bugs | Tenant isolation failure: records created under tenant A reference parent rows under tenant B | Scattered, hard to detect

Real-World Consequences of Orphan Data

  • NPE — Null pointer exceptions when application code dereferences a parent that no longer exists
  • $$$ — Billing discrepancies when orphaned subscription records produce invoices for deleted users
  • GDPR — Compliance violations when "deleted" users' data persists in orphaned child tables after an erasure request

Beyond application errors, orphan data is a GDPR Article 17 ("Right to Erasure") time bomb. When a user requests deletion, the naive implementation deletes the user row. Without cascades or compensating cleanup logic, every order, session, event, comment, and audit log referencing that user_id remains intact — a directly linked personal data set that is now legally required to be gone but physically still present.
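A short sqlite3 sketch (hypothetical users and sessions tables; SQLite standing in for PostgreSQL) of how ON DELETE CASCADE makes the erasure request complete in a single statement:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE sessions (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE
);
""")
con.execute("INSERT INTO users VALUES (1, 'gone@example.com')")
con.executemany("INSERT INTO sessions (user_id) VALUES (?)", [(1,), (1,), (1,)])

# The erasure request: one DELETE, and the cascade removes every child row.
con.execute("DELETE FROM users WHERE id = 1")
remaining = con.execute("SELECT COUNT(*) FROM sessions").fetchone()[0]
print(remaining)  # no orphaned personal data left behind
```

Without the cascade (or equivalent cleanup logic per child table), the same DELETE either fails or, worse, succeeds while leaving the sessions rows orphaned.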

Detecting and Eliminating Existing Orphans

-- Detect orphan order_items where the parent order was deleted
SELECT oi.id, oi.order_id, oi.product_id
FROM order_items oi
LEFT JOIN orders o ON o.id = oi.order_id
WHERE o.id IS NULL;

-- Detect orphan audit_logs where the user was hard-deleted
SELECT al.id, al.user_id, al.action, al.created_at
FROM audit_logs al
LEFT JOIN users u ON u.id = al.user_id
WHERE al.user_id IS NOT NULL  -- intentionally set NULLs are OK
  AND u.id IS NULL;           -- but broken references are not

-- Remediate: archive orphans before deleting (never delete without archiving)
INSERT INTO orphan_archive
SELECT 'order_items' AS source_table, id::text, now()
FROM order_items
WHERE order_id NOT IN (SELECT id FROM orders);
SOFT DELETE TRAP

Implementing soft deletes via a deleted_at TIMESTAMPTZ column solves the orphan problem only if every query that filters parents also propagates the WHERE deleted_at IS NULL condition to the children. The most common mistake is filtering out soft-deleted parents in application code while leaving their children visible, creating "zombie" child records that appear active but belong to deleted entities. Use database views or row-level security policies to enforce soft-delete filtering at the database layer, not the application layer.
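One way to enforce the filter at the database layer is a view over the soft-deleting table, sketched here with Python's built-in sqlite3 and a hypothetical projects table (illustrative names; the same pattern works with PostgreSQL views or RLS policies):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE projects (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    deleted_at TEXT              -- NULL = live row
);
-- The soft-delete filter is written exactly once, in the schema:
CREATE VIEW live_projects AS
    SELECT id, name FROM projects WHERE deleted_at IS NULL;
""")
con.execute("INSERT INTO projects (id, name) VALUES (1, 'alpha')")
con.execute("INSERT INTO projects (id, name, deleted_at) VALUES (2, 'beta', '2025-01-01')")

# Readers query the view and cannot forget the WHERE deleted_at IS NULL clause.
live = [r[0] for r in con.execute("SELECT name FROM live_projects")]
print(live)
```

Application code that only ever reads from live_projects can never resurrect a zombie row, no matter which service or analyst writes the query.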


God Tables & the Anti-Pattern Taxonomy

A God Table is a single database table that has grown — through years of feature additions, misplaced convenience, and unreviewed migrations — to own dozens or hundreds of columns, represent multiple distinct entities, and serve as the catch-all home for data that has nowhere better to go. It is the database equivalent of a God Object in object-oriented programming: a single class that knows too much and does too much.

THE GOD TABLE IN THE WILD

ThunderScan regularly encounters tables with 80–200+ columns in production SaaS databases. One common variant: a users table that contains not just user identity, but also billing address, Stripe subscription state, Salesforce CRM ID, last login device fingerprint, product feature flags, gamification points, email notification preferences, and A/B test cohort assignments — all in a single 120-column table. This table is queried by every feature, indexed in 40 different ways, and impossible to change without a full regression suite.

Why God Tables Form

God tables do not appear overnight. They grow through predictable engineering shortcuts:

Growth Pattern | How It Happens | Long-Term Cost
Feature Gravity | "Users already exist, let's add the new column to users" | Table lock contention on every schema change
Migration Laziness | Adding nullable columns is easier than creating a new table | Most queries fetch columns they don't need
No Schema Review | PRs add columns without architectural oversight | No clear entity boundary; impossible to test in isolation
Copy-Paste Seeding | A wide seed CSV is imported and the columns become permanent | Structural redundancy with no owner
NoSQL Migration | MongoDB document with 60 fields is naively mapped to a 60-column table | Entire document loaded for every access pattern

Decomposing a God Table

The remedy is entity extraction: identify distinct concerns within the God Table and decompose them into separate tables connected by a shared primary key (1-to-1 relationship). This is called vertical partitioning.

-- ✗ God Table: 'users' with 80+ columns covering five distinct concerns
-- users (id, email, ..., stripe_customer_id, stripe_subscription_status,
--        salesforce_lead_id, billing_address_line1, ..., feature_flag_dark_mode,
--        feature_flag_ai_assistant, notification_email, notification_slack, ...)

-- ✓ Decomposed into single-responsibility tables
CREATE TABLE users (                -- identity only
    id         UUID PRIMARY KEY,
    email      TEXT UNIQUE NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE user_billing (         -- billing concern
    user_id             UUID PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
    stripe_customer_id  TEXT,
    stripe_subscription TEXT,
    billing_address     JSONB
);

CREATE TABLE user_crm (             -- CRM integration
    user_id            UUID PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
    salesforce_lead_id TEXT,
    hubspot_contact_id TEXT
);

CREATE TABLE user_feature_flags (   -- feature flags: one row per flag
    user_id   UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    flag_name TEXT NOT NULL,
    enabled   BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (user_id, flag_name)
);

CREATE TABLE user_notifications (   -- notification preferences
    user_id UUID PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
    email   BOOLEAN NOT NULL DEFAULT true,
    slack   BOOLEAN NOT NULL DEFAULT false,
    webhook TEXT
);
BENEFITS OF DECOMPOSITION

Each table now has a single owner and a single reason to change. Feature flag changes never lock the billing table. Billing schema changes are isolated from identity. Queries fetch only what they need. New engineers can understand each table in isolation. And AI tools can accurately classify entity types — a prerequisite for any LLM-powered schema analysis or feature engineering pipeline.

7 Common Schema Design Pitfalls

Beyond God Tables, ThunderScan detects a consistent taxonomy of seven schema design failures across production databases. Each one has a precise cause, a detectable signature, and a corrective action.

1. Using VARCHAR(255) Everywhere (Type Laziness)

Defaulting every text column to VARCHAR(255) is a cargo-cult habit inherited from early MySQL, where string lengths up to 255 fit in a one-byte length prefix. In PostgreSQL the declared length confers no storage benefit — it communicates nothing about the domain, enforces an arbitrary limit rather than a meaningful one, and signals that no one thought carefully about the data model. Use TEXT for unconstrained strings and CHECK constraints or ENUM types for domain-constrained values.

FIX
Use email TEXT CHECK (email ~* '^[^@]+@[^@]+$'), status TEXT CHECK (status IN ('active','inactive','pending')), and domain-specific lengths like phone_e164 VARCHAR(16) where the constraint is meaningful.
2. Storing Flags as Separate Boolean Columns (Schema Sprawl)

A table with is_admin, is_verified, is_beta_user, is_churned, is_trial, has_mobile_app as separate boolean columns will need a schema migration every time a new state is added. This pattern leads directly to 200-column God Tables. Instead, use a separate flags/features table (as shown above) or a JSONB column for dynamic flags, depending on whether you need to query by flag.

FIX
Use user_feature_flags (user_id, flag_name, enabled, granted_at) for queryable flags, or metadata JSONB for non-queryable attributes that are only read in aggregate.
3. Using Integer Auto-Increment PKs Across Microservices (Distributed Systems Trap)

Sequential integer primary keys expose your data volume to competitors (a new order with id=10043 implies your previous order was 10042 — revealing business metrics). More critically, in a distributed system or microservices architecture, sequential integers from two databases will collide. Use UUID (v4 for randomness, v7 for time-ordered) or ULID as primary keys for any entity that crosses service boundaries.

FIX
id UUID DEFAULT gen_random_uuid() PRIMARY KEY in PostgreSQL 13+. For time-ordered UUIDs (better index performance), use UUIDv7 via a custom function or the pg_uuidv7 extension.
4. Storing Derived Data as Columns (Computed State Problem)

Columns like total_orders INT on a users table are derived from counting the orders table. Unless there is a measured performance reason and a rigorous update strategy, storing computed aggregates is an update anomaly waiting to happen. The count column drifts from reality the moment any insert or delete bypasses the update logic — which happens inevitably via migrations, backfills, or direct SQL access.

FIX
Use SQL views, materialized views (refreshed on a schedule), or application-layer caching (Redis) for aggregates. Only store derived data in the base table when a generated column (GENERATED ALWAYS AS) or a trigger with a comprehensive test suite can guarantee correctness.
5. Polymorphic Foreign Keys and the EAV Pattern (False Flexibility)

The Entity-Attribute-Value (EAV) pattern and its cousin, the polymorphic association (a parent_type TEXT, parent_id UUID pair as a "foreign key" to any table), are attempts to build schema flexibility that instead build query complexity and destroy referential integrity. You cannot create a database-level FK constraint on a polymorphic parent. You lose all type safety, index efficiency, and query planner optimisation.

FIX
Use table inheritance, JSONB columns for dynamic attributes, or explicit join tables per entity type (post_comments, video_comments instead of a polymorphic comments table). For truly dynamic schemas, consider JSONB with GIN indexes over the EAV pattern.
6. Missing Created/Updated Timestamps (Temporal Blindness)

A table without created_at and updated_at is an analytical dead end. You cannot answer "how many users signed up last week?", "which records changed since my last sync?", or "what was the state of this entity at time T?" These columns cost nothing to add at creation and are extraordinarily expensive to reconstruct from audit logs or WAL after the fact.

FIX
Make created_at TIMESTAMPTZ DEFAULT now() NOT NULL and updated_at TIMESTAMPTZ DEFAULT now() NOT NULL mandatory in every table's base definition. Enforce updated_at via a trigger or an ORM lifecycle hook — never trust application code alone.
7. Case-Insensitive String Lookups Without Proper Collation (Silent Data Duplication)

Without explicit collation or normalization, user@Example.COM and user@example.com are treated as different email addresses by a UNIQUE constraint in most databases. This creates duplicate accounts, broken login flows, and compliance issues. Similarly, names stored with inconsistent casing prevent meaningful deduplication and create AI training noise.

FIX
Normalize at insert time: store emails as LOWER(TRIM(email)). For case-insensitive unique constraints in PostgreSQL, use CREATE UNIQUE INDEX ON users (LOWER(email)). For full case-insensitive collation, use the citext extension.
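The expression-index fix can be exercised end to end with Python's built-in sqlite3, which supports unique indexes on expressions much as PostgreSQL does (illustrative schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
# Uniqueness on the *normalized* value, not the raw string:
con.execute("CREATE UNIQUE INDEX uq_users_email ON users (LOWER(email))")

con.execute("INSERT INTO users (email) VALUES ('user@Example.COM')")

duplicate_rejected = False
try:
    # Same address, different casing: the expression index catches it.
    con.execute("INSERT INTO users (email) VALUES ('user@example.com')")
except sqlite3.IntegrityError:
    duplicate_rejected = True
print("duplicate rejected:", duplicate_rejected)
```

A plain UNIQUE constraint on the raw column would have accepted both rows, creating exactly the duplicate-account failure described above.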

Best Practices Playbook

The following practices represent the distilled engineering standard for production relational schema design — the habits that differentiate teams that scale cleanly from those that spend their third year rewriting their data model.

Schema Integrity Rules

  • Enforce all FKs at the database level, not application level
  • Use NOT NULL as the default; nullable columns require explicit justification
  • Add CHECK constraints for domain-restricted values (status, type enums)
  • Never allow application writes to bypass constraints via raw SQL without review
  • Test cascade strategies in staging before deploying to production
  • Audit FK coverage as part of every database review cycle

Entity Design Rules

  • One table = one entity = one business concept
  • Tables with > 30 columns are candidates for vertical decomposition
  • Never co-locate billing, identity, and preferences in one table
  • Use junction tables for many-to-many relationships — never comma-separated IDs
  • Separate mutable state from immutable event history
  • Use UUID/ULIDs for cross-service entities; BIGSERIAL for internal-only tables

Migration & Change Management

  • Every schema change goes through a migration file — never hand-edit production
  • Add columns as nullable, then backfill, then add NOT NULL constraint
  • Use concurrent index creation (CREATE INDEX CONCURRENTLY) to avoid locks
  • Test migrations on a production-sized data clone before deployment
  • Keep a rollback migration for every forward migration
  • Review migration PRs with the same rigour as application code PRs

Performance Without Compromising Integrity

  • Profile before denormalizing — most "slow joins" are actually missing indexes
  • Use materialized views for expensive aggregations, not derived columns
  • Separate OLAP workloads to a replica or dedicated warehouse — never run analytics on OLTP primary
  • Use partial indexes for high-selectivity filters (WHERE deleted_at IS NULL)
  • Partition large tables by date range, not by adding columns
  • Document every intentional denormalization with a comment and a Jira link
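The partial-index recommendation above can be verified directly: most engines, SQLite included, will use a partial index whenever the query's predicate implies the index's predicate. A sketch using Python's built-in sqlite3 (illustrative table names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entities (
    id              INTEGER PRIMARY KEY,
    organization_id INTEGER NOT NULL,
    deleted_at      TEXT                -- NULL = live row
);
-- Partial index: only live rows are indexed, so the index stays small
-- and matches the hot access pattern exactly.
CREATE INDEX idx_entities_org ON entities(organization_id)
    WHERE deleted_at IS NULL;
""")

# The query repeats the index's WHERE clause, so the planner can use it.
plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM entities WHERE organization_id = 7 AND deleted_at IS NULL"
).fetchall()
uses_index = any("idx_entities_org" in row[-1] for row in plan)
print(uses_index)
```

Drop the deleted_at IS NULL condition from the query and the partial index becomes unusable, which is why the soft-delete filter belongs in a view or policy rather than scattered across application code.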

The Base Table Template Every Table Should Start From

-- Every table in your schema should begin with this foundation
CREATE TABLE entities (
    -- Identity: use UUID for cross-service entities
    id UUID DEFAULT gen_random_uuid() PRIMARY KEY,

    -- Multi-tenancy: always scope to an organization if applicable
    organization_id UUID NOT NULL REFERENCES organizations(id) ON DELETE RESTRICT,

    -- Domain columns go here — typed precisely, constrained explicitly
    name   TEXT NOT NULL CHECK (length(trim(name)) > 0),
    status TEXT NOT NULL DEFAULT 'active'
           CHECK (status IN ('active', 'inactive', 'archived')),

    -- Audit columns: mandatory on every table
    created_by UUID REFERENCES users(id) ON DELETE SET NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),

    -- Soft delete: use when hard deletes would create orphan risk
    deleted_at TIMESTAMPTZ  -- NULL = not deleted
);

-- Always index the FK columns and the soft-delete filter
CREATE INDEX idx_entities_org ON entities(organization_id)
    WHERE deleted_at IS NULL;

-- Auto-update updated_at on every write
CREATE OR REPLACE FUNCTION touch_updated_at() RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_entities_updated_at
    BEFORE UPDATE ON entities
    FOR EACH ROW EXECUTE FUNCTION touch_updated_at();

Schema Health Checklist

Use this checklist as a pre-launch gate and a quarterly schema review framework. ThunderScan runs the equivalent of this checklist automatically across your full production schema.

Integrity Gates

  • All FK relationships enforced at DB level
  • No tables with zero foreign keys (except root entities)
  • All FK columns have a supporting index
  • Cascade strategies explicitly documented
  • Zero orphan rows (validated by LEFT JOIN check)
  • CHECK constraints on all status/type columns
  • UNIQUE constraints on natural keys (email, slug)
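
The orphan-row gate above reduces to a LEFT JOIN probe. A minimal sketch against the entities/organizations pair from the base table template, useful when validating data that predates FK enforcement (a nonzero count means orphans exist):

-- Orphan check: child rows whose parent no longer exists.
-- Repeat per FK pair; a properly enforced FK makes this return 0.
SELECT count(*) AS orphan_rows
FROM entities e
LEFT JOIN organizations o ON o.id = e.organization_id
WHERE o.id IS NULL;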

Normalization Gates

  • No table exceeds 30 columns without explicit justification
  • No comma-separated values in any column
  • No repeated column groups (col1, col2, col3 for same concept)
  • All many-to-many relationships use junction tables
  • No polymorphic FKs without compensating design
  • Boolean flag columns (3+) extracted to flag table
  • No derived/computed columns without trigger or generated column
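
The junction-table gate in practice: instead of a comma-separated ID column, the relationship becomes its own table with full FK enforcement and a composite primary key. A sketch using hypothetical teams and users tables (names illustrative):

-- Junction table: each row is exactly one (team, user) membership
CREATE TABLE team_members (
    team_id   UUID NOT NULL REFERENCES teams(id) ON DELETE CASCADE,
    user_id   UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    joined_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (team_id, user_id)   -- duplicates are structurally impossible
);

-- Supporting index for the reverse access path (look up teams by user)
CREATE INDEX idx_team_members_user ON team_members(user_id);

Note that both access directions are indexed: the primary key covers team-to-user lookups, and the second index covers user-to-team lookups, satisfying the FK-index gate above.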

Temporal & Audit Gates

  • Every table has created_at TIMESTAMPTZ NOT NULL
  • Every mutable table has updated_at with trigger
  • Soft-delete pattern uses deleted_at (not is_deleted boolean)
  • Soft-delete filter enforced at DB layer (view or RLS)
  • Audit log table present for regulated entities
  • No DATETIME without timezone (always TIMESTAMPTZ)
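
One way to enforce the soft-delete filter at the DB layer, as the gate above requires, is a view that application roles query instead of the base table (a sketch building on the entities template; row-level security is the stricter alternative):

-- Live rows only; application code reads this, never the base table
CREATE VIEW entities_live AS
SELECT * FROM entities
WHERE deleted_at IS NULL;

-- "Delete" becomes an UPDATE, preserving the row for audit and FK integrity
-- (illustrative statement; in practice wrapped in a function or API layer)
UPDATE entities
SET deleted_at = now()
WHERE id = '00000000-0000-0000-0000-000000000000';  -- placeholder id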

Performance Gates

  • No analytical queries running against OLTP primary
  • All high-selectivity WHERE filters have a supporting index
  • No unused indexes (wasted write overhead)
  • Partial indexes used for filtered access patterns
  • Large tables (>10M rows) have a partition strategy
  • No sequential scans on production tables >100K rows
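
PostgreSQL's statistics views make the unused-index and sequential-scan gates directly queryable. A sketch (counters reset with pg_stat_reset, and replicas keep separate stats, so interpret with care):

-- Indexes never scanned since the last stats reset: removal candidates
SELECT schemaname, relname AS table_name, indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

-- Large tables taking more sequential scans than index scans
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan
  AND n_live_tup > 100000
ORDER BY seq_tup_read DESC;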

Conclusion

Database normalization is not a theoretical formalism that engineers graduate beyond once they have shipped a production system. It is the ongoing discipline that determines whether your application accumulates data debt or data capital. Every orphan row is a broken promise to a user. Every God Table is a tax levied on every engineer who touches the codebase. Every missing foreign key is an integrity guarantee that does not exist — only an assumption that the application will compensate.

The teams that treat schema design with the same rigour they apply to API design, system architecture, and security posture are the teams that can iterate faster in year three than in year one — because their data model remains a firm foundation rather than a shifting floor. Conversely, the teams that treat the database as a simple persistence layer, adding columns wherever convenient, disabling constraints for speed, and deferring normalization to "later" are the teams that spend years three through five rewriting instead of building.

The Four Principles to Take Away

1
Normalize by default. Denormalize by exception, with documentation, compensation, and a measured performance rationale.
2
Enforce integrity at the database layer. Application code is too unreliable to be the sole guardian of referential correctness.
3
Separate OLTP from OLAP. A schema that tries to serve both workloads optimally serves neither — architect the boundary explicitly.
4
Schema review is architecture review. Treat every ALTER TABLE migration PR with the same scrutiny as a core API change.

Thor ThunderScan was built on the premise that most database problems are preventable if caught at the schema layer, before they compound into application bugs, data corruption, and compliance failures. The checklist and patterns in this article are a reflection of the signals ThunderScan detects every day across real production databases — the same signals that separate a healthy data foundation from a silent liability.

Run ThunderScan on Your Schema

ThunderScan performs automated normalization analysis, orphan detection, FK coverage auditing, and OLTP/OLAP separation checks across your full production schema in minutes.

Start Your Free Scan

Sources & Further Reading

  1. Codd, E.F. (1970). "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM, 13(6), 377–387.
  2. IBM/Redman, T.C. (2016). "Bad Data Costs the U.S. $3 Trillion Per Year." Harvard Business Review; updated: IBM Institute for Business Value. (2025). "The True Cost of Poor Data Quality."
  3. Percona. (2024/2025). "State of Open Source Database Management 2025 — Annual Survey (covering 2024 data)."
  4. McKinsey Digital. (2022). "Demystifying digital dark matter: A new standard to tame technical debt."
  5. PostgreSQL Global Development Group. (2025). "PostgreSQL Documentation: Constraints."
  6. Kimball, R. & Ross, M. (2013). The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Ed. Wiley.
  7. Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media.
  8. AICPA-CIMA. (2022). 2017 Trust Services Criteria with Revised Points of Focus (2022). (Operative SOC 2 standard; no "2025 AI update" has been officially published.)

Further Reading from the Author

Isaac Shi writes about AI, software, and entrepreneurship at isaacshi.com. These essays provide the strategic and philosophical context behind this thesis.

Essay · Isaac Shi
Let There Be Light
Data without structure is "without form and void" — the case for why schema discipline is the foundation of business intelligence.
Essay · Isaac Shi
Cognitive Biases, Data & Decision Making
Good schema design requires deliberate System 2 thinking — Kahneman's framework for making sound analytical decisions under uncertainty.
Found this useful?
Share with your team and help more engineers build clean, scalable schemas.
© 2026 Thor ThunderScan