Database Design and Optimization Guide

Effective database design and optimization are crucial for application performance and scalability. This guide covers everything from fundamental design principles to advanced optimization techniques, helping you build efficient, maintainable database systems.

1. Database Design Fundamentals

Normalization vs. Denormalization

Understanding when to normalize and when to denormalize is critical for optimal database design:

When to Normalize:

High data integrity requirements (financial systems)
Frequent updates and deletes
Storage space is limited
Data consistency is paramount

When to Denormalize:

Read-heavy workloads (analytics, reporting)
Performance is critical
Data is relatively static
Complex join operations are frequent

Entity Relationship Design

Start with a well-designed entity relationship model to avoid future performance issues:

-- Example: E-commerce database design

-- Users table (normalized)
CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    
    INDEX idx_email (email),
    INDEX idx_created_at (created_at)
);

-- Products table
CREATE TABLE products (
    id BIGINT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    description TEXT,
    price DECIMAL(10,2) NOT NULL,
    category_id BIGINT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    
    INDEX idx_category (category_id),
    INDEX idx_price (price),
    INDEX idx_name (name),
    FOREIGN KEY (category_id) REFERENCES categories(id)
);

-- Orders table (denormalized for performance)
CREATE TABLE orders (
    id BIGINT PRIMARY KEY,
    user_id BIGINT NOT NULL,
    user_email VARCHAR(255), -- Denormalized for faster queries
    total_amount DECIMAL(10,2) NOT NULL,
    status ENUM('pending', 'processing', 'shipped', 'delivered') DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    
    INDEX idx_user_id (user_id),
    INDEX idx_status (status),
    INDEX idx_created_at (created_at),
    INDEX idx_user_status (user_id, status),
    FOREIGN KEY (user_id) REFERENCES users(id)
);

2. Indexing Strategies

Proper indexing is one of the most effective ways to improve database performance. Understanding different index types and when to use them is essential.

Index Types and Use Cases

Index Type	Best For	Example
B-Tree (Default)	Range queries, ordering, equality	WHERE age BETWEEN 18 AND 65
Hash Index	Exact equality lookups	WHERE user_id = 12345
Full-Text Index	Text search queries	MATCH(title) AGAINST('search term')
Partial Index	Conditional indexing	WHERE status = 'active'
Composite Index	Multiple column queries	WHERE user_id = ? AND status = ?

Composite Index Optimization

The order of columns in composite indexes matters significantly for performance:

-- Rule: Most selective column first
-- If you frequently query by:
-- 1. user_id AND status (high selectivity)
-- 2. user_id only
-- 3. status only (low selectivity)

-- Good: Most selective column first
CREATE INDEX idx_user_status ON orders (user_id, status);

-- This index can efficiently handle:
SELECT * FROM orders WHERE user_id = 12345 AND status = 'pending';  -- Uses full index
SELECT * FROM orders WHERE user_id = 12345;                         -- Uses partial index

-- Less optimal
CREATE INDEX idx_status_user ON orders (status, user_id);

-- This would be less efficient for user_id-only queries

-- Cardinality example:
-- user_id: 1,000,000 unique values (high selectivity)
-- status: 5 unique values (low selectivity)
-- created_at: High selectivity for recent dates

Index Maintenance Tips

Monitor index usage with EXPLAIN ANALYZE
Remove unused indexes to improve write performance
Consider partial indexes for filtered queries
Update table statistics regularly for optimal query planning

3. Query Optimization Techniques

Query Analysis with EXPLAIN

Use EXPLAIN to understand how your database executes queries and identify optimization opportunities:

-- Analyze query execution plan
EXPLAIN ANALYZE 
SELECT u.email, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at >= '2024-01-01'
GROUP BY u.id, u.email
HAVING COUNT(o.id) > 5;

-- Output analysis points:
-- 1. Scan type: Index Scan vs Sequential Scan vs Nested Loop
-- 2. Estimated vs Actual rows
-- 3. Cost estimates
-- 4. Join algorithms used
-- 5. Index usage

-- Example output interpretation:
-- Hash Join (cost=1234.56..5678.90 rows=100 width=64)
--   Hash Cond: (o.user_id = u.id)
--   ->Seq Scan on orders o (cost=0.00..2345.67 rows=50000 width=16)
--   ->Hash (cost=567.89..567.89 rows=1000 width=52)
--        ->Index Scan using idx_created_at on users u

Common Query Patterns and Optimizations

Problematic Query Patterns

-- N+1 Query Problem
SELECT * FROM users;
-- Then for each user:
SELECT * FROM orders WHERE user_id = ?;

-- Function in WHERE clause (prevents index usage)
SELECT * FROM users WHERE YEAR(created_at) = 2024;

-- Leading wildcard (can't use index)
SELECT * FROM products WHERE name LIKE '%phone%';

Optimized Alternatives

-- Single query with JOIN
SELECT u.*, o.id as order_id, o.total_amount
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;

-- Date range (index-friendly)
SELECT * FROM users 
WHERE created_at >= '2024-01-01' 
  AND created_at < '2025-01-01';

-- Full-text search or suffix search
SELECT * FROM products 
WHERE MATCH(name, description) AGAINST('phone' IN NATURAL LANGUAGE MODE);

Subquery vs JOIN Performance

Modern databases often optimize subqueries well, but JOINs are usually more predictable in performance:

-- Subquery (potentially slower)
SELECT * FROM users 
WHERE id IN (
    SELECT DISTINCT user_id 
    FROM orders 
    WHERE created_at >= '2024-01-01'
);

-- Equivalent JOIN (usually faster)
SELECT DISTINCT u.* FROM users u
INNER JOIN orders o ON u.id = o.user_id
WHERE o.created_at >= '2024-01-01';

-- EXISTS vs IN (EXISTS often better for large datasets)
SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o 
    WHERE o.user_id = u.id 
    AND o.created_at >= '2024-01-01'
);

4. Performance Monitoring and Tuning

Key Performance Metrics

Monitor these critical metrics to maintain optimal database performance:

Query Performance

Average query execution time
95th percentile response time
Slow query count
Queries per second

Resource Usage

CPU utilization
Memory usage
Disk I/O operations
Network throughput

Connection Health

Active connections
Connection pool usage
Lock wait time
Deadlock frequency

Database Configuration Tuning

Optimize database configuration parameters based on your workload and hardware:

-- MySQL/InnoDB Configuration Example

# Memory allocation (adjust based on available RAM)
innodb_buffer_pool_size = 4G          # 70-80% of available RAM
query_cache_size = 256M               # For read-heavy workloads
tmp_table_size = 256M
max_heap_table_size = 256M

# Connection settings
max_connections = 200                 # Based on application needs
connect_timeout = 10
wait_timeout = 300

# InnoDB settings for write performance
innodb_log_file_size = 256M
innodb_log_buffer_size = 16M
innodb_flush_log_at_trx_commit = 2    # Trade durability for performance
innodb_flush_method = O_DIRECT

# Query optimization
long_query_time = 1                   # Log queries slower than 1 second
slow_query_log = 1

5. Modern Database Patterns

Read Replicas and Sharding

Scale your database horizontally with read replicas and sharding strategies:

Read Replicas

Route read queries to replicas
Keep writes on primary database
Monitor replication lag
Handle eventual consistency

Horizontal Sharding

Partition data across servers
Choose appropriate shard keys
Avoid cross-shard queries
Plan for resharding

Connection Pooling

Implement connection pooling to manage database connections efficiently:

// Connection pool configuration example (Node.js)
const pool = new Pool({
  host: 'localhost',
  user: 'app_user',
  password: 'secure_password',
  database: 'myapp',
  
  // Pool settings
  min: 5,              // Minimum connections
  max: 20,             // Maximum connections
  acquireTimeoutMillis: 30000,  // Max time to get connection
  idleTimeoutMillis: 30000,     // Max time connection can be idle
  
  // Health checks
  testOnBorrow: true,
  validationQuery: 'SELECT 1',
  
  // Retry logic
  retryDelayMultiplier: 2,
  maxRetries: 3
});

Analyze Your Database Performance

Use our tools to optimize your database queries and design decisions.

JSON Formatter CSV to JSON Base Converter