
Database Indexes: The Optimization Most Developers Do Once and Never Revisit

Every developer knows that indexes speed up queries. Far fewer understand why some indexes don’t get used, why certain queries stay slow despite having indexes, how multi-column indexes really work, or when an index makes your application slower. This guide goes beyond “add an index on the WHERE clause column” to cover the mechanics and trade-offs that let you make intelligent decisions about indexing strategy.

How Indexes Actually Work: The B-Tree

The default index type in Postgres, MySQL, and most relational databases is a B-tree (balanced tree). Understanding its structure explains most indexing behavior.

A B-tree index is a sorted data structure where each node contains keys and pointers to child nodes or heap pages. When you query WHERE user_id = 4821, the database traverses the tree from root to leaf in O(log n) time rather than scanning every row in O(n) time.

Key implications:

  • B-tree indexes support equality (=), range (<, >, BETWEEN), and prefix matching (LIKE 'foo%')
  • They do NOT support suffix matching (LIKE '%foo') — the index is sorted from the left of the string, not the right (a common workaround is an expression index on reverse(column))
  • They support ORDER BY — if you index on a column and sort by it, the database can return rows in index order without a sort step
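Because a B-tree is essentially a sorted structure, you can see why each of these operations works (and why suffix matching doesn't) with nothing more than a sorted list and binary search. A minimal Python sketch, not real B-tree code:

```python
import bisect

# A sorted list stands in for the B-tree's ordered keys.
keys = sorted(["apple", "banana", "foobar", "foobaz", "grape", "melon"])

def equality(key):
    """Equality lookup: binary search, O(log n)."""
    i = bisect.bisect_left(keys, key)
    return i < len(keys) and keys[i] == key

def range_scan(lo, hi):
    """Range lookup: binary-search the start, then walk the sorted keys."""
    return keys[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]

def prefix_scan(prefix):
    """LIKE 'foo%': a prefix defines one contiguous range in sort order."""
    return keys[bisect.bisect_left(keys, prefix):bisect.bisect_left(keys, prefix + "\xff")]

# LIKE '%foo' has no contiguous range in this ordering — every key would
# have to be examined, which is exactly a sequential scan.

print(equality("grape"))     # True
print(range_scan("b", "g"))  # ['banana', 'foobar', 'foobaz']
print(prefix_scan("foo"))    # ['foobar', 'foobaz']
```

The prefix trick — turning `LIKE 'foo%'` into the range `['foo', 'foo\xff')` — is essentially what the planner does when it rewrites a prefix match into a B-tree range scan.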

Reading EXPLAIN ANALYZE: The Skill That Changes Everything

Before adding any index, understand why the existing query is slow. EXPLAIN ANALYZE shows the actual execution plan:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, o.total, u.email
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.status = 'pending'
  AND o.created_at > NOW() - INTERVAL '7 days'
ORDER BY o.created_at DESC
LIMIT 50;

Sample output and how to read it:

Limit  (cost=1842.34..1842.47 rows=50 width=52)
       (actual time=28.432..28.441 rows=50 loops=1)
  ->  Sort  (cost=1842.34..1842.62 rows=111 width=52)
             (actual time=28.430..28.434 rows=50 loops=1)
        Sort Key: o.created_at DESC
        Sort Method: top-N heapsort  Memory: 30kB
        ->  Hash Join  (cost=412.50..1838.71 rows=111 width=52)
                        (actual time=5.231..27.891 rows=423 loops=1)
              Hash Cond: (o.user_id = u.id)
              Buffers: shared hit=891 read=243
              ->  Seq Scan on orders o
                    (cost=0.00..1422.17 rows=111 width=36)
                    (actual time=0.012..22.341 rows=423 loops=1)
                    Filter: ((status = 'pending') AND
                             (created_at > (now() - '7 days'::interval)))
                    Rows Removed by Filter: 38291

Red flags to look for:

  • Seq Scan on a large table — the database is reading every row
  • Rows Removed by Filter far larger than rows returned — the scan reads and discards most of the table, a strong candidate for an index
  • Buffers: a large read= count — pages are coming from disk rather than the cache
  • Estimated rows far from actual rows (here rows=111 estimated vs rows=423 actual) — planner statistics are stale; run ANALYZE
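The before/after effect of adding an index is easy to reproduce in miniature with SQLite's EXPLAIN QUERY PLAN, a lighter-weight cousin of Postgres's EXPLAIN ANALYZE (table and index names here are illustrative, not from the article's schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (status, total) VALUES (?, ?)",
    [("pending" if i % 100 == 0 else "completed", i * 1.0) for i in range(10_000)])

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the step taken
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT id, total FROM orders WHERE status = 'pending'"

before = plan(query)   # no index on status: full table scan
conn.execute("CREATE INDEX orders_status_idx ON orders (status)")
after = plan(query)    # same query now searches the index

print(before)  # e.g. 'SCAN orders'
print(after)   # e.g. 'SEARCH orders USING INDEX orders_status_idx (status=?)'
```

The workflow is the same in Postgres: capture the plan, add the index, capture the plan again, and confirm the scan node actually changed.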

Multi-Column Indexes: Column Order Is Everything

A multi-column index on (status, created_at) is NOT the same as one on (created_at, status). The leftmost prefix rule determines which queries can use the index.

-- This index:
CREATE INDEX orders_status_created_idx ON orders (status, created_at DESC);

-- Can be used for:
WHERE status = 'pending'                              -- uses index
WHERE status = 'pending' AND created_at > '2026-01-01' -- uses index (both columns)
WHERE status = 'pending' ORDER BY created_at DESC     -- uses index, avoids sort

-- CANNOT be used for:
WHERE created_at > '2026-01-01'                       -- skips first column
WHERE status LIKE '%pending%'                          -- not a prefix search

The rule of thumb for column ordering in multi-column indexes:

  1. Equality conditions first (columns you filter with =)
  2. Range conditions last (columns you filter with <, >, BETWEEN)
  3. Sort columns at the end if you want index-ordered results
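The leftmost prefix rule is easy to verify empirically. A small SQLite check (SQLite's B-tree planner applies the same rule as Postgres's; schema and names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, status TEXT, created_at TEXT, total REAL)""")
conn.execute("CREATE INDEX orders_status_created_idx ON orders (status, created_at)")
conn.executemany(
    "INSERT INTO orders (status, created_at, total) VALUES (?, ?, ?)",
    [("pending" if i % 10 == 0 else "completed", f"2026-01-{i % 28 + 1:02d}", 10.0)
     for i in range(1000)])

def plan(sql):
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Leading column present: the composite index is usable
with_prefix = plan(
    "SELECT total FROM orders WHERE status = 'pending' AND created_at > '2026-01-15'")

# Leading column missing: the index cannot serve this predicate
without_prefix = plan(
    "SELECT total FROM orders WHERE created_at > '2026-01-15'")

print(with_prefix)     # searches orders_status_created_idx
print(without_prefix)  # falls back to a full scan
```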

Partial Indexes: Index Only What You Query

Partial indexes only include rows matching a condition. They’re smaller, faster to update, and often dramatically more selective:

-- Full index on created_at — includes 'completed', 'cancelled', 'refunded' rows
-- But 99% of queries only look at 'pending' orders
CREATE INDEX orders_created_idx ON orders (created_at DESC);

-- Partial index — only indexes the rows you actually query
CREATE INDEX orders_pending_created_idx ON orders (created_at DESC)
  WHERE status = 'pending';

-- The planner uses this index only when the query's WHERE clause implies the index predicate
SELECT * FROM orders
WHERE status = 'pending' AND created_at > NOW() - INTERVAL '7 days'
ORDER BY created_at DESC;

If 1% of your orders are pending and 99% are completed, a partial index on pending orders is roughly 100x smaller than a full one, and it only has to be maintained when a pending row is inserted, updated, or deleted.
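SQLite supports the same partial-index syntax, which makes the planner's behavior easy to demonstrate end to end (a sketch with illustrative names, standing in for the Postgres example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, status TEXT, created_at TEXT)""")
conn.executemany(
    "INSERT INTO orders (status, created_at) VALUES (?, ?)",
    [("pending" if i % 100 == 0 else "completed", f"2026-{i % 12 + 1:02d}-01")
     for i in range(5000)])

# Partial index: only the ~1% of rows that are pending get indexed
conn.execute("""CREATE INDEX orders_pending_created_idx ON orders (created_at)
                WHERE status = 'pending'""")

# The query's WHERE clause implies the index predicate, so the planner can use it
plan = " | ".join(row[-1] for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders "
    "WHERE status = 'pending' AND created_at > '2026-06-01'"))

print(plan)  # the partial index is chosen
```

If you drop the `status = 'pending'` condition from the query, the planner can no longer prove the predicate holds and the partial index becomes unusable — the same implication rule Postgres applies.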

Covering Indexes: Eliminating the Table Heap Fetch

After an index lookup, the database typically has to fetch the actual row from the table heap to get non-indexed columns. A covering index includes all columns needed by the query, eliminating this second fetch:

-- Query fetching user_id, status, total, created_at from orders
-- Regular index on user_id requires a heap fetch for other columns
CREATE INDEX orders_user_idx ON orders (user_id);

-- Covering index (PostgreSQL 11+ INCLUDE syntax): all needed columns included,
-- so the query can be satisfied entirely from the index. INCLUDE columns are
-- stored in the leaf pages but are not part of the search key.
CREATE INDEX orders_user_covering_idx
  ON orders (user_id)
  INCLUDE (status, total, created_at);

-- EXPLAIN output showing an index-only scan (no heap fetch):
Index Only Scan using orders_user_covering_idx on orders
  (cost=0.43..8.45 rows=1 width=32)
  (actual time=0.021..0.023 rows=3 loops=1)
  Index Cond: (user_id = 4821)
  Heap Fetches: 0  -- <-- Zero table fetches
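SQLite has no INCLUDE clause, but a plain multi-column index behaves the same way: when every selected column lives in the index, the plan reports a covering index and never touches the table. A small sketch (names illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT, total REAL)""")
conn.executemany(
    "INSERT INTO orders (user_id, status, total) VALUES (?, ?, ?)",
    [(i % 500, "pending", i * 1.0) for i in range(5000)])

# Every column the query needs is in the index, so no table fetch is required
conn.execute(
    "CREATE INDEX orders_user_covering_idx ON orders (user_id, status, total)")

plan = " | ".join(row[-1] for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT status, total FROM orders WHERE user_id = 42"))

print(plan)  # e.g. 'SEARCH orders USING COVERING INDEX orders_user_covering_idx (user_id=?)'
```

One Postgres-specific caveat: Heap Fetches stays at zero only while the visibility map is current, so a freshly loaded or heavily updated table may need a VACUUM before index-only scans deliver their full benefit.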

Index Types Beyond B-Tree

GIN Indexes: Arrays, JSONB, and Full-Text Search

-- GIN index for JSONB containment queries
CREATE INDEX events_payload_gin ON events USING GIN (payload);

-- Supported operators
SELECT * FROM events WHERE payload @> '{"type": "purchase"}';
SELECT * FROM events WHERE payload ? 'user_id';

-- GIN for full-text search
CREATE INDEX articles_search_gin ON articles USING GIN (to_tsvector('english', content));
SELECT * FROM articles WHERE to_tsvector('english', content) @@ to_tsquery('english', 'database & index');

BRIN Indexes: Huge Append-Only Tables

-- BRIN (Block Range INdex) for naturally ordered time-series data
-- Tiny index size: one entry per block range, not per row
CREATE INDEX events_created_brin ON events USING BRIN (created_at)
  WITH (pages_per_range = 128);

-- 10x-100x smaller than B-tree for append-only time-series
-- Works because rows are physically stored in timestamp order
-- Only useful when physical order correlates with the indexed column

When Indexes Make Things Slower

Every index has a write cost. Each INSERT, UPDATE, or DELETE must update all applicable indexes. In write-heavy scenarios:

  • A table with 10 indexes takes ~10x longer to bulk-load than a table with no indexes
  • High-frequency UPDATEs to indexed columns create index bloat over time
  • Too many indexes on an OLTP table can make writes the bottleneck

-- When bulk-loading data, drop the indexes first
DROP INDEX orders_status_created_idx;
DROP INDEX orders_user_covering_idx;

-- Load data
COPY orders FROM '/tmp/orders_backup.csv' WITH CSV;

-- Rebuild indexes after (faster than incremental updates during load)
CREATE INDEX CONCURRENTLY orders_status_created_idx ON orders (status, created_at DESC);
CREATE INDEX CONCURRENTLY orders_user_covering_idx ON orders (user_id) INCLUDE (status, total, created_at);

Use CREATE INDEX CONCURRENTLY to build indexes without locking the table.
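The write overhead is easy to observe even in an embedded engine. A rough, illustrative benchmark (absolute timings will vary by machine; the point is the relative gap between a table with no secondary indexes and one with three):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plain (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("CREATE TABLE indexed (a INTEGER, b INTEGER, c INTEGER)")
for col in ("a", "b", "c"):
    # Three secondary indexes that every INSERT into 'indexed' must update
    conn.execute(f"CREATE INDEX indexed_{col}_idx ON indexed ({col})")

rows = [(i, i * 2, i * 3) for i in range(50_000)]

def load(table):
    """Bulk-insert the same rows in one transaction and time it."""
    start = time.perf_counter()
    with conn:
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    return time.perf_counter() - start

t_plain = load("plain")
t_indexed = load("indexed")
print(f"no indexes: {t_plain:.3f}s, three indexes: {t_indexed:.3f}s")
```

The same asymmetry is why the drop-load-rebuild pattern above pays off: building an index once over sorted bulk data is much cheaper than updating it row by row during the load.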

Finding Unused Indexes

-- Find indexes that haven't been used since last statistics reset
SELECT
  schemaname,
  relname AS tablename,
  indexrelname AS indexname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan AS times_used
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelid NOT IN (
    SELECT conindid FROM pg_constraint  -- Don't drop constraint indexes
  )
ORDER BY pg_relation_size(indexrelid) DESC;

Unused indexes are pure overhead — they slow down writes without speeding up reads. Drop them.

The Indexing Workflow

  1. Identify slow queries from your query monitoring (pg_stat_statements, Datadog, etc.)
  2. Run EXPLAIN ANALYZE to understand the execution plan
  3. Add the most selective index (or partial index) for the query pattern
  4. Use CREATE INDEX CONCURRENTLY in production
  5. Verify the plan changed with EXPLAIN ANALYZE again
  6. Monthly: audit unused indexes and drop them

Good indexing is an ongoing practice, not a one-time setup. Query patterns change, data distributions shift, and indexes that made sense at launch become liabilities as your data grows. Treat your index inventory like any other part of your schema: review it regularly, and remove what isn’t earning its keep.

By Michael Sun

Founder and Editor-in-Chief of NovVista. Software engineer with hands-on experience in cloud infrastructure, full-stack development, and DevOps. Writes about AI tools, developer workflows, server architecture, and the practical side of technology. Based in China.
