Why don't multi-column indices help queries on the second column?
SQL databases support multi-column indices, where a single index can incorporate data from multiple columns. These are powerful, but the order of the columns matters, with differences observable in practice: there can be orders of magnitude between two similar & simple queries, depending on which uses a prefix of the index columns. Why does this limitation exist? Can we have an mental model for the indices that explains the behaviour?
Yes, we can: thinking about indices as a sorted table with multiple columns provides that intuition:
- For a query involving the first columns, successive lookups on the relevant subsets of the index always see ordered data, making an efficient binary search possible,
- For a query involving only a later column, that column’s data is arranged haphazardly in the index, meaning there’s no hope of a binary search.
It’s the sorted nature of a database’s typical B-tree index that drives this trade-off.
Context
Imagine we’re building an app for silverware enthusiasts, so we set our table like so:
1
2
3
4
5
6
7
CREATE TABLE cutlery (
id BIGINT PRIMARY KEY,
-- fork, spoon, knife, straw, strork, etc.
kind TEXT,
-- silver, red, puce, etc.
colour TEXT
);
It turns out some people have a lot of forks and spoons, and want to find the item with the exact right shape and colour. Thus, we’ve discovered queries like this are common0 and very slow:
1
2
SELECT * FROM fork
WHERE kind = 'knork' AND colour = 'mahogany';
Changes
Slow query? We need an index. There’s a few options worth considering:
1
2
3
4
5
6
7
8
9
10
11
-- 1. index just kind:
CREATE INDEX idx_cutlery_kind
ON cutlery (kind);
-- 2. index just colour:
CREATE INDEX idx_cutlery_colour
ON cutlery (colour);
-- 3. index both, together
CREATE INDEX idx_cutlery_kind_colour
ON cutlery (kind, colour);
Options 1 & 2 are the simplest, and doing one (or both) may give sufficient performance. But maybe not. Option 3 is a multi-column index, that combines both kind and colour into one data structure. Let’s benchmark1 the options:
| Indexing | Time (ms) | Speedup vs ‘None’ |
|---|---|---|
| None | 420 | 1× |
| Just kind (1) | 4.78 | 88× |
| Just colour (2) | 4.60 | 91× |
| Both (1 + 2) | 1.28 | 328× |
| Multi-column (3) | 0.009 | 46700× |
Looks like we can get two orders of magnitude faster using one or both simple indices, and another two orders faster with the multi-column one. Optimising something to be 50000× faster is a good day’s work!
Confusion
Now, we were hyper-focused on just one particularly slow query, but most apps query their database in more than one way. Here’s two very similar queries, with each dropping one of the two filtering conditions from the WHERE clause:
1
2
3
4
-- kind-only, no colour
SELECT * FROM fork WHERE kind = 'knork';
-- colour-only, no kind
SELECT * FROM fork WHERE colour = 'mahogany';
After adding the multi-column idx_cutlery_kind_colour index, the benchmarks1 prove the kind-only query gets much faster, but the colour-only is still slow2.
| Query | No index (ms) | With index (ms) | Speedup |
|---|---|---|---|
| Both columns | 420 | 0.009 | 46700× |
| Kind-only | 403 | 4.79 | 84× |
| Colour-only | 480 | 492 | 1× |
Both of the “only” queries just drop one and keep one column compared to our original, but there’s a gulf between them.
Why does the multi-column index improve one a lot, and the other not at all? Why is kind-only two orders of magnitude faster than the colour-only with this index?
Callouts
The database docs answer the immediate question. For instance, PostgreSQL (emphasis mine):
the index is most efficient when there are constraints on the leading (leftmost) columns
And, similarly, MySQL (emphasis still mine):
If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to look up rows. For example, if you have a three-column index on
(col1, col2, col3), you have indexed search capabilities on(col1),(col1, col2), and(col1, col2, col3).
Let’s bring the index and the single-column queries together:
1
2
3
4
5
6
CREATE INDEX idx_cutlery_kind_colour
ON cutlery (kind, colour);
-- kind-only, no colour
SELECT * FROM fork WHERE kind = 'knork';
-- colour-only, no kind
SELECT * FROM fork WHERE colour = 'mahogany';
Comparing the queries and the index shows that, yes, indeed:
- The now-faster kind-only query uses the first column
kindof the index, and thus the index is applicable and efficient. - The still-slow colour-only query uses only the second column
colourof the index. This is not using a leftmost prefix of the index’s columns (kindis missing), and thus the index isn’t applicable.
But, why? Why does an index like this have a constraint like that?
Concepts
The default index in most SQL databases is ordered, relying on binary search for efficiency3, getting that O(log n) tastiness.
We can visualise this in an flat ordered table, with the index columns plus a ‘locations’ column4, ordered by the index columns, using each column of the multi-column index as a tie-breaker if the earlier ones are equal. We’re working with text columns, so everything is alphabetical.
Here’s a version for 7 rows, with a few different kind and colour values:
kind |
colour |
locations |
|---|---|---|
| chopsticks | mahogany | 5 |
| chopsticks | zebra | 3 |
| knork | mahogany | 2, 7 |
| knork | salmon | 4 |
| tongs | aqua | 1 |
| tongs | periwinkle | 6 |
Column-lookup: both
This index works well for filtering WHERE kind = 'knork' AND colour = 'mahogany', with two conceptual steps:
-
Do a binary search on the
kindcolumn to find the three knork entries, efficiently skipping over the chopsticks and tongs. The column is ordered, so binary search can work.kindcolourlocations knork mahogany 2, 7 knork salmon 4 -
Do a second binary search on the reduced
colourcolumn to find the knork rows that are also mahogany. Now that we’ve restricted to onlykind = 'knork'rows, the remaining two values are ordered too:'mahogany' < 'salmon'(at least, they are alphabetically, I know nothing of your colour preferences).kindcolourlocations knork mahogany 2, 7
Thus, we’ve done two efficient binary searches to find the rows of interest, at locations 2 & 7.
Column-lookup: kind
Similarly, the index works well for filtering WHERE kind = 'knork'. The process for finding locations 2, 4, & 7 looks remarkably familiar, just one step shorter:
-
Binary search the
kindcolumn to find these matching entries.kindcolourlocations knork mahogany 2, 7 knork salmon 4
The index is applicable for searching by kind by just doing the first step of the search process, and ignoring that the colour column even exists.
Column-lookup: colour
However, it doesn’t work for WHERE colour = 'mahogany': the table is not globally ordered by colour. We can see this viscerally:
colour |
locations |
|---|---|
| mahogany | 5 |
| zebra | 3 |
| mahogany | 2, 7 |
| salmon | 4 |
| aqua | 1 |
| periwinkle | 6 |
That’s the original ‘index’ table, ignoring the irrelevant (for this query) kind column. The colour column is in total disarray, with mahogany appearing both after zebra and before aqua.
Without sorted data, binary search can’t help us. Without binary search, this index can do nothing!
This flat table analogy thus gives some intuition for why our colour-only query remains slow, even with the multi-column index: there’s no sorted data for look-ups to use.
Conceit
I’ve tailored the numbers here: the benchmarks above were run with PostgreSQL (PG) 17. The slow colour-only comparison looks quite different with PG 181, but the other two barely change:
| Query | PG | No index (ms) | With index (ms) | Speedup |
|---|---|---|---|---|
| Both | 17.7 | 420 | 0.009 | 46700× |
| 18.1 | 399 | 0.008 | 49800× | |
| Kind-only | 17.7 | 403 | 4.79 | 84× |
| 18.1 | 386 | 3.76 | 103× | |
| Colour-only | 17.7 | 480 | 492 | 1× |
| 18.1 | 466 | 15.6 | 30× |
The last row shows our colour-only query gets faster with the multi-column index in the newer PostgreSQL.
It is still 4× slower than the superficially-similar kind-only query, and its speedup with the index is much smaller (“only” 30× instead of 103×)… but it is clear that the multi-column index is now helpful, despite our query being on the second colum only.
This seemingly violates what we just learned: we can’t binary search on the colour column with our kind-then-colour index? Has the index structure changed? Can we no longer use a sorted table analogy to guide our intuition?
No, the index data structure is the same, but Postgres 18 has learned a ‘skip scan’ method for using the index. It’s a nifty optimisation that relies on the later columns not actually being random: those columns are partially sorted, with runs of sorted data. The flat table analogy is still relevant for building an intuition for this optimisation…
… But this article is already long enough, and it is easier to build a mental model starting from the stark difference visible in PostgreSQL 17, before moving on to the fancier PostgreSQL 18 behaviour.
Conclusion
Like any index in a database, a multi-column index makes queries faster. It’s an index on multiple columns, but it can even make queries only involving some of those columns faster, but with differing behaviour depending on which subset of columns are involved.
Thinking about the index as an ordered table of the indexed column’s data gives some intuition for why a query with only the first indexed column is much faster than a query with the second (or later) index column.
-
Yeah, looking for a red-brown-ish knife-fork is common. ↩
-
Benchmark methodology, for all numbers in this article:
- Software: PostgreSQL 17.7 and 18.1 running via Podman (results are 17.7 unless otherwise stated); Python 3.13.11 & psycopg 3.3.2 running natively; M1 MacBook Pro; macOS 26.0;
uv run bench.py. - Scenario: 20 million rows inserted into the
cutlerytable structure above, usingrandom(1, 2000)::textfor bothkindandcolourcolumns (independently). - Queries:
SELECT * FROM cutlery WHERE kind = '17' AND colour = '630'and the two single-column versions of that. - Measurement: for each combination of database version, query and indices: create the indices (if required), 3 unmeasured warm-up executions, then 15 measured ones with
EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF, BUFFERS OFF), parse the database’s reported time from the output and take the minimum of the 15. - Code and results.
Full code reproduced here, too.
# /// script # requires-python = ">=3.12" # dependencies = [ # "psycopg[binary]>=3.2,<4", # ] # # [tool.uv] # exclude-newer = "2026-02-04T00:00:00Z" # /// import json import subprocess import sys import textwrap import time import psycopg # Data/query details INIT = """ DROP TABLE IF EXISTS cutlery; CREATE TABLE cutlery ( id BIGSERIAL PRIMARY KEY, -- fork, spoon, knife, straw, strork, etc. kind text, -- silver, red, puce, etc. colour text ); INSERT INTO cutlery (kind, colour) SELECT random(1, 2000)::text, random(1, 2000)::text FROM generate_series(1, 20000) AS t(id); """ QUERIES = { "Both": "SELECT * FROM cutlery WHERE kind = '17' AND colour = '630'", "Kind": "SELECT * FROM cutlery WHERE kind = '17'", "Colour": "SELECT * FROM cutlery WHERE colour = '630'", } # Benchmarked scenarios IDX_KIND_ONLY = ( "CREATE INDEX idx_cutlery_kind ON cutlery (kind)", "DROP INDEX IF EXISTS idx_cutlery_kind", ) IDX_COLOUR_ONLY = ( "CREATE INDEX idx_cutlery_colour ON cutlery (colour)", "DROP INDEX IF EXISTS idx_cutlery_colour", ) IDX_MULTI = ( "CREATE INDEX idx_cutlery_kind_colour ON cutlery (kind, colour)", "DROP INDEX IF EXISTS idx_cutlery_kind_colour", ) # name -> list of (setup, teardown) pairs SCENARIOS = { "No index": [], "Kind-only": [IDX_KIND_ONLY], "Colour-only": [IDX_COLOUR_ONLY], "Both onlys": [IDX_KIND_ONLY, IDX_COLOUR_ONLY], "Multi-col": [IDX_MULTI], } # Benchmark metadata WARMUP = 3 RUNS = 15 # Container info CONTAINER = "bench-pg" PORT = 25432 PASSWORD = "postgres" CONNINFO = f"postgresql://postgres:{PASSWORD}@localhost:{PORT}/postgres" IMAGES = [ "postgres:17.7@sha256:2006493727bd5277eece187319af1ef82b4cf82cf4fc1ed00da0775b646ac2a4", "postgres:18.1@sha256:3bb77a07b9ce8b8de0fe6d9b063adf20b9cca1068a1bcdc054cac71a69ce0ca6", ] def start_postgres(image): # in case a previous run didn't finish cleanly stop_postgres() print(f" Starting postgres using image {image}...") subprocess.run( f"podman run -d --name {CONTAINER} -e POSTGRES_PASSWORD={PASSWORD} -p {PORT}:5432 {image}", shell=True, ) # Give it a bit of time to become ready, and then start polling for readiness time.sleep(1) for i in range(10): r = subprocess.run( f"podman exec {CONTAINER} pg_isready -U postgres -q", shell=True ) if r.returncode == 0: return time.sleep(0.2) raise Exception("postgres never became ready") def stop_postgres(): print(" Stopping postgres...") subprocess.run(f"podman rm -f {CONTAINER}", shell=True) def run_scenario(cur, name, stmts): # Setup for setup, _teardown in stmts: cur.execute(setup) for query_name, query in QUERIES.items(): for _ in range(WARMUP): cur.execute(query) cur.fetchall() # Benchmark times = [] for _ in range(RUNS): cur.execute( f"EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF, BUFFERS OFF) {query}" ) for (row,) in cur.fetchall(): # lines like 'Execution Time: 3.750 ms' if row.startswith("Execution Time"): times.append(float(row.split(":")[1].strip().split()[0])) # Report/teardown times.sort() yield { "index": name, "query": query_name, "min": times[0], "median": times[len(times) // 2], "max": times[-1], "times": times, } for _setup, teardown in stmts: cur.execute(teardown) def run_benchmark(): results = [] for image in IMAGES: print(f"Running {image}...") start_postgres(image) image_results = [] with psycopg.connect(CONNINFO) as conn: conn.autocommit = True with conn.cursor() as cur: print(" Initialising...") cur.execute(INIT) cur.execute("VACUUM ANALYZE") print(" Benchmark...") for name, stmts in SCENARIOS.items(): image_results.extend(run_scenario(cur, name, stmts)) results.append({"image": image, "results": image_results}) data = { "versions": { "python": sys.version, "psycopg": psycopg.__version__, }, "metadata": { "INIT": INIT, "QUERIES": QUERIES, "SCENARIOS": SCENARIOS, "WARMUP": WARMUP, "RUNS": RUNS, "IMAGES": IMAGES, }, "results": results, } print(json.dumps(data, indent=2)) if __name__ == "__main__": try: run_benchmark() finally: stop_postgres() - Software: PostgreSQL 17.7 and 18.1 running via Podman (results are 17.7 unless otherwise stated); Python 3.13.11 & psycopg 3.3.2 running natively; M1 MacBook Pro; macOS 26.0;
-
This table even shows the colour-only appearing to get 2.5% slower, from 480ms to 492ms, but that’s well within error bars and likely just noise. ↩
-
At least, binary search conceptually: in practice, these indices are typically implemented as B or B+ trees. ↩
-
The pedagogically-convenient locations column represents whatever information the database engine needs to find the actual data stored for the rows corresponding to a given index entry, such as the primary key (MySQL), tuple identifier (PostgreSQL), or row ID (SQLite). The specifics seem to vary between DBMSes, and aren’t important for understanding how the “multi-column” part of an multi-column index works. ↩