Correlated to Complete: A SQL Subquery Refactor
I recently constructed a SQL query that preformed rather poorly. The query involved selecting about 10,000 records with a complicted subquery. It was noticeably slow and needed to be improved. It was easy to see that the subquery was running for each record, like a loop. This is known as a correlated subquery. The refactor involved removing the loop and executing the subquery just once. Here’s how it went down:
Correlated Subquery
A correlated subquery is like a loop. It will executed once for each row in the outer query. Thus, correlated subqueries can be costly. Here is a somewhat contrived example using Postgres:
Here, we want the average relative_velocity
of the close approaches for each
asteroid. The avg_velocity
subquery is correlated since it runs for each
selected asteroid in the outer query.
If you’re wondering about the relative_velocity->> 'miles_per_hour'
syntax,
that’s just PostgreSQL JSON operator
for selecting the miles_per_hour
key from the relative_velocity
JSON field.
To demonstrate the expense of such a query, for let’s say 10,000 asteroids
each with several close approaches, we can use the EXPLAIN
function on the
query. Using EXPLAIN
will give us a ton of information about the full
execution plan, but we’re just interested in the Total Cost
for this
demonstration. We’ll use the Total Cost
as a benchmark comparison when
optimizing to use a complete subquery below.
By the way, I am using EXPLAIN (FORMAT JSON)
because I find the format a bit
easier to read. Here are the results for the above query:
There’s a lot more to this, but I’m only showing the first part with
the Total Cost
, which is 236933.34
. Let’s see how much lower our cost can
be when we refactor this query to get rid of the correlated subquery in favor
of a complete subquery.
Complete Subquery
A complete subquery (also just known as a subquery), can manifest itself as a
JOIN
. We’ll take this JOIN
approach and refactor our query to join
asteroids to the result of a subquery like this:
Here, in addition to the avg_velocity
computation, our subquery also selects
the asteroid_id
from close_approaches
, since we use it to join back to
asteroids and group the average velocities per asteroid. The important thing here
is that the subquery runs only once, rather than for each asteroid like the first
example. All we need to do is join the asteroids
to the result of the subquery.
Overall, re-writing the query this way makes it a bit more verbose and complex, but lets see how it performs:
EXPLAIN (FORMAT JSON)
…
In the first example the Total Cost
was 236933.34
. After refactoring to a
complete subquery we’re down to a Total Cost
of 8600.67
. That’s over 27
times cheaper! 🙂
Conclusion
Correlated subqueries aren’t always bad. Sometimes they may perform well enough.
But if you’re finding it too slow, look to refactoring the query to use a
single subquery on which to join. Remember to use EXPLAIN
for comparison
bench-marking.