Inside the SQL Query Planner: What Happens When You Hit Execute?
An in-depth exploration of the engine that drives your SQL query results.
Ever wondered what really happens when you hit ‘Execute’ on a SQL command? Curious about the hidden algorithms that bring your queries to life? Or how you can craft your SQL for maximum efficiency?
In this article, we’ll dive deep into the fascinating journey your SQL query takes within PostgreSQL, uncovering the query planner and the strategies behind optimal query writing.
1. The Command: Crafting Your SQL Query
SQL is a declarative language, which means you specify the desired outcome of your query. You describe the result you want, and the database system determines the most efficient way to retrieve or manipulate the data. This allows you to focus on what you need, while the database handles the complexities of how to get there.
Let’s imagine your client application (DBeaver, pgAdmin, Metabase, or another) is connected to your PostgreSQL server, and you’ve just written your query. After hitting ‘Execute’, you’re now waiting for the results to come back. So what exactly is Postgres doing while you wait?
+------------------+
| Raw SQL Query |
+------------------+
|
v
+------------------+
| Parser | <--- Converts SQL text into an Abstract Syntax Tree (AST)
+------------------+
|
v
+------------------+
| Analyzer | <--- Validates AST for schema correctness, type compatibility, and permissions
+------------------+
|
v
+------------------+
| Query Rewriter | <--- Transforms the query for optimization (e.g., query simplifications)
+------------------+
|
v
+------------------+
| Query Optimizer | <--- Optimizes the query execution plan (e.g., choosing indexes)
+------------------+
|
v
+------------------+
| Execution Plan | <--- Final plan for executing the query
+------------------+
|
v
+------------------+
| Executor | <--- Executes the query using the execution plan
+------------------+
|
v
+------------------+
| Results/Output |
+------------------+
Query Parsing
PostgreSQL starts by handing your command to the Parser. The parser breaks the query down into individual tokens and checks the text for syntactic correctness, identifying and categorizing tokens based on recognized keywords such as SELECT, UPDATE, and others.
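PostgreSQL’s actual parser is generated from a formal grammar (via Bison), so the following is only a hypothetical Python sketch of what tokenization means; the keyword list and regular expression below are illustrative, not PostgreSQL’s:

```python
import re

# Hypothetical sketch: break SQL text into categorized tokens.
# PostgreSQL's real parser is grammar-generated and far more complete.
KEYWORDS = {"SELECT", "FROM", "WHERE", "UPDATE", "SET"}
TOKEN_RE = re.compile(r"\s*('(?:[^'])*'|[A-Za-z_][A-Za-z0-9_]*|=|,|;|\*)")

def tokenize(sql):
    tokens = []
    pos = 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        text = m.group(1)
        kind = ("keyword" if text.upper() in KEYWORDS
                else "literal" if text.startswith("'")
                else "identifier" if text[0].isalpha() or text[0] == "_"
                else "symbol")
        tokens.append((kind, text))
        pos = m.end()
    return tokens

tokens = tokenize("SELECT first_name FROM employees WHERE department = 'Sales';")
```

Running this classifies SELECT, FROM, and WHERE as keywords, first_name and employees as identifiers, and 'Sales' as a literal, which is the raw material the next stages work with.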
Query Validation
After the parser completes its work, the query moves to the validation phase. At this point, the analyzer takes over and operates on the Abstract Syntax Tree (AST) produced by the parser. The role of the analyzer is to ensure that the query is meaningful and executable within the database context.
The analyzer performs several critical tasks to achieve this. First, it verifies that all tables, columns, and functions referenced in the query actually exist and are accessible. This is known as schema validation. Next, the analyzer checks that the data types used in the query are compatible and that any operations on these types are valid, which is referred to as type checking. Additionally, the analyzer ensures that the user executing the query has the necessary permissions to access and modify the referenced tables and data, a process known as permission checking.
Once these validations are complete, PostgreSQL can be confident that the query satisfies the SQL syntax rules, is logically valid, and can be executed correctly without errors related to non-existent schema objects or type mismatches. The validated AST then becomes the foundation for further processing, like query rewriting and optimization.
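To make the schema-validation idea concrete, here is a hypothetical sketch that checks references against a toy in-memory catalog; PostgreSQL’s analyzer performs the equivalent lookups against its system catalogs (e.g. pg_class, pg_attribute) instead:

```python
# Hypothetical sketch: validate table and column references against a
# toy catalog. The table/column names here are illustrative only.
CATALOG = {
    "employees": {"first_name": "text", "last_name": "text", "department": "text"},
}

def validate(table, columns):
    if table not in CATALOG:
        raise ValueError(f'relation "{table}" does not exist')
    for col in columns:
        if col not in CATALOG[table]:
            raise ValueError(f'column "{col}" does not exist')
    return True

ok = validate("employees", ["first_name", "last_name"])
```

A query referencing a missing column would fail here with an error much like the one PostgreSQL reports before any execution happens.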
Let’s assume that this was the query you wrote and was received by the parser:
SELECT first_name, last_name
FROM employees
WHERE department = 'Sales';
The AST for the query above looks like this
SELECT
├── Columns: [first_name, last_name]
├── FROM: employees
└── WHERE: department = 'Sales'
Now a more complex example:
WITH department_sales AS (
SELECT department, SUM(sales) AS total_sales
FROM employees
WHERE sales > 1000
GROUP BY department
),
top_departments AS (
SELECT department
FROM department_sales
WHERE total_sales > 50000
),
final_result AS (
SELECT e.first_name, e.last_name, e.department, e.sales
FROM employees e
JOIN top_departments td ON e.department = td.department
WHERE e.hire_date > '2020-01-01'
)
SELECT *
FROM final_result
ORDER BY sales DESC
LIMIT 10;
And its corresponding AST:
WITH
├── department_sales (CTE)
│   └── SELECT
│       ├── Columns: [department, SUM(sales) AS total_sales]
│       ├── FROM: employees
│       ├── WHERE: sales > 1000
│       └── GROUP BY: department
├── top_departments (CTE)
│   └── SELECT
│       ├── Columns: [department]
│       ├── FROM: department_sales
│       └── WHERE: total_sales > 50000
└── final_result (CTE)
    └── SELECT
        ├── Columns: [e.first_name, e.last_name, e.department, e.sales]
        ├── FROM: employees e
        ├── JOIN: top_departments td ON e.department = td.department
        └── WHERE: e.hire_date > '2020-01-01'
SELECT
├── Columns: [*]
├── FROM: final_result
├── ORDER BY: sales DESC
└── LIMIT: 10
The AST examples show how the database represents queries internally: simple queries produce straightforward trees, while complex queries result in larger, often deeply nested ASTs.
Can you see the AST from your client?
The AST is not directly exposed to end users, but you can get at this information by enabling detailed debugging or by combining tools and extensions that let you visualize the query structure.
Running EXPLAIN will give you the execution plan generated by the query planner; it tells you how the query will be executed:
EXPLAIN SELECT first_name, last_name FROM employees WHERE department = 'Sales';
Running EXPLAIN ANALYZE will execute the query and display the execution plan along with actual execution times:
EXPLAIN ANALYZE SELECT first_name, last_name FROM employees WHERE department = 'Sales';
If you are comfortable with changing the server configuration, you can enable query debugging to get the parse tree. Setting these parameters will log the parse tree structure to the Postgres logs whenever a query is executed:
debug_print_parse = on
log_statement = 'all'
To see how the query is parsed and rewritten, you can also turn on parser statistics logging:
SET log_statement = 'all';
SET log_parser_stats = 'on';
WARNING!
In a production environment, enabling the logging and debugging options above can be dangerous because of performance degradation, security risks, and log management challenges!
- Logging all statements will increase the amount of logging data, leading to higher I/O operations on the disk. This includes all queries, DML operations (INSERT, UPDATE, DELETE), and even simple SELECTs.
- Logging parser stats and a verbose debug output will quickly consume disk space.
- Sensitive information might be exposed through the logs.
- Noise in the logs will make it difficult to find useful information.
- Backups and restores might also slow down due to logging.
… and the list goes on.
Early Optimization
After validation, the Rewriter takes the parsed query and performs initial optimizations and transformations to simplify and restructure it. This process can improve the query’s efficiency before it reaches the query planner. The rewriter’s tasks include simplifying expressions, expanding views into their underlying queries, and flattening subqueries.
These early optimizations help to simplify and improve the query to prepare it for the more detailed optimization performed by the query planner, setting the stage for more informed and effective planning decisions.
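One of the rewriter’s signature transformations is expanding a view into its underlying query. Here is a hypothetical Python sketch of that idea using naive string substitution; the view name and substitution mechanics are illustrative only, as PostgreSQL rewrites the query tree, not the SQL text:

```python
# Hypothetical sketch of view expansion: substitute a view reference with
# its defining query, roughly what the PostgreSQL rewriter does for views.
VIEWS = {
    "sales_staff": "SELECT first_name, last_name FROM employees "
                   "WHERE department = 'Sales'",
}

def expand_views(sql):
    for view, definition in VIEWS.items():
        # Replace "FROM view" with "FROM (definition) AS view"
        sql = sql.replace(f"FROM {view}", f"FROM ({definition}) AS {view}")
    return sql

rewritten = expand_views("SELECT first_name FROM sales_staff")
```

After expansion, the planner sees a plain query over the employees table and can optimize it as a whole, rather than treating the view as an opaque object.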
2. The Plan: Decoding the Query Planner’s Role
Execution Plan
An execution plan is a blueprint created by the database system for executing the query. It defines the steps the database will take to retrieve or modify data: how tables are accessed, how data is joined, the sequence of operations, and the methods used to process the data at each step. By following the execution plan, the database aims to execute the query as efficiently as possible.
PostgreSQL converts the AST into a detailed execution plan.
The planner evaluates various execution strategies: it considers applying different join algorithms and access methods to find the most efficient way to execute the query. The result of this step should ideally be a plan that minimizes resource usage and execution time.
Can you influence the execution plan?
Yes. There are a few obvious things that can influence the execution plan. For example, using the most efficient joins and selecting only the necessary columns helps PostgreSQL generate better plans, and creating indexes on frequently queried columns speeds up data retrieval.
A less obvious approach is to understand how table statistics are collected and used, and how to refresh them.
You can also adjust PostgreSQL configuration settings and run the VACUUM and ANALYZE commands, which can enhance the accuracy of the execution plan by updating table statistics and reclaiming dead space.
For more about vacuuming, check out these articles:
- Article 1 about such operations in general, detailing standard VACUUM, VACUUM FULL, VACUUM ANALYZE
- Article 2 about VACUUM VERBOSE, VACUUM FREEZE
- Article 3 about VACUUM ONLY DATABASE STATS, VACUUM DISABLE_PAGE_SKIPPING
Optimization
To optimize the query, Postgres uses a component called the Query Optimizer, which applies several techniques to improve query performance.
First, Postgres estimates the resources needed for different ways of executing the query. It tries to predict the amount of disk I/O and CPU time each plan will require. Postgres uses statistics about table sizes, index effectiveness, and data distribution to make these estimates.
Next, Postgres selects the execution plan with the lowest estimated cost. This plan is expected to be the most efficient in terms of resource usage.
During plan generation, Postgres explores different strategies for executing the query: the methods for accessing and joining data. It’s important to distinguish between the JOIN operations specified in your query and the join algorithms used to execute them. A join algorithm is the method a database management system uses to combine rows from two or more tables based on a related column. When tables need to be joined, the join algorithm determines how the database engine will find and merge the matching rows efficiently. The most common access methods are sequential scans and index scans, while the most common join algorithms are nested loop joins, hash joins, and merge joins.
During cost analysis, the system evaluates how much each strategy will cost. Based on the results, the plan that provides the best performance and efficiency is picked and used to execute the query.
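To make the cost comparison concrete: PostgreSQL’s documented cost model for a sequential scan charges seq_page_cost per disk page plus cpu_tuple_cost per row (defaults 1.0 and 0.01). A hypothetical sketch of picking the cheapest plan might look like this; the page/row counts and the index-scan figure are made up for illustration:

```python
# Sketch of PostgreSQL's documented sequential-scan cost formula:
#   pages * seq_page_cost + rows * cpu_tuple_cost
# The index-scan figure below is an assumed estimate, not computed
# from PostgreSQL's real index cost formula.
SEQ_PAGE_COST = 1.0    # default seq_page_cost
CPU_TUPLE_COST = 0.01  # default cpu_tuple_cost

def seq_scan_cost(pages, rows):
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST

candidate_costs = {
    "seq_scan": seq_scan_cost(pages=100, rows=5000),  # 100*1.0 + 5000*0.01
    "index_scan": 180.0,                              # assumed estimate
}
# The planner keeps the plan with the lowest estimated total cost.
best_plan = min(candidate_costs, key=candidate_costs.get)
```

This is why the same query can get different plans on different databases: the page and row counts come from your table’s statistics, so the winning strategy changes as the data does.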
To see the join algorithm and costs for your query, you can again use the EXPLAIN command. Let’s assume that this is your query:
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.id;
This is the query with the EXPLAIN command:
EXPLAIN
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.id;
This is one possible output, based on your system’s statistics available at the moment of execution:
Nested Loop  (cost=0.00..35.50 rows=100 width=32)
  ->  Seq Scan on employees e  (cost=0.00..15.50 rows=100 width=16)
  ->  Index Scan using departments_pkey on departments d  (cost=0.00..0.20 rows=1 width=16)
        Index Cond: (id = e.department_id)
The output is represented as a hierarchical structure of nodes and subnodes. Each node represents an operation in the query execution plan, and subnodes represent the operations performed as part of that node.
In the example, the top-level node contains the information that a nested loop join is planned. The subnodes contain information that the table “employees” will be sequentially scanned, and the table “departments” will be scanned using the existing index on the primary key.
The cost values are internal metrics used by PostgreSQL to estimate the computational resources required. They are not directly related to real-world units like seconds but are useful for comparing different query plans. Each node shows two cost values: the first is the startup cost, the second is the total cost. The startup cost is the estimated cost incurred before the node can return its first row; the total cost is the estimated cost to return all of the node’s rows.
The rows value (100) is the number of rows the planner expects the current node to return.
The width is the average size of each row in bytes. In this case, 32 bytes is the estimated average width of each row returned by this node. This helps Postgres estimate how much memory or disk space will be needed to store the result set.
If you run your query with EXPLAIN ANALYZE, the query will be executed and you will get both the estimated and actual performance metrics.
Example:
EXPLAIN ANALYZE
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.id;
The query above will provide an output with the following format:
Nested Loop  (cost=0.00..35.50 rows=100 width=32) (actual time=0.015..0.028 rows=100 loops=1)
  ->  Seq Scan on employees e  (cost=0.00..15.50 rows=100 width=16) (actual time=0.010..0.018 rows=100 loops=1)
  ->  Index Scan using departments_pkey on departments d  (cost=0.00..0.20 rows=1 width=16) (actual time=0.001..0.001 rows=1 loops=100)
        Index Cond: (id = e.department_id)
Planning Time: 0.126 ms
Execution Time: 0.052 ms
There are some differences. Each node now includes the actual time needed to execute it, from start to finish, and the rows value next to it shows the actual number of rows the node returned (likewise for sub-nodes). As before, the Index Cond line shows the condition used for the index scan: the id from departments must match the department_id from employees. You can also see the planning time, which is the time taken to plan the query, and the execution time, which is the total time taken to execute it, both in milliseconds.
Your query plan will likely differ from this example. The exact execution plan, including the nodes and their cost estimates, can vary based on your database’s statistics, such as table sizes, index availability, and data distribution.
TIP: Try experimenting with different ways to write your query and check if cost grows or decreases!
3. The Execution: Bringing the Plan to Life
The Executor is responsible for executing the plan generated by the Query Planner. It performs the actual data retrieval, processes the data according to the plan, and produces the final result.
The Executor systematically traverses and executes each node of the chosen execution plan, handling data operations as specified at each step. The operations performed at each node are data retrieval, data processing, and intermediate-result handling. During data retrieval, tables or indexes are accessed to fetch the relevant rows. During data processing, filters, joins, aggregations, and sorting are applied. If subsequent nodes require the intermediate results, those results are passed along to them as defined in the execution plan.
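This pull-based style of plan execution (often called the Volcano model) can be sketched with Python generators; this is a simplified toy model with made-up data, not PostgreSQL’s actual C executor:

```python
# Simplified pull-based executor: each plan node pulls rows from its
# child, processes them, and passes results up the plan tree.
def seq_scan(table):
    for row in table:          # data retrieval: read every row
        yield row

def filter_node(child, predicate):
    for row in child:          # data processing: apply a WHERE filter
        if predicate(row):
            yield row

def project(child, columns):
    for row in child:          # keep only the requested columns
        yield {c: row[c] for c in columns}

# Toy table standing in for the employees relation.
employees = [
    {"first_name": "Ada", "last_name": "Lovelace", "department": "Sales"},
    {"first_name": "Alan", "last_name": "Turing", "department": "IT"},
]

# Equivalent of: SELECT first_name FROM employees WHERE department = 'Sales'
plan = project(
    filter_node(seq_scan(employees), lambda r: r["department"] == "Sales"),
    ["first_name"],
)
result = list(plan)  # [{'first_name': 'Ada'}]
```

Because each node is a generator, rows stream through the tree one at a time, which is how a plan can start emitting results before the underlying scan has finished.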
When the Executor is done with the final node, it compiles the collected result set (including formatting) and returns it to the client.
Meanwhile, the Executor also manages system resources (memory, CPU) to make sure that the query execution doesn’t impact the performance of the DBMS.
After returning the results, the cleanup process is next. The executor releases memory and other resources to avoid leaks.
If an error occurs during query execution, PostgreSQL aborts the statement and rolls back the transaction. This ensures data consistency and database integrity: the rollback undoes any partial changes made during the transaction.
Thank you for reading! If you found this article helpful, please hit the clap button to show your support. If you’d like to see more articles on database management and query optimization, leave a comment with your suggestions.
Yours, Batcat