Postgres is a very powerful DBMS and it’s time for some advanced features to enjoy the spotlight.
Motivation
The company I am currently working for started having troubles with their data warehouse. To have a better understanding of our data, we use a BI tool and create fancy dashboards with colourful charts.
As we grew, our data started growing rapidly and the otherwise ‘ok’ queries started to slow down. Slow queries started to become a problem when the execution time hit the BI tool’s limits: the timeout window. If a query does not return any result in ~4 minutes, our BI tool gives up on waiting and cancels the query. Increasing the timeout window by 50% helped for a while but as the data continued growing rapidly, more and more charts were assassinated.
Buying better hardware was one solution that worked. At least for a while. Literally, we were just buying time because some tables doubled in size in every 3–4 months. Paying more and more every couple of months was clearly not a long-term solution.
Solution
I read lots of blogs, books and re-read my notes from my university years to figure out how to speed our queries up and get the max out of our DBMS.
I came up with a recipe which you may also use if you want to speed things up in your data warehouse:
- use more materialized views if you can handle stale data,
- add indexes to materialized views
- migrate data-related cronjobs which are dependent on each other to Airflow (you get better control of your workflow)
- check if the indexes are used and drop the ones that are not
- create indexes only on tables which do not have a frequent update activity
- use VACUUM regularly to cleanup dead rows (or FULL VACUUM if really necessary),
- do not turn off autovacuum,
- look for database locks — check pg_stat_activity,
- when writing queries, SELECT only what is necessary,
- sort only when necessary,
- be careful with CTEs and subqueries (it depends on your use case which is more beneficial!),
- record slow queries in your database and check them: hit EXPLAIN, stare at the execution plan for a while, try to understand what it says and finally,
- learn to write better queries: pick a slow query and try reducing the cost