Tips to Improve the Performance of a PostgreSQL Data Warehouse

BatCat · Oct 14, 2019 · 2 min read


Postgres is a very powerful DBMS and it’s time for some advanced features to enjoy the spotlight.

Motivation

The company I currently work for started having trouble with its data warehouse. To better understand our data, we use a BI tool and create fancy dashboards with colourful charts.

As we grew, our data grew rapidly too, and the otherwise ‘ok’ queries began to slow down. Slow queries became a real problem when execution times hit the BI tool’s limit: the timeout window. If a query does not return a result within ~4 minutes, our BI tool gives up waiting and cancels it. Increasing the timeout window by 50% helped for a while, but as the data kept growing rapidly, more and more charts were assassinated.

Buying better hardware was one solution that worked, at least for a while. We were quite literally just buying time, because some tables doubled in size every 3–4 months. Paying more and more every couple of months was clearly not a long-term solution.

Solution

I read lots of blog posts and books, and re-read my notes from my university years, to figure out how to speed up our queries and get the most out of our DBMS.

I came up with a recipe which you may also use if you want to speed things up in your data warehouse:

  • use more materialized views if you can handle stale data,
  • add indexes to materialized views (the first sketch after this list shows both),
  • migrate data-related cronjobs that depend on each other to Airflow (you get better control of your workflow),
  • check whether your indexes are actually used and drop the ones that are not (second sketch below),
  • create indexes only on tables that do not see frequent update activity,
  • use VACUUM regularly to clean up dead rows, or VACUUM FULL if really necessary (third sketch below),
  • do not turn off autovacuum,
  • look for database locks: check pg_stat_activity (fourth sketch below),
  • when writing queries, SELECT only the columns you need,
  • sort only when necessary,
  • be careful with CTEs and subqueries (which one is more beneficial depends on your use case and your Postgres version; fifth sketch below),
  • record slow queries in your database and check them: hit EXPLAIN, stare at the execution plan for a while, and try to understand what it says (sixth sketch below), and finally,
  • learn to write better queries: pick a slow query and try reducing its cost.
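
First sketch: the materialized-view tips, made concrete. The orders table and its order_date and amount columns are hypothetical; the rollup is precomputed once, indexed like an ordinary table, and refreshed on a schedule.

```sql
-- Hypothetical rollup: precompute daily revenue instead of aggregating
-- the raw orders table on every dashboard load.
CREATE MATERIALIZED VIEW daily_sales AS
SELECT order_date,
       count(*)    AS order_count,
       sum(amount) AS revenue
FROM   orders
GROUP  BY order_date;

-- Indexes work on materialized views just like on tables. A unique index
-- also enables REFRESH ... CONCURRENTLY, which does not block readers.
CREATE UNIQUE INDEX daily_sales_date_idx ON daily_sales (order_date);

-- Re-run periodically (e.g. from a scheduler); the data is stale in between.
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales;
```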
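
Second sketch: finding unused indexes. The statistics collector already tracks this in pg_stat_user_indexes, so no extra setup is needed.

```sql
-- Indexes that have never been scanned since statistics were last reset.
-- Check replicas too before dropping anything: an index unused on the
-- primary may still serve reads elsewhere.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  idx_scan = 0
ORDER  BY pg_relation_size(indexrelid) DESC;
```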
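
Third sketch: vacuuming. A plain VACUUM reclaims dead rows without an exclusive lock; VACUUM FULL rewrites the whole table under an exclusive lock, so keep it as a last resort. The table name here is a placeholder.

```sql
-- Reclaim dead rows and refresh planner statistics in one pass.
-- (my_big_table is a placeholder name.)
VACUUM (VERBOSE, ANALYZE) my_big_table;

-- Check how far behind autovacuum is before reaching for VACUUM FULL,
-- which rewrites the table under an ACCESS EXCLUSIVE lock.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC;
```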
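
Fourth sketch: spotting locks. pg_stat_activity can be joined with pg_blocking_pids() (available since PostgreSQL 9.6) to show who is blocking whom.

```sql
-- Sessions that are waiting on a lock, paired with the sessions
-- holding the locks they wait for.
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM   pg_stat_activity AS blocked
JOIN   pg_stat_activity AS blocking
  ON   blocking.pid = ANY (pg_blocking_pids(blocked.pid));
```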
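
Fifth sketch: CTEs versus subqueries. Up to PostgreSQL 11, a CTE was an optimization fence: it was always materialized, so filters outside it could not be pushed down. PostgreSQL 12, released around the time of this post, started inlining simple CTEs. The events table and user_id column are placeholder names.

```sql
-- On PostgreSQL 11 and earlier this CTE is materialized in full,
-- so the user_id filter is applied only afterwards.
WITH recent AS (
    SELECT * FROM events
)
SELECT * FROM recent WHERE user_id = 42;

-- The equivalent subquery lets the planner push the filter down into
-- the scan on events (on PG 12+ the CTE above is inlined too).
SELECT *
FROM   (SELECT * FROM events) AS e
WHERE  e.user_id = 42;
```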
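
Sixth sketch: recording and inspecting slow queries. One common setup is to log anything over a threshold and then EXPLAIN the offenders; the 2-second threshold is just an example value, and orders is the same hypothetical table as above.

```sql
-- Log every statement slower than 2 seconds (a postgresql.conf setting;
-- ALTER SYSTEM writes it without editing the file by hand).
ALTER SYSTEM SET log_min_duration_statement = '2s';
SELECT pg_reload_conf();

-- EXPLAIN ANALYZE actually runs the query and reports the real plan,
-- per-node timings, and row-count estimates versus reality.
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_date, sum(amount)
FROM   orders
GROUP  BY order_date;
```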


Written by BatCat

My mission is to share some of the solutions I find during my journey as a data engineer. I mostly write about PostgreSQL and Python.
