In today’s data-driven world, data is as valuable as gold. It powers decision-making, strategy, and growth, yet many companies find themselves struggling to harness the full potential of their data.
Recently, I found myself in a conversation with a potential client who was facing just such a challenge and was ready to explore growth. They were relying solely on a structured database for their production needs, with no advanced analytics or data warehouse in sight, connecting their BI tools directly to their one and only database. During our conversation, it became clear that they were missing out on the transformative power of centralized, organized data.
A data warehouse isn’t just a luxury — it’s a strategic necessity for businesses to thrive in today’s competitive landscape.
As we dive deeper into the importance of data warehouses for businesses, it’s crucial to understand the basics: the roles of OLTP and OLAP databases.
OLTP and OLAP databases
OLTP and OLAP are two different types of databases used for distinct purposes.
OLTP (Online Transaction Processing) databases are designed for transaction-oriented applications. Such a database contains current, operational data that is crucial for the functioning of an application or organization. Your production database is an OLTP database.
OLAP (Online Analytical Processing) databases are designed for analytical and decision-support applications. Your data warehouse should be an OLAP database.
There are many differences between these two types.
Data Structure
The first difference that comes to mind is the data structure, or schema. In an OLTP database, data is normalized: we aim to reduce data redundancy and ensure data integrity and consistency, which improves the efficiency of transactions.
In an OLAP database, the approach differs: we sometimes denormalize to optimize read performance. Tables are often pre-joined from multiple sources or pre-aggregated, resulting in summary tables. Summary tables keep analytical queries smaller and more manageable, making them simpler and less prone to timeouts.
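To make this concrete, here is a minimal sketch in Python with psycopg2; the orders table, its columns, and the connection string are hypothetical placeholders. It pre-aggregates daily totals into a summary table that dashboards can query instead of the raw data:

# a minimal sketch, assuming a hypothetical `orders` table in PostgreSQL
import psycopg2

SUMMARY_SQL = """
CREATE TABLE IF NOT EXISTS daily_order_summary AS
SELECT order_date,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_amount
FROM orders
GROUP BY order_date;
"""

with psycopg2.connect("dbname=warehouse") as conn:
    with conn.cursor() as cur:
        cur.execute(SUMMARY_SQL)  # dashboards now query the small summary table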
Data Volume
As mentioned before, OLTP databases prioritize minimizing data redundancy and ensuring consistency to support efficient transaction processing. Data is validated against predefined rules before it is persisted, leading to a relatively small data volume compared to OLAP databases.
In contrast, data in OLAP databases originates from diverse sources, each organizing data differently. Redundancies and inconsistencies are frequent, and companies often retain the data in its original format for reference while using the cleansed version in further analysis. As data accumulates over time, we end up with large datasets, and OLAP databases need to handle and process them efficiently.
Querying
Users expect applications to be fast: time is money, and speed is often a critical selling point. One of the slowest operations in an application is waiting for a server to respond to a user request. Loading too much information hurts the user experience because waiting time grows, so we aim to fetch only the necessary data. Queries in OLTP databases are therefore less complex and more frequent, and READ and WRITE operations are equally present.
In OLAP databases, BI tools like Metabase or Periscope facilitate analytics, data visualization, and reporting. When a dashboard is loaded, multiple graphs are displayed, each executing its underlying, potentially complex query. Such a query might involve sorting, aggregating, and joining multiple large datasets, which results in numerous READ operations. Some data undergoes cleansing or generation overnight to be used in further analysis; most WRITE operations occur during this period, updating the database with new or modified data for the next analysis cycle.
Resource management
We need to delve deeper into resource management.
Concurrency and Connections
We expect users to use an application at any time of day, simultaneously. Because of this, OLTP databases need to be configured to support a high number of concurrent connections, which means tuning settings related to connection pooling, transaction isolation levels, and resource management.
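For instance, here is a minimal connection-pooling sketch in Python with psycopg2; the pool sizes and connection string are placeholder values you would tune to your own workload:

import psycopg2.pool

# keep a small pool of reusable connections instead of opening one per request
pool = psycopg2.pool.ThreadedConnectionPool(
    minconn=5,    # connections kept open at all times
    maxconn=50,   # upper bound; align it with the server's max_connections
    dsn="dbname=production",
)

conn = pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")  # a short transactional query
finally:
    pool.putconn(conn)  # return the connection to the pool for reuse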
OLAP databases need to be tuned differently, because the number of connections will likely be much lower than for OLTP databases. Database connections need to be held open for longer periods to support the execution of complex queries on large volumes of data; otherwise, queries would time out. When multiple users or applications concurrently submit analytical queries, resource contention can occur, leading to delays in query processing. This can also cause timeouts if queries cannot obtain the necessary resources within a reasonable time. Optimizing performance and the schema, and revisiting indexes, can help avoid timeouts.
Disk I/O (Input/Output) Performance
Disk I/O performance refers to the speed and efficiency with which data can be read from or written to a disk storage device such as an HDD or SSD. It is typically measured in terms of throughput (data transfer rate) and latency (response time).
Throughput can be expressed as:
Throughput = Data transferred / Time taken
Where:
- Data transferred is the total amount of data moved between the disk and the system during a specific time period (megabytes, gigabytes).
- Time taken is the duration needed for the I/O operations to complete, often referred to as latency (milliseconds, seconds).
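A quick worked example with made-up numbers:

# hypothetical numbers: 500 MB read from disk in 4 seconds
data_transferred_mb = 500
time_taken_s = 4
throughput = data_transferred_mb / time_taken_s  # 125.0 MB/s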
Optimizing disk I/O for transactional operations in OLTP databases is critical, because delays in transaction processing directly lengthen response times. Optimizing disk I/O also helps ensure high levels of concurrency and scalability. Minimizing disk access and maximizing transaction throughput can be achieved with index optimization, data normalization, and caching frequently accessed data in memory. Using SSDs for OLTP might be a good approach due to their lower latency, faster random access, and high throughput.
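As a small illustration in Python with psycopg2 (the orders table and customer_id column are hypothetical), an index on a frequently filtered column helps point queries avoid full-table scans:

import psycopg2

with psycopg2.connect("dbname=production") as conn:
    with conn.cursor() as cur:
        # an index on the lookup column minimizes disk access for point queries
        cur.execute(
            "CREATE INDEX IF NOT EXISTS idx_orders_customer_id "
            "ON orders (customer_id);"
        )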
For OLAP databases, using HDDs might be more suitable. HDDs offer higher storage capacities at a lower cost per gigabyte compared to SSDs. This will be advantageous when we need to store large amounts of data. Reads in OLAP databases are often sequential because we scan through entire datasets. Traditional HDDs perform sequential reads relatively well, especially when the data is stored in contiguous blocks on the disk. Although the operating system typically handles disk organization, regular indexing and vacuuming (if we are using PostgreSQL) can help optimize performance.
In these scenarios, the bottleneck is often the speed at which data can be transferred from the storage medium to the CPU, rather than the access performance of SSDs. The choice between SSDs and HDDs will likely depend on costs from your cloud provider.
Memory
In OLTP databases, user experience is better when the response time of transactional operations is minimal.
If frequently accessed data and database objects are cached in memory, fewer disk I/O operations are needed, and therefore we save time. Allocating memory to the OLTP database cache is what achieves this expected performance boost.
Memory allocation is necessary for handling concurrent transactions, query processing, transaction management, and temporary storage for intermediate results. The buffer pool can hold frequently accessed data pages in memory, which again minimizes disk I/O.
Some database systems support allocating memory for an index cache that stores frequently used index pages for fast data retrieval.
Fast, low-latency RAM modules, for instance DDR4 or DDR5, can enhance performance. Cloud providers often offer instance types with higher memory speeds and larger memory capacities, enabling optimized memory access performance.
In OLAP databases, some delay between executing a query and receiving the results is expected. The problems appear when queries start timing out, and the first answer is not necessarily throwing more money at the problem in the form of RAM upgrades. Instead, preloading frequently accessed data into memory during system startup or maintenance windows can improve query response times and system performance. For example, PostgreSQL has an extension called pg_prewarm that loads a given table into memory. Not all databases have this option; Apache Hive, for example, has no such mechanism.
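A minimal sketch of using pg_prewarm from Python with psycopg2; the table name is a hypothetical placeholder:

import psycopg2

with psycopg2.connect("dbname=warehouse") as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS pg_prewarm;")
        # load the table's pages into PostgreSQL's buffer cache
        cur.execute("SELECT pg_prewarm('public.daily_order_summary');")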
Disk spills occur when there is not enough memory to hold all the temporary data structures or intermediate results during query execution; in this case, the system writes the data to disk. Disk spills have a negative impact on query performance. Even though writing this data to disk is slower, at least we have control over the process if we do it intentionally: the spilled data serves as temporary storage and can be wiped when no longer needed. Compared to upgrading memory, purchasing additional disk space can be a more cost-effective solution.
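In PostgreSQL, for example, the work_mem setting controls how much memory a single sort or hash operation may use before spilling to disk. A hedged sketch, where the table, column, and value are placeholders you would adapt:

import psycopg2

with psycopg2.connect("dbname=warehouse") as conn:
    with conn.cursor() as cur:
        # raise the per-operation memory budget so large sorts spill less often
        cur.execute("SET work_mem = '256MB';")
        cur.execute("SELECT * FROM big_table ORDER BY created_at;")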
Columnar compression techniques can also reduce memory usage and required storage space. A common example is RLE (Run-Length Encoding), which replaces consecutive identical values with the value and a count, reducing the storage space required for repeating values:
# uncompressed form stores 11 values
[1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
# compressed form stores each run as a value with its number of occurrences
[1(4), 2(2), 3(5)]
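A minimal Python sketch of the same idea:

from itertools import groupby

def run_length_encode(values):
    # collapse each run of identical values into a (value, count) pair
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]

print(run_length_encode([1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3]))
# [(1, 4), (2, 2), (3, 5)]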
Another example is bit packing, which represents multiple values with fewer bits by packing them tightly together:
[true, false, true, true, false, false, true, true]
[10110011]
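A minimal Python sketch that packs booleans into a single integer, one bit per value:

def pack_bits(flags):
    # shift previous bits left and append the new bit at the end
    packed = 0
    for flag in flags:
        packed = (packed << 1) | int(flag)
    return packed

print(bin(pack_bits([True, False, True, True, False, False, True, True])))
# 0b10110011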
There are many other techniques that could be discussed in a separate article. Columnar compression is not just for extreme situations; it’s widely used in modern data warehousing and analytics environments. Even today, not all databases support columnar compression natively; some only offer it through third-party extensions. Apache Hive and Amazon Redshift have built-in support, and Google BigQuery automatically applies compression algorithms to your data to optimize performance and reduce storage costs.
We can allocate a smaller portion of memory to the database cache than we would for an OLTP database. A significant portion of memory should instead be allocated to support the processing of complex queries on large datasets.
CPU
CPU configuration for OLTP databases should include multiple CPU cores to handle parallel processing of incoming transactions. The CPU needs fast access to memory to support frequent reads and writes to the database.
CPU configuration for OLAP databases depends on the complexity and volume of the queries. Parallel processing means breaking tasks down into smaller parts and executing them simultaneously across multiple CPU cores or nodes, so configuring more cores with higher clock speeds enhances performance and efficiency.
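In PostgreSQL, for example, per-query parallelism can be tuned with the max_parallel_workers_per_gather setting. A hedged sketch, where the table name and worker count are placeholders depending on your hardware:

import psycopg2

with psycopg2.connect("dbname=warehouse") as conn:
    with conn.cursor() as cur:
        # allow up to 4 workers to cooperate on one query's parallel plan
        cur.execute("SET max_parallel_workers_per_gather = 4;")
        cur.execute("SELECT COUNT(*) FROM big_table;")
        print(cur.fetchone())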
Small-scale data warehouse solution
Let’s assume that we have a production database with 15 GB of data. A small-scale data warehouse solution requires modest hardware specifications. A potential setup is a server with 8–16 CPU cores, clocked at around 2.0–3.0 GHz, coupled with 32–64 GB of RAM. For storage, a basic setup might involve a RAID array of hard disk drives (HDDs), offering several terabytes of capacity. The network infrastructure needs to be capable of handling the data transfer between the database server and client machines. A self-hosted solution could be a good enough choice for your database needs at first; it should be suitable for handling moderate volumes of data and supporting basic analytics and reporting functionalities.
Once all interesting production data is replicated to the data warehouse, growth teams get very excited about pulling in more and more data. Keep in mind that your production data is also going to grow, and you will need to keep an eye on your metrics. I recommend doing regular checks every month or two to avoid surprises and sleepless nights. Sooner or later, you will need to tune your configuration and eventually migrate to another, more advanced solution or to cloud-based services for your data warehouse.
Terms
OLTP — Online Transaction Processing — transactional database
OLAP — Online Analytical Processing — data warehouse