If you have dealt with any data platform in the last few years, you have likely heard about the movement from traditional ETL (Extract Transform Load) to ELT (Extract Load Transform). The name ELT is self-explanatory as the sequential order of tasks is switched such that loading comes before the transformation, but it takes a little more investigation to understand why the movement has taken place and how it is even possible. To illustrate the evolving architecture of data platforms in the last decade, we will take a look at the difference between Costco and McDonalds.
The Old Model (ETL)
In the past, the data warehouse has been a place to consume data. Whether it was viewed through a business intelligence tool, consumed through an API, or displayed in some other application, the data was uniformly structured for easy consumption. Very little (if any) modification was done to the data within the actual warehouse. It looked very similar to a fast food restaurant. In most places in America, you can reliably go to McDonalds and get a Big Mac that is generally the same as a Big Mac you would get at any other McDonalds. Do you know why? It’s because the Big Mac is only lightly prepared at the actual restaurant. Most of the work is done before the meal arrives, and the Big Mac you have in Los Angeles might even come from the same place as the one you have in Miami. There is a factory somewhere that acquires the food, turns it into a Big Mac, freezes it, and sends it to each individual McDonalds.
Why is this a bad thing? Well it isn’t inherently bad, just like the old data warehouse model isn’t inherently bad. Much like McDonalds, data warehouses have traditionally receieved their data already prepared. The reception is the Load part of ETL, the final step. But where do the first two steps happen? Again like McDonalds, at the “factory”. ETL tools, such as Oracle Data Integrator or SAP, target all kinds of data sources (databases, flat files, etc.). They first Extract the data, then Transform it, before Loading it into the data warehouse.
The transformation process is arguably the most important part of the process, and also the most computationally expensive. Therefore, these ETL tools have to be able to access vast amounts of compute, and usually run on servers optimized for those types of tasks. The data warehouse, on the other hand, is not constructed to do heavy computation. It is optimized for storage and is not easily scalable. This made sense when companies with lots of data were buying their own servers, and managing their own data centers. It could take months to acquire a new server, and the database systems used for a data warehouse were not built to be distributed across multiple servers.
This setup is logical, given the constraints of the time, but it leads to potentially stale data and long wait times between data creation in a source system and its appearance in BI tools that consume from the data warehouse.
The New Model (ELT)
In direct contrast to McDonalds, Costco offers food (and lots of other items) at a variety of stages of production. You can buy raw meat, or you can literally eat a fully prepared meal at tables in the front. Depending on what type of consumer comes to the store, they will choose to buy raw goods or buy prepared food. Some may even do both at different times.
The old ELT model has evolved in direct correlation to the rise of cloud computing. When we look back at the constraints that led to the setup of the traditional data warehouse, it is apparent that many of them have been solved by cloud provider managed services. First of all, it is much easier to scale infrastructure up and down, and you can even do it automatically based on the observed load of existing machines. What used to take months now takes seconds.
Secondly, the database systems have been optimized to take advantage of distributed architecture, and employ techniques such as massively parallel processing, result caching, and concurrency scaling. This enables virtually unlimited horizontal scaling of your data warehouse, depending on the computational load and amount of data. Because of the increased computing power in the data warehouse itself, the transformation can be done within the warehouse, leading to the pattern of Extract from sources, Load into the warehouse, then perform all transformation there.
The advantage of doing the transformation within the warehouse, much like when shopping at Costco, is that data becomes available for consumers in its various forms as soon as it is ready. An application or BI tool may want to see raw source data that is fresh, while large summary queries are still being executed. The big idea is that bottlenecks are removed: you never have to wait for anything when it is already ready to be consumed. This functionality can be further improved by implementing techniques such as a lambda architecture.
One final way that Redshift, Snowflake, and other columnar databases are similar to Costco is that they cannot store unstructured data (i.e. Costco sells raw chicken but you can’t buy the live chicken). The tradeoff for this is that columnar data is very fast and efficient to query, and can be compressed due to the fact that data in a column is of a uniform type.
Questions & Thoughts
If you have any questions or thoughts, please contact me on Twitter @HashedDan!