ETL Design Pattern : E-LT-L
I was ramping up C# to create BI Framework and I hit upon the term “Design Patterns”. I decided to go through few patterns only, as I did not have patience to complete the book, also I assumed I don’t need it. Finally at the end of day, I was able to implement a couple of patterns in my BI Framework application. During this time, I wondered whether there are any pre-defined design patterns in Data Warehouses, ETL, Cubes, and Universes. My quest begins … In this blog, I will focus on ETL alone.
I started looking back at my projects to analyze if I had used any design patterns or whether I hit upon the recurring problems that could have been solved with alternate designs. What I found was quite interesting. I have been using best practices defined for each component all along. Then why were we slogging all the time. What were the problems we were facing every day? Now the right time! I decided to plunge deep in each component to find out exactly what I could have done differently during design which could have made my life easier. And that is where I came up with an ETL design pattern E-LT-L.
Before I explain what this E-LT-L means, let’s look at ETL first. ETL stands for Extraction, Transformation and Loading. In ETL we extract data from multiple data sources and transform the incoming data in a format compatible with data warehouse structure, followed by loading into data marts. Usually companies use some commercial tools like SSIS, BO Data Integrator Designer, etc… The developers then make full use of these tools and they end up using most of the functionality provided in these tools. And to an extent, this looks right too, as the companies have made an investment for this purpose.
Finally when companies, evaluate their ROI, the results are amazing. By following industry standards and using commercial ETL tools, training their development teams, the results do not look good. A new set of problems like performance problems, steep learning curves, fixing the parts that are not broken, buying additional hardware, etc. have come up. From ETL perspective, even though, most companies have a designated job server, they do not get a good performance. After a while in production, when there data marts grow in size or the size of incoming data increases, the performance of ETL job takes a hit. To resolve these issues, many things are done like increasing configuration of job server, make changes in database structure, use bulk loading options (favorite choice for techies), split the jobs, pushing few things like summarization over to weekends, etc.. This results in making the system more complex than it should be which has a direct hit on overall IT cost.
So exactly what went wrong?
This is where a new pattern E-LT-L comes into existence. Most of the recurring problems can be resolved using this design pattern.
E-LT-L stands for Extraction, Loading and Transformation, and Loading. This basically suggests that, once the data is extracted, instead of applying transformations (T) in the staging area, load this data into data mart; and then apply transformations there (LT). Since, incoming (raw) data is in data mart already, it would make more sense to use database objects (stored procedures) for transformations instead of row-based transformations available in ETL tools. Using database based transformations would resolve most of the problems.
Effectively, this pattern calls for BULK loading and transformations at correct place without moving huge amount of data around.
This may sound strange and many people would agree to debate. To prove the point, I would present few scenarios and let you decide what is good for your implementation.
Scenario 1: Let’s assume you have 1 million incoming records that need to do look up for say customers. No matter what tool you use and how much you configure it, it has to run some sql against customer master, which is typically huge in may DW installations, to get the customer code. This will become a bottleneck as the sql would be executed multiple times. Also to add to the woes, the “customer code” column value has to travel from database server to job server and is stored in the placeholder (variable) in incoming row. Instead if you code the lookup transformations, in stored procedure, with one sql you can update all the rows by simply joining the staging table with customer master table. By all counts, the performance of this sql cannot be beaten.
Scenario 2: We have incoming product master for a large retail chain. We need to implement slowly changing dimension type 2 here (insert new records and update existing records if already present. Any ETL tool would implement this transformation row by row and it can get painfully slow. This transform can be easily developed in stored procedure using MERGE sql statement. For more experienced developers this can be made meta data driven.
To summarize, the transformations done using stored procedures help as follows:
- It does not move data between database servers and job servers. We get a good boost in performance.
- The data is processed in bulk.
- Database engine (Oracle,SQL Server) is designed to do this job in the most efficient manner than any third party ETL tool can do.
- It is always easy to fix/debug the stored procedure than some transform in ETL tool.
- Small learning curve for developers to ramp up on the ETL tools.
- Deployment of ETL packages (which is surely a problem in ETL tools). It takes a lot of effort.
I am sure, many of you would be thinking then why we shall use any commercial tool for ETL at all. These tools have their own significance. We would need a commercial ETL tool to achieve other goals of ETL, such as:
- Defining a workflow where we decide upon what tasks shall execute and in what order. What to do if some task fails.
- Executing tasks in parallel.
- Even though a stored procedure is created to transform some data, it needs to be called from an appropriate place in ETL.
- Use the logging/notifications capabilities of ETL tool, which usually are very efficient and simple.
- Use the scheduling and other features readily available in commercial ETL tools
What this E-LT-L pattern suggests is the design of ETL architecture. By having good ETL tool and good design of ETL, we can write efficient and manageable jobs.
If you have any questions or comments, you can reach me at jagdishm@aditi.com.
Hi could you suggest books that might help me better understand ETL patterns. I’m looking for best practises in defining formats and standards for extracting data from various ERP systems.
Hi do you have any suggested readings to help me better understand various ETL patterns ? I am attempting to define standards and a format for extracting data from 3rd party ERP’s and would like to know if you have any helpful ideas on best practices ?
@Sonny
Hi Sonny,
Ralph Kimball’s book on ETL is a good source to start with. He has explained the standards and document templates.
Hope this helps.
Jagdish Malani
Hi,
I need help designing a data migration tool. In which my source can be any form(Excel, DB, flat file, etc) for now, i am taking an excel file with few tables and columns. I want to map columns from source to columns on teh destination. How do i store this mapping in teh DB?
First of I would say good post, the only thing that i feel re-usability of the code (transformations/jobs) are missing here , here is a use case :
1. If you are working on a big project with a dynamic company then you will find problem with stored procedures approach , because similar data might be coming from different DB engines/source ( files) . Even if you choose to work in data mart but still there is a issue that because of cost issue company will move to some other DB engine (MART) in next few yrs then additional cost will come into picture .
In that scenario , my recommendation would be to design ETL tool transformation/jobs such that you create a design pattern out of them in the most efficient manner and reuse. Hence it will be cost effective and good in performance.
~Diwakar
Hi,
This sounds like a good idea however what will be the impact on the design of the data marts as they need to accommodate all the leaf level and or staging type of tables. Data marts which are usually supplied with only clean data will contain the raw data as well since the data is directly loaded from source. Shall we still need a DW or DMs alone will suffice? Have you come across any other ETL desgin pattern than this?
Once again, appreciate your idea and would like to hear more from you.
Regards,
Ashish
@Jagdish Malani
Hi jagdish,
Usually Staging area stores a days data or the data that ws extracted from source however was not loaded in target due to some failure.
In such case if we keep the staging area on the target server, then we may not require transformation after load.
Secondly, would applying transformation after load be really a good option?
If your target table has millions of records and you start updating them even though in batch, it will definitely eat up you time.
Hi Shamit,
Sorry for replying late. Was running busy with work.
Coming to your point, The purpose of staging area is to keep the incoming data at some designated place so that while transforming/validating/etc, we do not need persistent connection to source. The point to remember is that this data is still comsidered raw and not fully massaged. As you said, we do not store more than a days (or some specific period) unless required for some reasons.Thus, it does not matter where is your staging area, it is not usable from OLAP perspective. So you need transformation anyway.
For your second point, I mean staging area should be as close as possible to DW. In that case, we have flexibility to implement transformations in SQL. Let’s consider a scenario: we are using say SSIS, where we need to do SCD type 2 for say 1million incoming records. Even if your job server and DB server are on the same box, it will be slower. You would resort to lookup transformations and some so called advance stuff in SSIS to tune it.
With SQL, you would have more control and best performance.
Hope this helps. Thanks
Jagdish Malani
Hi,
Thanks for the detailed comments (although little bit blunt).
Let me explain what I meant to say in this post:
<<<<<<<<< where you are endorsing an approach of stored procedure over an use of ETL tool/components, in my experience would eventually lead to ridiculous performance problems, extremely high maintenance cost over the years and the solution would essentially have the same limitations as are applicable to a stored procedure, which are enormous.
>>>>>>>>>>>
I am in no way endorsing the SP approach here. The point I wanted to bring up here is
1. bring in the raw data as near as possible to the data warehouse database, so that we have a choice of using SP if required.
2. we shall be able to get away from row by row transformations if required. For instance, slowly changing dimension type 2. If that creates performance problems, we shall have an option to use Merge SQL instead. Even Ralph Kimbal suggests using it if there are lot of performance problems. You can find his article here:
http://www.kimballgroup.com/html/08dt/KU107_UsingSQL_MERGESlowlyChangingDimension.pdf
At the same time, I agree with your view that SPs have limitations. This the the reason, I called this approach as pattern. It is not a guideline that we shall follow.
<<<<<<<<<<<<>>>>>>>>>>>>>
You are tlking about ETL solutions design vs SP Programming. This is not the essence of this post. ETL tools allows for parallel execution which can not be done with SPs. From a maintenance and cost perspective, I would prefer to use Merge SQL statement for SCD type 2 (if I can use it) over ETL tool approach. I would not require ETL specialists working closely with DBAs and other third party vendors. Also consider the scenario, where a compnay started with say SSIS and later they decide to move the ETL implementation to Business Objects.
Moreover, the reasons (pretty blunt) that you have mentioned above are in the real word very real. So we need to be practical and we shall also not be monolithic in our thinking.
Finally there were my thoughts which I posted. I leave it to the people to think about it when they are implementing their ETL solutions. I am in no way proposing rule of thumb as you have thought about it.
Please email me if you have any further comments at jagdishm@adit.com
Thanks,
Jagdish
Hi,
This was a good article. You have pointed most of the common problems that we face in Datawarehouse load.
However I am a little confused on the use of Stored Procedures. Most of the time we have to implement some difficult logic which is little easier to do using an ETL tool compared to writing procedures (maybe that is the primary purpose of ETL…to make a developer’s life a little easy
…. ). Also if we write a stored procedure we may, most of the time, end up using cursors which again does a row by row processing rather than bulk processing. This when we compare to a datawarehouse load (where the number of records is too large), using a cursor can be a bottleneck as it needs a large memory. Don’t you think this problem is better handled by the ETL tool which has its own dedicated memory and some other advance concepts like partitioning and parallelism?
One of the examples we were given was, sorting a large number of records would be better done through a powerful ETL tool than doing it on the database. This is significant as sorting plays an important role in our development.
Another problem we faced was, as the datawarehouse and data marts grows up, even querying a database takes up a long time. In that case if we use ELTL method as you described, won’t it be a little difficult to update (transform) the records in the database after loading them when selecting those records itself takes ages? This problem won’t be that significant if we use ETL approach as we are just loading the database not updating it (and as you said bulk loading feature of ETL is very good at this).
Today we have some very fast databases which are specifically designed for high volume data (like Teradata, Kognitio). This may prove your point too.
As you pointed out, this is a good topic to debate on.
Regards,
Praveen
Hi Praveen,
I am not suggesting to get away from ETL tool. The point I am driving at, is to bring staging area nearer to DW area. That would give us an option to move data using stored procedures as well if the need arises. So this in my opinion is driven by needs and the infrastructure available. Let’s assume, we have to laod huge amount of data. Now I have these options to laod data:
a. Using ETL tool (which may execute a job from another server) and do the transformation and loading.
b. Along with using ETL tool, in some cases I may prefer to move data from Staging database to DW database using say stored procedure.
c. Use mix of both
So what needs to be done is being done. Only the way it is implemented is different. If there are any updates (loading/updating) required in DW database, they would be done by ETL tool or some SP or some SQL.
Also, I agree, we have really powerful ETL tools available these days. And in my opinion, partitioning, bulk loading etc features of ETL tools are available in databases already. So we have to try various options and test the results. Based on that we can make a decision how to implement the solution.
Thanks,
Jagdish