Archive

Posts Tagged ‘meta data’

Future of ETL - Metadata driven

March 26th, 2009

Some time back, I was called to design ETL for a mid-sized enterprise. I had to deal with various issues like multiple data sources, not-so-clean data, various data validations, data cleansing needs, ETL time window within which the ETL job shall finish, changing requirements, and the list goes on. On the top of that I had very tight timelines. Then with the experience in my previous BI projects, wherein I dealt with ETL with different tools (BO, SSIS), I came up with a list of goals, which I thought shall be there in any standard ETL implementation and must be implemented here too. Here are some of the main goals:

  • Performance
  • Flexible enough to handle changing requirements
  • Least maintenance
  • Automation
  • Modularizing processing
  • Data quality
  • Recoverability of ETL job
  • Auditing
  • Logging
  • Notification
  • Ability to reprocess rejected records, files, or any other data source
  • Remote administration of ETL job

While going though this list, I realized that all these features were there in my earlier ETL implementations too. This fact led me to a list of questions like “How I am going to improvise on my earlier designs”, “How can we take ETL to a new level”. Before I set out to find answers to these questions, I asked myself a simple question “Why”. Why shall I improvise my ETL design? What is the need of this at the first place? I found the answers early. Surprisingly the answers were “Reduced development time, with increase quality results in reduced cost for customers”. This can be a good selling point for sales team too, that I would cover in my next blog.

So coming back to original questions as how to improvise ETL design, I started with comparing different ETL tools that I have worked with in the past, looked at how these tools work, how these tools handle metadata, how they eventually run the job. Interestingly, I found that these tools are good enough to meet standard ETL requirements. But these tools do not make good use of metadata. Even if they use metadata at few places, there is no intuitive interface available. Thus I got my first lead and I decided to make use of metadata as much as possible. Hence, quest for “Metadata driven ETL” starts….

Before I tell you what Metadata driven ETL is, let me tell you “what” it is not. It does not generate the ETL program on the fly as you may expect. Instead metadata driven ETL uses an existing program that executes various tasks by dynamically reading parameters from metadata datastore. This datastore can be an xml file, a separate database repository, few tables in DW database.

WHAT IS METADATA: Any data that is used to execute a transform/operation is called metadata. For e.g. from transformation perspective, in slowly changing dimension transformation type II, the target columns that are considered as business keys is nothing else but metadata. Similarly from operations perspective, the list of database tables that must be cleared before the extraction starts, defines metadata. When we use ETL tools’ standard transforms, or hand-code ETL, we use metadata in one way or the other.

USING METADATA:
How shall we use this metadata to our advantage? The best approach to start with metadata driven ETL is to:

  1. Identify different operations in ETL that can be automated. There are certain operations that are carried out during ETL which can be automated. These operations may involve multiple tables. Usually these operations are put into some script or SQL and are executed as one operation. Let’s look at an operation where we clear clearing staging tables before the extraction starts. Usually staging tables are designed as follows:
    1. Create all staging tables in a separate database
    2. If staging area is same as data mart, create a specific schema like “STAGING” or prefixed the table names with something as “STG_”

Whichever way we go, this list of tables can be read from database system tables (metadata). Using this metadata, this operation can be automated. On the similar lines, we can create specific metadata to be used for other ETL operations. This would make the ETL development easier and faster and lesser probabilities of error. Here are the other sample operations that can be made metadata driven:

  • Moving rejected records to rejected area
  • Reprocessing rejected records
  • Data cleansing: For all incoming data, based on unique keys identified, duplicate records need to be deleted. For this, there isn’t an in-built transform in ETL tools. This can be made metadata driven. ETL would read unique keys for each incoming business entity and dynamically creates and executes it. Thus, developer would just need to specify the unique keys in metadata datastore. This again would bring down the development effort.
  • Notification: Upon completion of an ETL job, it can read metadata (from address, recipients addresses, mail server, subject, body) and this process can be made metadata driven. We all do this already. What is different here is how we implement it.
  • Auditing and logging
  • Identify transformations that can be made metadata driven: There are lots of transformations that can be made metadata driven. We need to identify those transforms and make it metadata driven. For example, let’s see the slowly changing dimension type II. This transformation is heavily used in any ETL program. In this transform, incoming records are compared for pre-defined key columns to the existing records in data mart. If the records exist in data mart they are updated, otherwise they are inserted. For every SCD II, the transform is to be coded. This transformation needs following metadata:
    • Source and target table names.
    • Business keys column names in source table.
    • Key columns in target table.

With the tools like BO or SSIS, this transform is kind of hardcoded. To make this transform metadata driven, read the above mentioned metadata and dynamically create “Merge SQL statement” (available in DB2, SQL Server 2008) in a stored procedure. This stored procedure would be called from the ETL at appropriate places, for all SCD-II transforms. This metadata driven SCD-II transform would give you the best performance. Let’s consider that the staging area is within the data warehouse database or in a separate database on the same server. If some ETL tool’s transform is used, the data is processed in batches of some pre-defined size. ETL engine would apply this transform and fire appropriate SQLs for every incoming row in the batch which is definitely slower. However in metadata driven transform which is a SQL operation, it processes all the data in bulk and we get increased performance.

IMPLEMENTING METADATA: When we use any ETL tool for transformations or for other purposes, we are in fact using metadata. The difference lies in implementation and the usage of metadata. How and where this metadata is stored? How do we access this metadata at run time? Is there any single interface available to access metadata? When we use ETL tool for some transformation, this metadata is stored in the proprietary format. We cannot simply go and change it directly. We need to use the ETL tool designer to access it. Thus, after the ETL program is deployed, if there is a change in some transform, we need to make the change through ETL tool designer and re-deploy the ETL program. So if we design the ETL in a manner wherein we use our own metadata in our own way, we would end up with a very good ETL architecture. This ETL architecture would eventually evolve into a framework. This framework can be reused across multiple ETL implementations which bring down the estimates and thus the cost significantly. This would give an added advantage to us over our competitors.

With this approach, you would get following advantages in metadata driven ETL:

  • Increase performance
  • Reduced development time
  • Lesser errors
  • Easier and least maintenance which translates into cost savings for customers in long term (added value)
  • Once the basic shell of metadata driven ETL is created, it would reduce the learning curve for the team members.

Then, I started with this mindset, and I was successful enough. Now I am in the process of refining the ETL architecture.

AT LASTL, at the same time creating metadata driven ETL is in no way suggests us to get away from the ETL tools. ETL tools have their own strengths. We would actually be architecting ETL in a different way to make it metadata driven. Metadata driven ETL pays rich dividends during maintenance phase also as it takes lesser effort and allows for quick deployment.

LOOKING FORWARD TO: Some of the above mentioned metadata is already captured in ETL tools available today, but still they need to evolve more. So until, the ETL tools support 100% metadata driven development, we as architects, shall design our ETL in a certain way to fill in the gap. I do hope commercial ETL tools vendors would be working on this and we will soon have next generation of ETL tools.

Share and Enjoy:
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • description
  • LinkedIn
  • Live
  • MySpace
  • Slashdot
  • Technorati
  • TwitThis
  • description
  • E-mail this story to a friend!
  • Print this article!

Jagdish Malani Architecture & Design, BI - Business Intelligence , , , ,