Archive

Author Archive

ETL: How to handle bad data

April 7th, 2009

During any ETL design, we implement various functionalities like validation, auditing, notification, job recovery, job logging, data cleansing, handling bad data, etc. I am going to talk about handling bad data in this blog. At a top level, ETL design allows for bad data to be rejected and sent over to the appropriate users in the form of files. But, in my opinion, there is more to this than meets the eye. As an ETL architect, our responsibility does not end there. So before I get into details of how shall we handle bad data, let me tell you how what makes the incoming data bad enough to be handled. Following are the few reasons that generate bad data:

  1. A set of business rules are laid out that define whether the incoming data is good or bad. Let’s consider the sales record where the cost of the product must be present. If the cost contains a null value or a negative number, the sales records would be considered bad.
  2. Any data that would not satisfy the referential integrity in data warehouse database. This usually happen in case of missing inter-dependent data. If the incoming data contains references to some other data, which could not be loaded for some reason, this incoming data becomes bad and shall not be loaded into data warehouse database. A typical example would be a retail chain maintaining product master in a centralized database. And the sales data is generated across different POS terminals. So during ETL, if corporate database is down for whichever reason, ETL would not be able to load new products. But at the same time, there is sales generated for these new products. Thus with no product in product dimension (master) the sales record is considered bad at this moment and would not be loaded into data warehouse database.
  3. Missing business keys: If mandatory data is missing in the incoming data then that data is considered bad. This usually does not happen if the data is sourced from other relational databases. While sourcing data from files, there is every possibility that data may go missing, even if the format of the incoming feed file is already decided.
  4. Missing data: There may be many cases when the data is missing in the incoming data. That logically makes the data bad. For example an incoming product feed file contains record with no product code and product description. In this case the data is considered invalid.

Now, as we have seen what generates the bad data, we must understand that data cleansing does not make any data bad. Data cleansing is applied only on good data. So before we start data cleansing, a mechanism needs to be put in place to identify good data from the bad data.

Once this bad data is identified, it is usually stored in a separate area called “Rejection Area”. This rejection area can be in a separate schema in the same database that contains staging schema or it can be a separated database altogether. The structure of the rejection area (table structure) is similar to that of staging area with the additional few columns. These additional are required to store metadata about rejected data.

As an ETL architect we need to design our ETL to provide following functionality:

  1. Ability to reprocess this bad data whenever required. The data that could not be loaded due to missing references is usually re-processed when that missing data is loaded into the data warehouse database. Consider the case where due to missing product master, sales data was rejected. Later, when the latest product master is loaded, this bad data (which wasn’t really bad) which could not be loaded needs to be reprocessed. Otherwise the sales summary report would not be accurate. Another factor when the bad data needs to be reprocessed is change in business rules. If there is a lot of rejection due to strict validation rules, customer may decide to change these validation rules so that large amount of data is not rejected.

    We can automate this functionality by adding few columns in the rejection area table:

    1. Reprocess_flag (Y|N): This flag when set, determines if this record needs to be re-processed. Moving these rejected records from rejection area to staging area should be automated. This would help customers in various ways such as reduced the dependency on IT staff, lower maintenance cost.
    2. Reprocess_Job_Id: Usually the metadata about each run of job is maintained. So to be able to audit, when the rejected record was reprocessed, the job id is stamped into this column.
    3. Active_Flag: Once the record marked to be reprocessed, is copied over to staging area, this record is made inactive as this record is not valid anymore. It may happen that this record may fail validations again and end up in rejection area, but it would be considered as another record. So in essence, there is only one instance of the rejected record in rejected area. This implies that inactive records cannot be chosen to be re-processed.
  2. Ability to reprocess incoming data: Many times due to various reasons, the same data that is already processed is fed again into the ETL. This requires us to identify the corresponding records in rejection area and mark them as inactive as these records are not valid anymore. The incoming records would need to be validated as per the current business rules. The records in staging area are compared with the records in rejection area against the business keys and for matching records active_flag is set to ‘N’. This process of marking existing rejected records as inactive is usually automated.
  3. Ability to mark invalid data: Sometimes the business keys in the incoming records are null. These records are then eventually end up in the rejection area and are active. These records, no matter how many times are reprocessed, they would end up in rejection area again. At the same time, the incoming records can be matched with these records. Thus these records should be marked as invalid. For this add following column in the rejection area table:
    1. Valid_flag (Y|N): This flag must be set to ‘Y’ for missing business keys. This implies that this record cannot be re-processed ever.

    The important fact to be noticed here is that the responsibility of ETL architect does not end here. By designing the ETL, to mark the rejected data as invalid does not solve any business problems. The incoming data must get loaded into the data warehouse database. So it becomes very important for BI architect to talk to end users and tell them the impact of this. The end users may need some tweaking in the source systems, but if they need accurate reports they must send the accurate data.

What I have explained is just one way of designing rejection area (tables containing bad data). What I have discussed is the concept of handling bad data. There can be different ways of implementing the above mentioned functionality. Once ETL and database is designed appropriately, an interface must be provided to the end users that allow users to do following:

  1. Select any rejected table and mark rejected records to be reprocessed.
  2. Select any already processed data for reprocessing. This is simple if the incoming data is in the form of feed files. This gets little bit complex when the data is extracted from existing databases. Typically in large ETL systems, the staging area is archived. Depending on the needs of the customer, this goal can be achieved.
  3. Look at the invalid data and analyze it to be able to fix the source systems accordingly.
  4. Execute the ETL job after selecting the rejected records or selected source data for reprocessing. This would depend on various other factors such as ETL time window, the need to reflect the correct data, the time of ETL run, etc.

Last but not the least, as an ETL architect our goal shall not be just to implement some logic to handle bad data. Our main responsibility is to make this whole thing as automated as possible. Automation would provide various benefits such as reduced development time, lesser errors hence increased quality and finally the reduced cost for the customer.

If you have any questions or comments, you can reach me at jagdishm@aditi.com.

Share and Enjoy:
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • description
  • LinkedIn
  • Live
  • MySpace
  • Slashdot
  • Technorati
  • TwitThis
  • description
  • E-mail this story to a friend!
  • Print this article!

Jagdish Malani Architecture & Design, BI - Business Intelligence , , , , ,

Future of ETL - Metadata driven

March 26th, 2009

Some time back, I was called to design ETL for a mid-sized enterprise. I had to deal with various issues like multiple data sources, not-so-clean data, various data validations, data cleansing needs, ETL time window within which the ETL job shall finish, changing requirements, and the list goes on. On the top of that I had very tight timelines. Then with the experience in my previous BI projects, wherein I dealt with ETL with different tools (BO, SSIS), I came up with a list of goals, which I thought shall be there in any standard ETL implementation and must be implemented here too. Here are some of the main goals:

  • Performance
  • Flexible enough to handle changing requirements
  • Least maintenance
  • Automation
  • Modularizing processing
  • Data quality
  • Recoverability of ETL job
  • Auditing
  • Logging
  • Notification
  • Ability to reprocess rejected records, files, or any other data source
  • Remote administration of ETL job

While going though this list, I realized that all these features were there in my earlier ETL implementations too. This fact led me to a list of questions like “How I am going to improvise on my earlier designs”, “How can we take ETL to a new level”. Before I set out to find answers to these questions, I asked myself a simple question “Why”. Why shall I improvise my ETL design? What is the need of this at the first place? I found the answers early. Surprisingly the answers were “Reduced development time, with increase quality results in reduced cost for customers”. This can be a good selling point for sales team too, that I would cover in my next blog.

So coming back to original questions as how to improvise ETL design, I started with comparing different ETL tools that I have worked with in the past, looked at how these tools work, how these tools handle metadata, how they eventually run the job. Interestingly, I found that these tools are good enough to meet standard ETL requirements. But these tools do not make good use of metadata. Even if they use metadata at few places, there is no intuitive interface available. Thus I got my first lead and I decided to make use of metadata as much as possible. Hence, quest for “Metadata driven ETL” starts….

Before I tell you what Metadata driven ETL is, let me tell you “what” it is not. It does not generate the ETL program on the fly as you may expect. Instead metadata driven ETL uses an existing program that executes various tasks by dynamically reading parameters from metadata datastore. This datastore can be an xml file, a separate database repository, few tables in DW database.

WHAT IS METADATA: Any data that is used to execute a transform/operation is called metadata. For e.g. from transformation perspective, in slowly changing dimension transformation type II, the target columns that are considered as business keys is nothing else but metadata. Similarly from operations perspective, the list of database tables that must be cleared before the extraction starts, defines metadata. When we use ETL tools’ standard transforms, or hand-code ETL, we use metadata in one way or the other.

USING METADATA:
How shall we use this metadata to our advantage? The best approach to start with metadata driven ETL is to:

  1. Identify different operations in ETL that can be automated. There are certain operations that are carried out during ETL which can be automated. These operations may involve multiple tables. Usually these operations are put into some script or SQL and are executed as one operation. Let’s look at an operation where we clear clearing staging tables before the extraction starts. Usually staging tables are designed as follows:
    1. Create all staging tables in a separate database
    2. If staging area is same as data mart, create a specific schema like “STAGING” or prefixed the table names with something as “STG_”

Whichever way we go, this list of tables can be read from database system tables (metadata). Using this metadata, this operation can be automated. On the similar lines, we can create specific metadata to be used for other ETL operations. This would make the ETL development easier and faster and lesser probabilities of error. Here are the other sample operations that can be made metadata driven:

  • Moving rejected records to rejected area
  • Reprocessing rejected records
  • Data cleansing: For all incoming data, based on unique keys identified, duplicate records need to be deleted. For this, there isn’t an in-built transform in ETL tools. This can be made metadata driven. ETL would read unique keys for each incoming business entity and dynamically creates and executes it. Thus, developer would just need to specify the unique keys in metadata datastore. This again would bring down the development effort.
  • Notification: Upon completion of an ETL job, it can read metadata (from address, recipients addresses, mail server, subject, body) and this process can be made metadata driven. We all do this already. What is different here is how we implement it.
  • Auditing and logging
  • Identify transformations that can be made metadata driven: There are lots of transformations that can be made metadata driven. We need to identify those transforms and make it metadata driven. For example, let’s see the slowly changing dimension type II. This transformation is heavily used in any ETL program. In this transform, incoming records are compared for pre-defined key columns to the existing records in data mart. If the records exist in data mart they are updated, otherwise they are inserted. For every SCD II, the transform is to be coded. This transformation needs following metadata:
    • Source and target table names.
    • Business keys column names in source table.
    • Key columns in target table.

With the tools like BO or SSIS, this transform is kind of hardcoded. To make this transform metadata driven, read the above mentioned metadata and dynamically create “Merge SQL statement” (available in DB2, SQL Server 2008) in a stored procedure. This stored procedure would be called from the ETL at appropriate places, for all SCD-II transforms. This metadata driven SCD-II transform would give you the best performance. Let’s consider that the staging area is within the data warehouse database or in a separate database on the same server. If some ETL tool’s transform is used, the data is processed in batches of some pre-defined size. ETL engine would apply this transform and fire appropriate SQLs for every incoming row in the batch which is definitely slower. However in metadata driven transform which is a SQL operation, it processes all the data in bulk and we get increased performance.

IMPLEMENTING METADATA: When we use any ETL tool for transformations or for other purposes, we are in fact using metadata. The difference lies in implementation and the usage of metadata. How and where this metadata is stored? How do we access this metadata at run time? Is there any single interface available to access metadata? When we use ETL tool for some transformation, this metadata is stored in the proprietary format. We cannot simply go and change it directly. We need to use the ETL tool designer to access it. Thus, after the ETL program is deployed, if there is a change in some transform, we need to make the change through ETL tool designer and re-deploy the ETL program. So if we design the ETL in a manner wherein we use our own metadata in our own way, we would end up with a very good ETL architecture. This ETL architecture would eventually evolve into a framework. This framework can be reused across multiple ETL implementations which bring down the estimates and thus the cost significantly. This would give an added advantage to us over our competitors.

With this approach, you would get following advantages in metadata driven ETL:

  • Increase performance
  • Reduced development time
  • Lesser errors
  • Easier and least maintenance which translates into cost savings for customers in long term (added value)
  • Once the basic shell of metadata driven ETL is created, it would reduce the learning curve for the team members.

Then, I started with this mindset, and I was successful enough. Now I am in the process of refining the ETL architecture.

AT LASTL, at the same time creating metadata driven ETL is in no way suggests us to get away from the ETL tools. ETL tools have their own strengths. We would actually be architecting ETL in a different way to make it metadata driven. Metadata driven ETL pays rich dividends during maintenance phase also as it takes lesser effort and allows for quick deployment.

LOOKING FORWARD TO: Some of the above mentioned metadata is already captured in ETL tools available today, but still they need to evolve more. So until, the ETL tools support 100% metadata driven development, we as architects, shall design our ETL in a certain way to fill in the gap. I do hope commercial ETL tools vendors would be working on this and we will soon have next generation of ETL tools.

Share and Enjoy:
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • description
  • LinkedIn
  • Live
  • MySpace
  • Slashdot
  • Technorati
  • TwitThis
  • description
  • E-mail this story to a friend!
  • Print this article!

Jagdish Malani Architecture & Design, BI - Business Intelligence , , , ,

IT: Competency Building

March 26th, 2009

IT organizations were doing well until sometime back. Subprime crisis led to fall of many banks in US and before we know, we were in recession. Though nobody knows when this recession would end, but we all hope that it would end soon. Until this happens, IT organizations are facing typical problems that are often seen during recession times such as rising bench strengths, increased costs, and lower revenues. To make the matters worse, there is uncertainty about when the economy would turn around. Hoping that recession would end soon, organizations shall use this slowdown to their advantage and should prepare themselves for the good times. First thing that organizations are trying to do is to increase the employees’ utilization by effectively training employees on newer and in-demand technologies. They are reluctant to hiring at any levels. Thus organizations are focusing on competency building internally. So in the light of cut-throat competition, building competency has to be aligned with organizations strategy derived from sales planning and operations management.

Let’s see how organizations shall go with competency building . . . .

  1. IDENTIFY THE GOALS: Competency building exercise needs some retrospection before organizations take the first step. The reason being, this exercise would have been done even during good times. So organizations must examine the earlier efforts meticulously. They must find out the success rate and impact of it in various projects. Otherwise following the same approach, would yield same results. If organizations are struggling with the competency building since good times, this is an indicator that something wasn’t done right earlier. And during slowdown, it is critically important for organizations to take the right step, or they risk ending up in wrong direction altogether. Going forward, when the economy picks up, organizations that are strategic would have the edge over their competitors. Organizations must approach competency building as follows:

    1. Identify the areas in which the competency needs to be built:People who drive sales strategy (typically senior management) and people who are responsible for operations need to come together and align their goals. The mistake that few organizations do is that they run the competency building in silo. That way, they are never able to build the competency that is required for driving sales growth.
    2. Define the extent of competency building:They must work out the approximate target numbers in each level in pyramid. These numbers must be tied to the sales targets both in the short term as well as long term.
    3. Expectation Management: The next most important factor that defines the success in competency building is to have the right perception. Training employees in a new technology does not make them experts in one go; instead they get a quick and timely head start in the new area. The fact that every organization while hiring, look for a specified years of experience in a particular stream is true across all verticals and at all levels. If this wasn’t true you would have seen advertisements like “We need 5 smart people with or without prior experience for all levels”. Having said so, this does not mean competency cannot be built. It can be build if organizations define a good strategy and ensure its strategy meets the overall organizational goals. For e.g. management must provide for specific follow-up trainings and live projects experiences (even internal project would suffice) and see this exercise through.�
       
  2. ACHIEVE THE GOALS: Once the organizations figure out their goals clearly as stated above, they must ACT to achieve the specified goals. Organization can run different specific programs to achieve these goals. One way would be to run training programs in the areas in which they need to build up the competency. Another would be to use employees available on bench to develop internal tools which are required in order to improve the organizations’ efficiency internally. This shall be done by remaining focused and ensuring that these programs are aligned with their overall goals. Organizations must work to ensure following:

    1. Focused trainings: The training programs should be very much focused. Organizations must do their due diligence in doing gap analysis. They shall consider the business lines they are in and also explore this opportunity in expanding their horizon in different areas. This gap analysis can be done by:
      1. Analyze earlier projects: Organizations must spend time in analyzing earlier projects from different perspectives. They must find out what went wrong and what are the specific areas they must improve upon.
      2. Analyzing earlier sales deals: Also organizations must analyze the pre-sales deals that did not materialized. Few reasons that the sales deal did not materialize could have been like no fitting resources, no prior experience, poor estimates, high cost. Focused training would help organizations to fill in these gaps.
    2. Role based trainings: Next organizations must evaluate their employees. They must get the buy in from the employees being trained. Another thing that organizations must ensure is that these trainings are role-based. This means that if an lead level employee is trained on a new technology, organizations must figure out if the same employee would be able to play the similar lead role in that new technology. To play a specific role in any technology needs some prior experience and this becomes important for the senior roles especially. This holds true for not only during recession times but also during good times.
    3. Development of internal tools: Development of internal tools, are good for all organization in various ways such as:
      1. Building competency
      2. Manage operations effectively
      3. Evaluate a technology and
      4. Helps in sales pitch

    But before organizations jump into developing internal tools, they must take care of few things to be effective:

    1. Selection of right tools to be developed: Organizations must ensure that they are investing in developing the right set of internal tools, which are really needed. When approached, every manager or head of department would have a long list of internal tools that they would like to be developed for them. After all they are not paying for this. This would lead to a bigger mess (multiple applications using different technologies with no consolidated data) later if not managed properly. Organizations must take a holistic approach in deciding what tools to be developed, what technology to be used, and their priorities. Organizations must not use any technology just for the competency building sake. This would help them in coordinating training programs effectively with greater success.
    2. Execution model:After identifying the tool to be developed and the team that is going to work on it, organizations have to make sure the project is executed in a right model, as if it is a live project for a customer. Organizations must not take any shortcuts here and swap the roles or responsibilities within the development team executing the project. For instance, project managers must not be collecting requirements if in live projects they aren’t suppose to play that role. Right set of people shall be given the right set of roles.
  3. PERIODIC EVALUATION: Competency building efforts shall not go on for a long time without having any mechanism to measure the success. Their success must be measured periodically. This would be good for both organizations as well as employees. Organizations would be able to deploy new people effectively. If employees are not convinced within themselves, they would not be giving their 100%.

IT organiztions with an effective strategy, strong leadership and vision, would be able to build the right competency J . . .

Share and Enjoy:
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • description
  • LinkedIn
  • Live
  • MySpace
  • Slashdot
  • Technorati
  • TwitThis
  • description
  • E-mail this story to a friend!
  • Print this article!

Jagdish Malani Project Management

ETL Design Pattern : E-LT-L

March 6th, 2009

I was ramping up C# to create BI Framework and I hit upon the term “Design Patterns”. I decided to go through few patterns only, as I did not have patience to complete the book, also I assumed I don’t need it. Finally at the end of day, I was able to implement a couple of patterns in my BI Framework application. During this time, I wondered whether there are any pre-defined design patterns in Data Warehouses, ETL, Cubes, and Universes. My quest begins … In this blog, I will focus on ETL alone. 

I started looking back at my projects to analyze if I had used any design patterns or whether I hit upon the recurring problems that could have been solved with alternate designs. What I found was quite interesting. I have been using best practices defined for each component all along. Then why were we slogging all the time. What were the problems we were facing every day? Now the right time! I decided to plunge deep in each component to find out exactly what I could have done differently during design which could have made my life easier. And that is where I came up with an ETL design pattern E-LT-L.  

Before I explain what this E-LT-L means, let’s look at ETL first. ETL stands for Extraction, Transformation and Loading. In ETL we extract data from multiple data sources and transform the incoming data in a format compatible with data warehouse structure, followed by loading into data marts. Usually companies use some commercial tools like SSIS, BO Data Integrator Designer, etc… The developers then make full use of these tools and they end up using most of the functionality provided in these tools. And to an extent, this looks right too, as the companies have made an investment for this purpose.  

Finally when companies, evaluate their ROI, the results are amazing. By following industry standards and using commercial ETL tools, training their development teams, the results do not look good. A new set of problems like performance problems, steep learning curves, fixing the parts that are not broken, buying additional hardware, etc. have come up.  From ETL perspective, even though, most companies have a designated job server, they do not get a good performance. After a while in production, when there data marts grow in size or the size of incoming data increases, the performance of ETL job takes a hit. To resolve these issues, many things are done like increasing configuration of job server, make changes in database structure, use bulk loading options (favorite choice for techies), split the jobs, pushing few things like summarization over  to weekends, etc.. This results in making the system more complex than it should be which has a direct hit on overall IT cost.    

So exactly what went wrong?  

This is where a new pattern E-LT-L comes into existence. Most of the recurring problems can be resolved using this design pattern.  

E-LT-L stands for Extraction, Loading and Transformation, and Loading. This basically suggests that, once the data is extracted,  instead of applying transformations (T) in the staging area, load this data into data mart; and then apply transformations there (LT).  Since, incoming (raw) data is in data mart already, it would make more sense to use database objects (stored procedures) for transformations instead of row-based transformations available in ETL tools. Using database based transformations would resolve most of the problems. 

Effectively, this pattern calls for BULK loading and transformations at correct place without moving huge amount of data around.  

This may sound strange and many people would agree to debate. To prove the point, I would present few scenarios and let you decide what is good for your implementation.  

Scenario 1: Let’s assume you have 1 million incoming records that need to do look up for say customers. No matter what tool you use and how much you configure it, it has to run some sql against customer master, which is typically huge in may DW installations, to get the customer code. This will become a bottleneck as the sql would be executed multiple times. Also to add to the woes, the “customer code” column value has to travel from database server to job server and is stored in the placeholder (variable) in incoming row. Instead if you code the lookup transformations, in stored procedure, with one sql you can update all the rows by simply joining the staging table with customer master table. By all counts, the performance of this sql cannot be beaten.  

Scenario 2: We have incoming product master for a large retail chain. We need to implement slowly changing dimension type 2 here (insert new records and update existing records if already present. Any ETL tool would implement this transformation row by row and it can get painfully slow. This transform can be easily developed in stored procedure using MERGE sql statement. For more experienced developers this can be made meta data driven.  

To summarize, the transformations done using stored procedures help as follows:

  • It does not move data between database servers and job servers. We get a good boost in performance.
  • The data is processed in bulk.
  • Database engine (Oracle,SQL Server) is designed to do this job in the most efficient manner than any third party ETL tool can do.
  • It is always easy to fix/debug the stored procedure than some transform in ETL tool.
  • Small learning curve for developers to ramp up on the ETL tools.
  • Deployment of ETL packages (which is surely a problem in ETL tools). It takes a lot of effort.

I am sure, many of you would be thinking then why we shall use any commercial tool for ETL at all. These tools have their own significance. We would need a commercial ETL tool to achieve other goals of ETL, such as:

  • Defining a workflow where we decide upon what tasks shall execute and in what order. What to do if some task fails.
  • Executing tasks in parallel. 
  • Even though a stored procedure is created to transform some data, it needs to be called from an appropriate place in ETL.
  • Use the logging/notifications capabilities of ETL tool, which usually are very efficient and simple. 
  • Use the scheduling and other features readily available in commercial ETL tools 

What this E-LT-L pattern suggests is the design of ETL architecture. By having good ETL tool and good design of ETL, we can write efficient and manageable jobs. 

If you have any questions or comments, you can reach me at jagdishm@aditi.com.

Share and Enjoy:
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • description
  • LinkedIn
  • Live
  • MySpace
  • Slashdot
  • Technorati
  • TwitThis
  • description
  • E-mail this story to a friend!
  • Print this article!

Jagdish Malani Architecture & Design, BI - Business Intelligence, Performance ,