Controlling Your Data Part II: Extraction, Transformation and Loading
by Kevin Pledge
In the previous article I discussed the basic strategies for controlling data. In this article I describe Extraction, Transformation and Loading (ETL) tools and what makes them successful.
ETL tools are designed to manage the transfer of data between systems, including transforming the data in the process. Data transformations may range from simple changes of data types to complex merging of records. ETL tools are typically used in data warehouse applications.
ETL tools present the data flow in a graphical format, allowing complex multi-step transformations to be built and managed.
A key component of ETL tools is metadata; metadata is data that is created to support the transformation process; such as definitions and other things, like who uses it, how it's used, when it was processed, etc. There are two types of metadata-business metadata, describing how the data is used; and technical metadata, such as extract time, processes applied and process time.
Success or Failure?
Insurance business can present a number of challenges to the ETL process. Administration systems are typically heavily modified to the point where no two installations of the same system are the same. Departments often operate as independent functional silos and, more often than not, store data in unstructured formats, such as spreadsheets. Finally, the nature of the business results in complex business rules and large amounts of data, including data that may have been converted several times in the past.
This is exactly the type of challenge that ETL tools are designed for, but even with the best ETL tools on the market and the best staff, success is not guaranteed.
The key factors that lead to the success of the ETL process are as follows:
Make sure the target data structure is clearly defined. The target structure should drive the design rather than the source data.
Since the structure of the target is so important, make sure the business users are involved in the definition process.
Be sure to embed ongoing audit checks. However, you shouldn't rely solely on these checks; data errors can be unpredictable, so ad hoc investigation is also important.
The main causes of failure of an ETL implementation are:
The target data structure; just as a well defined structure is important for the success, failure to define the structure upfront or changes to the data structure can result in significant rework.
Over focus on the technological features of the tools.
Attempting to consolidate data in one definition when multiple definitions are required.
Choosing an ETL Tool
There are a large number of ETL vendors available. Informatica has been the traditional market leader. The major database vendors, including Microsoft, Oracle and IBM, have an ETL tool in their suite of tools. Finally, there are a number of niche vendors that offer some appealing features.
As with many software applications, there has been significant market consolidation in recent times, so going with a small niche vendor may present a support risk.
Each ETL vendor will tout their unique features and it is true that their ability to execute complex transformations can vary a great deal. However, at the end of the day, it is the application design that will have the greatest impact.
While ETL tools are designed to control data, there is no silver bullet solution. The success of the ETL implementation is usually more dependent on target definition than the tool itself. Data management is not solely a technology problem, business users play a critical role. However, ETL tools can facilitate the process by making the data-flow more accessible without the need for extensive coding.