What is ETL?
ETL, or Extract, Transform, and Load, is the process of combining data from heterogeneous sources into a unified central repository. Usually this repository is a data warehouse or mart that supports enterprise business intelligence.
Extract - read data from multiple source systems into a single format. This process extracts the data from each native system and saves it to one target location. The source data may come from any number of databases, flat files, or document repositories. Usually the goal is to extract the entire source data set unmodified, though some checks and filters may be applied here to ensure the data meets an expected layout or to selectively remove data (e.g. potentially confidential information).
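As a rough illustration, the following sketch pulls rows from a flat file and a SQLite database and saves both to a common staging format. The file, database, and table names are hypothetical; a real extract would target whatever sources the enterprise actually runs:

```python
import csv
import json
import sqlite3

def extract_csv(path):
    """Read a flat-file source as a list of dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_table(db_path, table):
    """Read an entire table from a SQLite source as a list of dicts."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    rows = [dict(r) for r in conn.execute(f"SELECT * FROM {table}")]
    conn.close()
    return rows

def save_to_stage(rows, stage_path):
    """Persist the extracted rows unmodified, one JSON object per line."""
    with open(stage_path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

# Hypothetical sources: a CRM export and an orders database.
save_to_stage(extract_csv("crm_customers.csv"), "stage_customers.jsonl")
save_to_stage(extract_table("orders.db", "orders"), "stage_orders.jsonl")
```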
Transform - in this step, the data from the various systems is made consistent and linked. Some of the key operations, sketched in code after this list, are:
- Standardization – data is mapped to a consistent set of lookup values (e.g. US, USA, United States, and blank/null all mapped to the standard ISO country code)
- Cleansing – perform validity checks and either remove or modify problem data
- Surrogate keys – new key values applied to similar data from different source systems prevent future key collisions and provide a cross-reference across these systems
- Transposing – data is reorganized to optimize reporting. Many source systems are optimized for transactional performance, but the warehouse will be used primarily for reporting; often this involves denormalizing the data and reorganizing it into a dimensional model.
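To make the first three operations concrete, here is a minimal sketch in Python. The lookup table, column names, and in-memory key counter are illustrative only; a production warehouse would typically use database sequences for surrogate keys and a maintained reference table for the mappings:

```python
import itertools

# Standardization: sample mapping of raw country values to an ISO code.
# Per the example above, blank/null defaults to "US" for this hypothetical
# US-only source.
COUNTRY_MAP = {"US": "US", "USA": "US", "United States": "US", "": "US", None: "US"}

_next_key = itertools.count(1)  # surrogate key generator
_key_xref = {}                  # (source system, natural key) -> surrogate key

def transform(row, source_system):
    """Standardize, cleanse, and assign a surrogate key to one source row."""
    # Standardization: replace free-form country text with a consistent code.
    row["country"] = COUNTRY_MAP.get(row.get("country"), "UNKNOWN")
    # Cleansing: reject rows that fail a validity check.
    if not row.get("customer_id"):
        raise ValueError("missing natural key")
    # Surrogate key: the same source row always receives the same new key,
    # and keys from different source systems can never collide.
    xref = (source_system, row["customer_id"])
    if xref not in _key_xref:
        _key_xref[xref] = next(_next_key)
    row["customer_sk"] = _key_xref[xref]
    return row
```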
Load - the transformed data is now written out to a warehouse/mart. The load process will usually preserve prior data. In some instances, existing warehouse data is never removed, just marked as inactive; this provides full auditing and supports historical reporting.
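One common way to preserve prior data is to end-date old row versions rather than delete them. The sketch below uses an in-memory list and the timestamp columns described in the workflow section; a real load would apply the same pattern with SQL UPDATE and INSERT statements:

```python
from datetime import datetime, timezone

warehouse = []  # each row carries effective, update, and end timestamps

def load(row):
    """Insert a new row version, marking any active prior version inactive."""
    now = datetime.now(timezone.utc).isoformat()
    for existing in warehouse:
        if existing["customer_sk"] == row["customer_sk"] and existing["end_ts"] is None:
            existing["end_ts"] = now  # prior data is end-dated, never removed
            existing["update_ts"] = now
    row.update({"effective_ts": now, "update_ts": now, "end_ts": None})
    warehouse.append(row)
```

Because nothing is deleted, the full history of every row remains available for auditing and historical reporting.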
ETL Tools
There are a number of commercial and open source ETL tools available to assist in any ETL process. Some of the prominent ones are:
- Business Objects Data Integrator
- Informatica PowerCenter
- IBM InfoSphere DataStage
- Oracle Warehouse Builder / Data Integrator
- Microsoft SQL Server Integration Services
- Pentaho Data Integration (Open Source)
- Jasper ETL (Open Source)
These tools provide a number of functions to facilitate the ETL workflow: a variety of source data types is handled automatically, a transformation engine makes it easy to create reusable data-mapping scripts, and scheduling and error handling are built in.
It is particularly advantageous to use an ETL tool in the following situations:
- When there are many source systems to be integrated
- When source systems are in different formats
- When this process needs to be run repeatedly (e.g. daily, hourly, real time)
- To take advantage of pre-built warehouses/marts. Many of these exist for popular platforms such as PeopleSoft, SAP, and JD Edwards.
There are also times when the overhead and cost of setting up an ETL tool are not justified. In these situations, some combination of stored procedures, custom code, and off-the-shelf packages may make more sense. Scenarios of this type include:
- One time conversion of data
- A limited number of source systems that share key identifiers
Sample Workflow
A typical ETL workflow moves the data through a few distinct phases. This allows each phase to be better defined and eases troubleshooting.
Source > Extract > Stage – this phase extracts the appropriate data from each source system, copying only data that has changed since the last run. The stage library holds all source information in a structure similar to how it appears in the source systems, and extracted information remains in stage until it is successfully processed by the transform.
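A changed-data extract of this kind is usually driven by a high-water mark recorded at the end of each successful run. A minimal sketch against a hypothetical orders table:

```python
import sqlite3

def incremental_extract(conn, last_run_ts):
    """Copy only source rows modified since the previous successful run."""
    return conn.execute(
        "SELECT * FROM orders WHERE last_modified > ?", (last_run_ts,)
    ).fetchall()

# Demo against an in-memory source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_modified TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 9.99, '2024-01-02'), (2, 5.00, '2024-01-05')")
changed = incremental_extract(conn, "2024-01-03")  # only order 2 is copied
# The new high-water mark ('2024-01-05') is saved only after the stage write
# succeeds, so a failed run is simply re-extracted on the next attempt.
```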
Stage > Transform > Warehouse – the data in stage is transformed and loaded into a single warehouse. In this phase, surrogate keys are added where needed, lookup value mappings are applied, and related information from multiple source systems is combined into a single structure. Any errors encountered are reported, and the problem data remains in stage until corrected. No information is removed from the warehouse; all data there is tagged with effective, update, and end timestamps.
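The error handling described here can be as simple as partitioning the stage rows into successes and failures, with failures left behind for the next run. A sketch, reusing the hypothetical transform function from earlier:

```python
def process_stage(stage_rows, transform, source_system):
    """Transform stage rows; rows that fail remain in stage for the next run."""
    loaded, remaining = [], []
    for row in stage_rows:
        try:
            loaded.append(transform(row, source_system))
        except ValueError as err:
            print(f"transform error: {err}")  # report the problem
            remaining.append(row)             # problem data stays in stage
    return loaded, remaining
```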
Warehouse > Load > Mart – the currently effective data from the warehouse is loaded into the mart to support analysis. While this is the final load of the process, the step also includes a transform of the data into a dimensional form optimized for reporting and analysis.
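As a final illustration, the mart load filters the warehouse down to currently effective rows and reshapes them for reporting. The column names continue the hypothetical customer example:

```python
def load_mart(warehouse_rows):
    """Copy only currently effective rows into a reporting-friendly shape."""
    # Filter: active versions only (no end timestamp).
    current = [r for r in warehouse_rows if r["end_ts"] is None]
    # Reshape: keep just the reporting columns for a customer dimension.
    dim_customer = [
        {"customer_sk": r["customer_sk"], "country": r["country"]}
        for r in current
    ]
    return dim_customer
```

A fuller implementation would build out the complete dimensional model here, including fact tables keyed by the surrogate keys assigned during the transform.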
Business intelligence in the enterprise is greatly enhanced by unified data. ETL can be an important tool when combining heterogeneous sources into one cohesive central repository.