Using Pentaho Kettle
Pentaho Data Integration, AKA “Kettle”, is an Open Source ETL Solution
ETL (Extract, Transform, and Load) is a data warehousing process that involves:
- Extracting data from external source(s)
- Transforming data as necessary to fit application needs
- Loading data to target database
Kettle consists of a core data integration engine and GUI applications that allow the user to define data integration jobs and transformations.
Spoon is the primary GUI tool used to model the flow of data from input, through transformation, to output. Spoon is used to connect to source systems and “map” the data to the format required in the target system. This is accomplished by creating both transformations and jobs.
Transformations are the building blocks of an ETL process. They extract data from a file or table, cleanse or transform it, and load it into a target database. The transformations themselves can be strung together and execute in sequence by a Job.
The GUI works like a flow chart. After connecting to the source database, different “operations” are available to the user which will transform the data in a specific way. In this fashion, the operations are strung together, starting at the source, until the target format is reached.
A scripting engine allows for the creation of custom operations which fall outside those provided directly by Spoon. This allows virtually unlimited options for transforming data.
Pan is a command line tool that executes transformations modeled with Spoon. This makes it simple to script the execution of transformations, as well as schedule them to execute at a regular interval.
Kitchen works in a similar fashion to Pan, only for jobs.
Carte is a HTTP server which can be used to remotely execute both jobs and transformations.
Why do we use Kettle?
At LBi Software, many of our projects require an ETL component. We have developed the ETL component with two approaches: stored procedures and enterprise class ETL packages.
The benefit of the stored procedures is they do not require the purchase of additional software; the downside is that they require more of a development effort. The benefit of the enterprise class packages is simplified development, but at a cost.
Kettle fills the gap – we are able to reap the benefits of a powerful, full featured ETL tool without having to incur the cost of purchasing an expensive package.
Our Experience using Kettle
Some of the benefits we have found:
- Can easily connect to multiple data sources within a transformation, whether they be relational databases or flat files
- Ability to use parameters at runtime to reuse transformations/jobs for multiple processes without having to alter the source code
- Comprehensive logging and email features to track status of processes
Some of the challenges we have found:
- Lack of documentation; paid support is available, however freely available documentation is not comprehensive
- Logging output can be complicated and hard to follow for jobs/transformations with many steps
In summary, Kettle is a useful tool for projects that require an ETL component.
Is it possible to schedule Kettle jobs to run as needed and not just manually?