Visual ETL
Visual Database Migration Tool

  Extract, Transform and Load
Complex Data from Multiple
Data Sources using a Simple
Visual Wizard Like Interface.

Visual ETL: Making Data Migrations Easy
 
 

ETL Tools and Data Cleaning

[Data quality problems are first addressed at the Extract, Transform, Load (ETL) stage, where heterogeneous data is staged and processed for data inconsistencies and schema related transformations. However, data cleaning is multi phase activity and there is no single tool that comprehensively cleans the data and enables the schema transformations. Unfortunately, data inspection and cleaning is still largely a manual process which demand user defined codes for identification of inconsistencies in instance of data. The need of the hour is for a tool that will limit manual inspection and automate the process of Extract, transform and load (ETL)]

Data cleaning or scrubbing is a process of identifying the errors and inconsistencies in data and removing them using ETL tools that are off the shelf or homegrown. Single data collections can present “dirty data” that needs to be cleaned of misspellings, data entry errors, missing information and other invalid data. Heterogeneous data drawn from multiple sources present a number of errors and inconsistencies, which need to be resolved before the data can be meaningfully integrated for use by the organization.

When the data integration requirement scale increases in data warehouses, federated database systems or global web based information systems, the quality of data assumes paramount importance and the need for cleaning such data increases multifold. Redundancy of data, accuracy of data, consistency of data and issues relating to consolidation of data and de-duplication of information become urgent in the face of mission critical decisions that need to be taken on the basis of such data.

Every data warehouse project begins with a process of Extract, Transform and Load (ETL) wherein data drawn from heterogeneous sources are first staged, then cleaned and transformed before it is loaded for use in the data warehouse. The structure, representation or content of the data, so staged, will demand data transformation. The transformation may deal with schema evolution, migration of a legacy system or integration of multiple data sources. The data differences could lie in the attributes, data types, component structures, integrity constraints and a host of other factors at the schema level or in differences in value representations, gender representations or availability of information at different aggregation levels in an instance.

Schema integration and data cleaning are both to be done before solutions can be found. While schema related problems of data could be addressed by improving schema design, schema integration or schema translation, instance level problems will require removal of errors and inconsistencies in data content. The problem of data cleaning is further compounded if multiple data sources have to be integrated and these disparate data sources contain dirty data that is differently represented, overlapping or contradicting. Naming convention differences and conflicts can make the process difficult or structural variations may render the whole a nightmarish experience.

Data cleaning, therefore, is a long drawn process that is done in several phases and maybe in several iterations. A detailed data analysis is followed by a manual inspection of data or data samples. The data samples are then used to gain metadata and define data properties and also evaluate data quality. Transformation workflow and mapping rules are used to describe the “dirtiness” of data and the number of steps that may be required to clean and transform the data. Schema translations are used to map sources to a common data model using relational representations. The query and mapping language used are designed to enable the automatic generation of transformation code and invoke the necessary user defined cleaning codes and any special purpose ETL tools designed. The correctness and efficiency of the transformation workflow and definitions are then tested and evaluated. Finally, the data warehouse is refreshed and the cleaned data replaces the dirty data in the original sources in order to ensure that the legacy applications are loaded with the cleaned data and the wheel is not reinvented again and again.

Unfortunately, large portions of this ETL data cleaning process is achieved manually or by the use of low-level programs that allow the user specify cleaning functionality through a proprietary API. These programs are extremely difficult to write and maintain and there is no automatic data analysis support to auto detect data errors and inconsistencies. A few transformation libraries provide for data transformation and cleaning needs such as data type conversions, string functions, arithmetic functions and statistical functions. Extraction values from free form attributes are also not automatic and support for instance matching is restricted to the use of the join construct. The few ETL tools available for data cleaning are focused upon specific types of data cleaning such as duplicate elimination or identification of errors in names and addresses. Hence it becomes necessary to support these tools with other complementary tools for a more effective data cleaning process. However, this effort too, proves ineffective as many of these ETL tools suffer from interoperability problems.

The need of the hour is for ETL tools that are designed to limit the manual inspection of data and is extensible enough to be reused with other data sources. The tool should be capable of performing data cleaning activities in conjunction with schema related data transformations based on comprehensive metadata. Declarative mapping functions for data cleaning should also be components that can be used with more than one data source or query function. The tool should support workflow infrastructure and execute data transformation steps for multiple sources efficiently and reliably.