ETL Tools and Data Cleaning
[Data quality problems are first addressed at the Extract, Transform, Load (ETL) stage, where heterogeneous data is staged and processed for data inconsistencies and schema related transformations. However, data cleaning is multi phase activity and there is no single tool that comprehensively cleans the data and enables the schema transformations. Unfortunately, data inspection and cleaning is still largely a manual process which demand user defined codes for identification of inconsistencies in instance of data. The need of the hour is for a tool that will limit manual inspection and automate the process of Extract, transform and load (ETL)]
Data cleaning or scrubbing is a process of
identifying the errors and inconsistencies in data and
removing them using ETL tools that are off the shelf or
homegrown. Single data collections can present “dirty data”
that needs to be cleaned of misspellings, data entry errors,
missing information and other invalid data. Heterogeneous
data drawn from multiple sources present a number of errors
and inconsistencies, which need to be resolved before the
data can be meaningfully integrated for use by the
organization.
When the data integration requirement scale increases in
data warehouses, federated database systems or global web
based information systems, the quality of data assumes
paramount importance and the need for cleaning such data
increases multifold. Redundancy of data, accuracy of data,
consistency of data and issues relating to consolidation of
data and de-duplication of information become urgent in the
face of mission critical decisions that need to be taken on
the basis of such data.
Every data warehouse project begins with a process of
Extract, Transform and Load (ETL) wherein data drawn from
heterogeneous sources are first staged, then cleaned and
transformed before it is loaded for use in the data
warehouse. The structure, representation or content of the
data, so staged, will demand data transformation. The
transformation may deal with schema evolution, migration of
a legacy system or integration of multiple data sources. The
data differences could lie in the attributes, data types,
component structures, integrity constraints and a host of
other factors at the schema level or in differences in value
representations, gender representations or availability of
information at different aggregation levels in an instance.
Schema integration and data cleaning are both to be done
before solutions can be found. While schema related problems
of data could be addressed by improving schema design,
schema integration or schema translation, instance level
problems will require removal of errors and inconsistencies
in data content. The problem of data cleaning is further
compounded if multiple data sources have to be integrated
and these disparate data sources contain dirty data that is
differently represented, overlapping or contradicting.
Naming convention differences and conflicts can make the
process difficult or structural variations may render the
whole a nightmarish experience.
Data cleaning, therefore, is a long drawn process that is
done in several phases and maybe in several iterations. A
detailed data analysis is followed by a manual inspection of
data or data samples. The data samples are then used to gain
metadata and define data properties and also evaluate data
quality. Transformation workflow and mapping rules are used
to describe the “dirtiness” of data and the number of steps
that may be required to clean and transform the data. Schema
translations are used to map sources to a common data model
using relational representations. The query and mapping
language used are designed to enable the automatic
generation of transformation code and invoke the necessary
user defined cleaning codes and any special purpose ETL
tools designed. The correctness and efficiency of the
transformation workflow and definitions are then tested and
evaluated. Finally, the data warehouse is refreshed and the
cleaned data replaces the dirty data in the original sources
in order to ensure that the legacy applications are loaded
with the cleaned data and the wheel is not reinvented again
and again.
Unfortunately, large portions of this ETL data cleaning
process is achieved manually or by the use of low-level
programs that allow the user specify cleaning functionality
through a proprietary API. These programs are extremely
difficult to write and maintain and there is no automatic
data analysis support to auto detect data errors and
inconsistencies. A few transformation libraries provide for
data transformation and cleaning needs such as data type
conversions, string functions, arithmetic functions and
statistical functions. Extraction values from free form
attributes are also not automatic and support for instance
matching is restricted to the use of the join construct. The
few ETL tools available for data cleaning are focused upon
specific types of data cleaning such as duplicate
elimination or identification of errors in names and
addresses. Hence it becomes necessary to support these tools
with other complementary tools for a more effective data
cleaning process. However, this effort too, proves
ineffective as many of these ETL tools suffer from
interoperability problems.
The need of the hour is for ETL tools that are designed to
limit the manual inspection of data and is extensible enough
to be reused with other data sources. The tool should be
capable of performing data cleaning activities in
conjunction with schema related data transformations based
on comprehensive metadata. Declarative mapping functions for
data cleaning should also be components that can be used
with more than one data source or query function. The tool
should support workflow infrastructure and execute data
transformation steps for multiple sources efficiently and
reliably.