ETL
Tools and Data Cleaning
[Data quality problems are first addressed at the
Extract, Transform, Load (ETL) stage, where heterogeneous data
is staged and processed for data inconsistencies and schema
related transformations. However, data cleaning is multi phase
activity and there is no single tool that comprehensively
cleans the data and enables the schema transformations.
Unfortunately, data inspection and cleaning is still largely a
manual process which demand user defined codes for
identification of inconsistencies in instance of data. The need
of the hour is for a tool that will limit manual inspection and
automate the process of Extract, transform and load
(ETL)]
Data cleaning or scrubbing is a process of
identifying the errors and inconsistencies in data and removing
them using ETL tools that are off the shelf or homegrown.
Single data collections can present “dirty data” that needs to
be cleaned of misspellings, data entry errors, missing
information and other invalid data. Heterogeneous data drawn
from multiple sources present a number of errors and
inconsistencies, which need to be resolved before the data can
be meaningfully integrated for use by the organization.
When the data integration requirement scale increases in data
warehouses, federated database systems or global web based
information systems, the quality of data assumes paramount
importance and the need for cleaning such data increases
multifold. Redundancy of data, accuracy of data, consistency of
data and issues relating to consolidation of data and
de-duplication of information become urgent in the face of
mission critical decisions that need to be taken on the basis
of such data.
Every data warehouse project begins with a process of Extract,
Transform and Load (ETL) wherein data drawn from heterogeneous
sources are first staged, then cleaned and transformed before
it is loaded for use in the data warehouse. The structure,
representation or content of the data, so staged, will demand
data transformation. The transformation may deal with schema
evolution, migration of a legacy system or integration of
multiple data sources. The data differences could lie in the
attributes, data types, component structures, integrity
constraints and a host of other factors at the schema level or
in differences in value representations, gender representations
or availability of information at different aggregation levels
in an instance.
Schema integration and data cleaning are both to be done before
solutions can be found. While schema related problems of data
could be addressed by improving schema design, schema
integration or schema translation, instance level problems will
require removal of errors and inconsistencies in data content.
The problem of data cleaning is further compounded if multiple
data sources have to be integrated and these disparate data
sources contain dirty data that is differently represented,
overlapping or contradicting. Naming convention differences and
conflicts can make the process difficult or structural
variations may render the whole a nightmarish experience.
Data cleaning, therefore, is a long drawn process that is done
in several phases and maybe in several iterations. A detailed
data analysis is followed by a manual inspection of data or
data samples. The data samples are then used to gain metadata
and define data properties and also evaluate data quality.
Transformation workflow and mapping rules are used to describe
the “dirtiness” of data and the number of steps that may be
required to clean and transform the data. Schema translations
are used to map sources to a common data model using relational
representations. The query and mapping language used are
designed to enable the automatic generation of transformation
code and invoke the necessary user defined cleaning codes and
any special purpose ETL tools designed. The correctness and
efficiency of the transformation workflow and definitions are
then tested and evaluated. Finally, the data warehouse is
refreshed and the cleaned data replaces the dirty data in the
original sources in order to ensure that the legacy
applications are loaded with the cleaned data and the wheel is
not reinvented again and again.
Unfortunately, large portions of this ETL data cleaning process
is achieved manually or by the use of low-level programs that
allow the user specify cleaning functionality through a
proprietary API. These programs are extremely difficult to
write and maintain and there is no automatic data analysis
support to auto detect data errors and inconsistencies. A few
transformation libraries provide for data transformation and
cleaning needs such as data type conversions, string functions,
arithmetic functions and statistical functions. Extraction
values from free form attributes are also not automatic and
support for instance matching is restricted to the use of the
join construct. The few ETL tools available for data cleaning
are focused upon specific types of data cleaning such as
duplicate elimination or identification of errors in names and
addresses. Hence it becomes necessary to support these tools
with other complementary tools for a more effective data
cleaning process. However, this effort too, proves ineffective
as many of these ETL tools suffer from interoperability
problems.
The need of the hour is for ETL tools that are designed to
limit the manual inspection of data and is extensible enough to
be reused with other data sources. The tool should be capable
of performing data cleaning activities in conjunction with
schema related data transformations based on comprehensive
metadata. Declarative mapping functions for data cleaning
should also be components that can be used with more than one
data source or query function. The tool should support workflow
infrastructure and execute data transformation steps for
multiple sources efficiently and reliably.
|