What is ETL
and Why is it Important
The modern day knowledge based economy demands
that organizations take informed mission critical decisions for
retaining a competitive edge. The required information is
garnered from multifarious sources and formats. Historical
repositories within the organization, online transactional
databases and external sources of information are integrated
for purposes of multidimensional analysis and interpretation.
Database integration and data mining are mantras of the
information age business houses. Data mining software, data
mining tools and data warehouses are the gold mines from which
informed business decisions are to be made. Implicit in this
process of mixing and matching data from different sources, is
the need to identify relevant information, reconcile
inconsistencies and deficiencies in data and incompatibility of
data types and to harvest information that is clean, consistent
and correct. One of the powerful methodologies adopted by
business entities to ease the process of data integration and
data transfer is the ‘Extract, Transform and Load’ or ETL.
What is ETL? ETL is the process that enables
companies to move data from multiple sources, reformat and
cleanse it, and load it into another database, a data mart or a
data warehouse for analysis, or on to another operational
system to support a business process. It presupposes that the
organization has defined its business needs and identified the
relevant sources of data for extraction, transformation and
loading. It assumes that the organization needs to now
understand the data and examine the incompleteness, inaccuracy
or invalidity of source data it is harnessing in order to
extract it, transform it and load it into the new repository
for meaningful analysis.
The process of data integration, therefore,
begins with data profiling —which aims at discovering the
quality and characteristics of organizational data sources.
Data profiling automatically identifies quality data by mapping
basic statistics, frequencies and ranges; identifying multiple
spellings of the same content; discovering and validating data
patterns and formats; setting out numeric range analysis;
identifying and validating redundant data and defining
primary/foreign key relationships across data sources;
identifying duplicate name and address or null information;
validating data specific business rules within a record or
across sources and so on. In other words, it uncovers the
anomalies of data by examining the content, structure and
hidden relationships between data sources and lays the
foundation for monitoring the quality of data.

The next stage is the process of transforming
data or modifying data. Data Transformation involves the
correction of errors, reformatting of data, standardization of
data across tables, validation of inconsistent and inaccurate
data and even summarization of data based on content.
Transformation could also include transforming fixed record
formats to variable formats and vice versa. Expansion of
records with additional fields containing information from
other demographic information would also fall within the
purview of data transformation. The process helps to cleanse
data as well as integrate data and make it ready for loading
into the data mart or data warehouse.
The load process can be simple or extremely
complex. Old data can be overwritten or new data can be
appended to the old data. Audit trial of changes can be
maintained for historical review of the changes made to
data.
It is apparent from the above that a majority of
the data warehousing project timeline is spent in selecting the
data sources, extracting, transforming and loading data. The
direction of the effort is to create a cost effective and
efficient ETL System to extract, transform, and load data into
the integrated database. Though, ETL systems can be created
from the scratch by organizations using almost any kind of
programming language, improperly designed ETL systems and data
transfer tools can be disastrous to the organization.
Unexpected changes in the format of the source systems can
destroy or corrupt significant amounts of data in the target
system. Moreover, if update cycles of data are not in
Synchronous mode, the ETL system will hold back data until all
systems are synchronized and can be updated. Fractures in
integration of data could result in information being
incorrectly presented. An error of any kind or lack of
foresight while designing the system can result in the entire
process becoming a costly white elephant. It is for this reason
that organizations are increasingly buying off the shelf ETL
tools that help in creating ETL systems and processes without
the hiccups of labored and experiential creativity.
A good ETL tool can communicate with many
different kinds of data sources. ETL tools are designed to read
various file formats and harvest data stored in them. They can
gather excel worksheet data and import it into Access and from
Access to SQL Sever; from SQL Server to Oracle and so on. ETL
tools have been integrated into Enterprise Application
Integration (EAI) systems. Today they not only help users in
extraction, transformation and loading of data, they also help
the user understand the inadequacies of the data they are
handling and provide them with tools to transform such data and
to ensure data quality. The metadata capabilities of these
tools also streamline the process of ETL.
Though the ETL market is rapidly growing, it is
still in a state of flux. There are no major vendors with
integrated Data warehouse solutions. Packaged ETL systems
include Microsoft, which offers data transformation services
bundled with its SQL Server database. Oracle has embedded some
ETL capabilities in its database, and IBM offers a DB2
Information Integrator component for its warehouse offerings.
DataFlux, Informatica, Ascential, DataMirror iDeliver are some
of the ETL tools available in the very high end market. There
are very few tools that are priced for the small and mid size
market. VisualETL is one such tool that while powerful, it is
also reasonably priced so you can receive ROI after just one
small ETL project. Therefore, it is important to make a
detailed assessment of the business requirements and understand
the data that will be handled before selecting an ETL software
tool that is the best fit for your particular needs and
budget.
|