What is ETL and Why is it Important
[Business houses strive to create meaningful and
integrated views of their business by harnessing their data
into business intelligence systems and data warehouses.
Efficient management of data requires that the data so harnessed
is cleansed of anomalies and gives an accurate picture of
the state of the business and helps the organization in
making informed mission critical decisions. Extract, Transform
and Load is a process of ensuring that data integrated from
multitude of sources meets the stringent norms of data validity,
provides for seamless integration and supports decision
making activities. Inefficient ETL systems will result in
distortion of information and corruption of target data
systems.]
The modern day knowledge based economy demands
that organizations take informed mission critical decisions
for retaining a competitive edge. The required information
is garnered from multifarious sources and formats. Historical
repositories within the organization, online transactional
databases and external sources of information are integrated
for purposes of multidimensional analysis and interpretation.
Database integration and data mining are mantras of the
information age business houses. Data mining software, data
mining tools and data warehouses are the gold mines from
which informed business decisions are to be made. Implicit
in this process of mixing and matching data from different
sources, is the need to identify relevant information, reconcile
inconsistencies and deficiencies in data and incompatibility
of data types and to harvest information that is clean,
consistent and correct. One of the powerful methodologies
adopted by business entities to ease the process of data
integration and data transfer is the ‘Extract, Transform
and Load’ or ETL.
What is ETL? ETL is the process that enables
companies to move data from multiple sources, reformat and
cleanse it, and load it into another database, a data mart
or a data warehouse for analysis, or on to another operational
system to support a business process. It presupposes that
the organization has defined its business needs and identified
the relevant sources of data for extraction, transformation
and loading. It assumes that the organization needs to now
understand the data and examine the incompleteness, inaccuracy
or invalidity of source data it is harnessing in order to
extract it, transform it and load it into the new repository
for meaningful analysis.
The process of data integration, therefore,
begins with data profiling —which aims at discovering
the quality and characteristics of organizational data sources.
Data profiling automatically identifies quality data by
mapping basic statistics, frequencies and ranges; identifying
multiple spellings of the same content; discovering and
validating data patterns and formats; setting out numeric
range analysis; identifying and validating redundant data
and defining primary/foreign key relationships across data
sources; identifying duplicate name and address or null
information; validating data specific business rules within
a record or across sources and so on. In other words, it
uncovers the anomalies of data by examining the content,
structure and hidden relationships between data sources
and lays the foundation for monitoring the quality of data.
 |
The next stage is the process of transforming
data or modifying data. Data Transformation involves the
correction of errors, reformatting of data, standardization
of data across tables, validation of inconsistent and inaccurate
data and even summarization of data based on content. Transformation
could also include transforming fixed record formats to
variable formats and vice versa. Expansion of records with
additional fields containing information from other demographic
information would also fall within the purview of data transformation.
The process helps to cleanse data as well as integrate data
and make it ready for loading into the data mart or data
warehouse.
The load process can be simple or extremely
complex. Old data can be overwritten or new data can be
appended to the old data. Audit trial of changes can be
maintained for historical review of the changes made to
data.
It is apparent from the above that a majority
of the data warehousing project timeline is spent in selecting
the data sources, extracting, transforming and loading data.
The direction of the effort is to create a cost effective
and efficient ETL System to extract, transform, and load
data into the integrated database. Though, ETL systems can
be created from the scratch by organizations using almost
any kind of programming language, improperly designed ETL
systems and data transfer tools can be disastrous to the
organization. Unexpected changes in the format of the source
systems can destroy or corrupt significant amounts of data
in the target system. Moreover, if update cycles of data
are not in Synchronous mode, the ETL system will hold back
data until all systems are synchronized and can be updated.
Fractures in integration of data could result in information
being incorrectly presented. An error of any kind or lack
of foresight while designing the system can result in the
entire process becoming a costly white elephant. It is for
this reason that organizations are increasingly buying off
the shelf ETL tools that help in creating ETL systems and
processes without the hiccups of labored and experiential
creativity.
A good ETL tool can communicate with many different
kinds of data sources. ETL tools are designed to read various
file formats and harvest data stored in them. They can gather
excel worksheet data and import it into Access and from
Access to SQL Sever; from SQL Server to Oracle and so on.
ETL tools have been integrated into Enterprise Application
Integration (EAI) systems. Today they not only help users
in extraction, transformation and loading of data, they
also help the user understand the inadequacies of the data
they are handling and provide them with tools to transform
such data and to ensure data quality. The metadata capabilities
of these tools also streamline the process of ETL.
Though the ETL market is rapidly growing, it
is still in a state of flux. There are no major vendors
with integrated Data warehouse solutions. Packaged ETL systems
include Microsoft, which offers data transformation services
bundled with its SQL Server database. Oracle has embedded
some ETL capabilities in its database, and IBM offers a
DB2 Information Integrator component for its warehouse offerings.
DataFlux, Informatica, Ascential, DataMirror iDeliver are
some of the ETL tools available in the very high end market.
There are very few tools that are priced for the small and
mid size market. VisualETL is one such tool that while powerful,
it is also reasonably priced so you can receive ROI after
just one small ETL project. Therefore, it is important to
make a detailed assessment of the business requirements
and understand the data that will be handled before selecting
an ETL software tool that is the best fit for your particular
needs and budget.