Planning for an ETL Project
[There are no silver bullets in ETL project planning process. It is sheer hard work and dedication that will determine the success or failure of the project. The process of collecting, collating, extracting and processing data is not very simple. The complexity of the process is best understood when you begin to plan for an ETL project. Planning for an ETL project begins with a needs analysis and includes planning for an entire gamut of activities that begins with the creation of the basic ETL design and ends with the process being moved into the production environment after testing and user validation]
All organizations today are inundated with
data. Disparate legacy systems and non-standardized business
practices add to the growing volume of unstructured
information that is available with organizations. Data
warehouse and business intelligence projects are initiated
with the view to create repositories of data that support
decision-making and involves extraction of mission critical
data, cleaning of data and organization of data in
meaningful ways. A large portion of this exercise is devoted
to the process of Extract, Transform and Load (ETL)
development. This is the stage at which all data challenges
are addressed and resolved and it consumes about 40-70% of
the time and resources of the Data warehouse project.
Therefore the design of a robust ETL architecture is of
prime importance to the success of the business intelligence
project and is a key factor in determining whether the
information is tailored to the needs of the various
stakeholders of the organization.
Every project begins with a concept that is defined by the
vision, mission, goals and objectives of the organization.
The commitment of the organization to the achievement of
these goals and objectives is a driver of the initiative.
The ETL project, too, begins with a felt need for structured
information that will support management decisions. It is
the determinant of the environment in which the project is
conceived and executed.
The project life cycle of an ETL project is no different
from the life cycle of other projects. The questions that
need to be asked are those that will be asked at the
inception of every kind of project.
1. Is the management committed to project?
2. Are the stakeholders available for any kind of
discussion?
3. Are they willing to give the necessary time and effort to
the initiative?
4. What is the project scope and what are the deliverables?
5. What is the time line?
6. How detailed are the requirements?
7. Are there sufficient resources committed to the project?
8. Are there any gaps in resource deployment that need to be
taken care of?
9. Has a cost-benefit analysis been performed?
10. What is the return on investment expected?
11. Is the project outcomes scalable and will it be able to
meet the changing needs of the organization for a specified
time frame?
12. Are the team members identified?
13. Are the member’s full time or part time?
14. What are their skill levels?
15. What kind of training needs to be imparted to such
members?
At the core of the ETL project is the needs analysis--a
definition of the scope of work. An understanding of the
business needs is a good starting point. While the process
sounds very simple, it is extremely complicated. Users are
often not prepared to discuss dynamic long-term needs of the
organization. Static reporting requirements are uppermost in
their minds and they often do not think beyond the immediate
needs. While static reporting requirements is a good
starting point, it cannot be the goal of an ETL project. The
project head must make an all out effort to discover data
elements by framing a series of questions that help discover
sources of data that can be used by the organization to
empower itself. However, the danger that lurks in the
discovery of additional data sources is the possibility of
data overload, which can ruin the ETL architecture. So, the
ETL project plan should ensure that the project starts small
and remains simple, while retaining flexibility and
scalability.
The output of this stage of the ETL project plan, is a
logical description of the process of identifying the source
of data, designing the physical structure for the data,
extracting, transforming and loading data and identifying
the deliverables at every stage. Business Requirements
Specification templates are often used at this stage of the
project to define the different parameters. The record
layouts, location of source and target, File/table sizing
information and documentation of data transformation are
built up to assist the project manager in orderly
implementation of the project. These are known as mapping
specifications. Standardized mapping specification templates
are used or customized to suit the needs of the
organization.
The project implementation phase that follows is the source
of great concern and great frustration for the project
manager. After all, the objective of the ETL exercise is to
manage data and provide the required information to the end
users in formats specified by them. Data has to be acquired,
moved from one location to another, validated, transformed
and loaded into its destination. He needs to understand the
data combinations and define the expected results.
To this end, the project manager must choose the
architecture, which will support the extraction,
transformation and loading of the data. If he has to deal
with large volumes of data, he must consider the potential
of partitioning and running multiple processes to optimize
on the ETL architecture. He must grapple with issues
relating to Data profiling, monitoring of source systems,
need for continuous ETL and set profiling in order to avoid
the common pitfalls of an ETL project. He must include error
checks and decide how many test cases he will build. He must
have a bird’s eye view of the technical risks and security.
Finally he must be conscious that the test results must
validate the effort to meet business requirements. This
stage of the project plan would include documentation of the
test plan; the development of the test plan and the
performance test plan using pre defined or customized
templates.
The project manager at this stage of the project will be
defining the ETL object Migration Plan and the Database
object migration plan. He will be completing the ETL mapping
specifications and defining the cleanup process for log and
bad files. The test plan specifications drawn up will
include plans for unit, volume and performance tests.
The next stage in the ETL project plan would be a
Walkthrough ETL process. The project manager would identify
the common modules; evaluate the efficiency of the ETL code,
the business logic, accuracy and standardization. The
documentation of the test results is a deliverable.
The project then moves rapidly on to user validation tests
wherein the user validates the end product and certifies
whether or not it meets the business specifications set
forth at the start of the project. The business requirement
specification document forms the basis of the assessment.
If all business needs have been met the project manager
begins the move from prototype to production stage.
Thereafter the ETL process maintenance process needs to kick
in. This would include ETL process break definitions and ETL
process update specifications. The deliverables would
include Updated business requirements specifications,
updated mapping specifications, revised mapping
specification, updated ETL object migration form, developer
check methodologies for final results in production and
monitoring of ETL process.
The Silver Bullet
Achieving ETL success is not unlike achieving success in
other areas of data warehouse development. It takes a
willingness to experiment and to approach development in new
and sometimes unorthodox ways. Keep an open mind and
remember there are no silver bullets in ETL architecture.
Methods that yield promising results in some situations may
perform poorly in others. The approaches to development and
design discussed in this article are good starting points
for managing ETL processes and ultimately reaching your
goals in this critical area of data warehouse development.