Planning for an ETL
Project
[There are no silver bullets in ETL
project planning process. It is sheer hard work
and dedication that will determine the success
or failure of the project. The process of
collecting, collating, extracting and
processing data is not very simple. The
complexity of the process is best understood
when you begin to plan for an ETL project.
Planning for an ETL project begins with a needs
analysis and includes planning for an entire
gamut of activities that begins with the
creation of the basic ETL design and ends with
the process being moved into the production
environment after testing and user
validation]
All organizations today are
inundated with data. Disparate legacy systems
and non-standardized business practices add to
the growing volume of unstructured information
that is available with organizations. Data
warehouse and business intelligence projects
are initiated with the view to create
repositories of data that support
decision-making and involves extraction of
mission critical data, cleaning of data and
organization of data in meaningful ways. A
large portion of this exercise is devoted to
the process of Extract, Transform and Load
(ETL) development. This is the stage at which
all data challenges are addressed and resolved
and it consumes about 40-70% of the time and
resources of the Data warehouse project.
Therefore the design of a robust ETL
architecture is of prime importance to the
success of the business intelligence project
and is a key factor in determining whether the
information is tailored to the needs of the
various stakeholders of the organization.
Every project begins with a concept that is
defined by the vision, mission, goals and
objectives of the organization. The commitment
of the organization to the achievement of these
goals and objectives is a driver of the
initiative. The ETL project, too, begins with a
felt need for structured information that will
support management decisions. It is the
determinant of the environment in which the
project is conceived and executed.
The project life cycle of an ETL project is no
different from the life cycle of other
projects. The questions that need to be asked
are those that will be asked at the inception
of every kind of project.
1. Is the management committed to project?
2. Are the stakeholders available for any kind
of discussion?
3. Are they willing to give the necessary time
and effort to the initiative?
4. What is the project scope and what are the
deliverables?
5. What is the time line?
6. How detailed are the requirements?
7. Are there sufficient resources committed to
the project?
8. Are there any gaps in resource deployment
that need to be taken care of?
9. Has a cost-benefit analysis been
performed?
10. What is the return on investment
expected?
11. Is the project outcomes scalable and will
it be able to meet the changing needs of the
organization for a specified time frame?
12. Are the team members identified?
13. Are the member’s full time or part
time?
14. What are their skill levels?
15. What kind of training needs to be imparted
to such members?
At the core of the ETL project is the needs
analysis--a definition of the scope of work. An
understanding of the business needs is a good
starting point. While the process sounds very
simple, it is extremely complicated. Users are
often not prepared to discuss dynamic long-term
needs of the organization. Static reporting
requirements are uppermost in their minds and
they often do not think beyond the immediate
needs. While static reporting requirements is a
good starting point, it cannot be the goal of
an ETL project. The project head must make an
all out effort to discover data elements by
framing a series of questions that help
discover sources of data that can be used by
the organization to empower itself. However,
the danger that lurks in the discovery of
additional data sources is the possibility of
data overload, which can ruin the ETL
architecture. So, the ETL project plan should
ensure that the project starts small and
remains simple, while retaining flexibility and
scalability.
The output of this stage of the ETL project
plan, is a logical description of the process
of identifying the source of data, designing
the physical structure for the data,
extracting, transforming and loading data and
identifying the deliverables at every stage.
Business Requirements Specification templates
are often used at this stage of the project to
define the different parameters. The record
layouts, location of source and target,
File/table sizing information and documentation
of data transformation are built up to assist
the project manager in orderly implementation
of the project. These are known as mapping
specifications. Standardized mapping
specification templates are used or customized
to suit the needs of the organization.
The project implementation phase that follows
is the source of great concern and great
frustration for the project manager. After all,
the objective of the ETL exercise is to manage
data and provide the required information to
the end users in formats specified by them.
Data has to be acquired, moved from one
location to another, validated, transformed and
loaded into its destination. He needs to
understand the data combinations and define the
expected results.
To this end, the project manager must choose
the architecture, which will support the
extraction, transformation and loading of the
data. If he has to deal with large volumes of
data, he must consider the potential of
partitioning and running multiple processes to
optimize on the ETL architecture. He must
grapple with issues relating to Data profiling,
monitoring of source systems, need for
continuous ETL and set profiling in order to
avoid the common pitfalls of an ETL project. He
must include error checks and decide how many
test cases he will build. He must have a bird’s
eye view of the technical risks and security.
Finally he must be conscious that the test
results must validate the effort to meet
business requirements. This stage of the
project plan would include documentation of the
test plan; the development of the test plan and
the performance test plan using pre defined or
customized templates.
The project manager at this stage of the
project will be defining the ETL object
Migration Plan and the Database object
migration plan. He will be completing the ETL
mapping specifications and defining the cleanup
process for log and bad files. The test plan
specifications drawn up will include plans for
unit, volume and performance tests.
The next stage in the ETL project plan would be
a Walkthrough ETL process. The project manager
would identify the common modules; evaluate the
efficiency of the ETL code, the business logic,
accuracy and standardization. The documentation
of the test results is a deliverable.
The project then moves rapidly on to user
validation tests wherein the user validates the
end product and certifies whether or not it
meets the business specifications set forth at
the start of the project. The business
requirement specification document forms the
basis of the assessment.
If all business needs have been met the project
manager begins the move from prototype to
production stage. Thereafter the ETL process
maintenance process needs to kick in. This
would include ETL process break definitions and
ETL process update specifications. The
deliverables would include Updated business
requirements specifications, updated mapping
specifications, revised mapping specification,
updated ETL object migration form, developer
check methodologies for final results in
production and monitoring of ETL process.