I wanted to create a single page to explain entire ETL life cycle process. This tutorial page is easy to understandable about the ETL data warehousing concepts, ETL concepts, and ETL process testing along with a simple/perfect example project.
Let’s take an example of a retail company which is running in two countries, it has the dedicated database for each country. Now the organization would like to see the comparison report of both country sales data.
You cannot be successful in decision making without analyzing the past performance and comparison information with other product data. For better outcome on analysis, it’s mandatory to perform on integrate with the history of data.
Step by step processes for integrating these two databases,
Data modeling ETL Tool Selection ETL Job development ETL Testing Job scheduling and Monitoring Reporting/Analysis
Considering an example for this end to end activity, below are the tables which need to be loaded into a data warehouse from both location databases,
Category Subcategory Billing Location Stock
Steps of data modeling best practices for data warehousing as below,
Understanding requirement – The requirement here is, the management would like to see the comparison report of sales data for all locations for a specific period.
Identifying the source – Identifying the right source based on requirement clarity we need to figure out the source database and table information.For integrating sales information the below source tables are identified.
– Category and Subcategory
Identifying the Facts and Dimensions – Based on above source tables, below table information are all immeasurable data. Hence these are all the identified dimensions. What does a dimension mean?
Here the quantity, price, and total from Stock table are quantifiable columns. Hence price and total are facts here and sales table is a fact table.
Deciding the relationship between Fact and Dimension tables – The relationship between entities/tables will be done using different data modeling concepts.Refer, a step by step detailed information with illustrations for data modeling types (Conceptual, Logical, Physical, and Dimensional).
Deciding the type of slowly changing dimension (SCD) – There are 3 major ways to handle the SCD table, based on need it will be selected. Here I am selecting Type2, where the new record will be inserted for every change and old one will be expired.Refer, the types of slowly changing dimension.
Deciding what type of schema going to be used – There are two types of a schema are there 1. Star and 2. Snow-flake. Here I am going with Snowflake schema.The schema type will be selected based on our business need, advantages, and disadvantages of Star schema and Snowflake schema.
ETL data flow or Architecture of data warehouse database – The data flow would be as below, each source system will have dedicated staging database, from staging tables data would be loaded into a data warehouse.Refer, Why staging database is required or use of staging database here?Data flow diagram
Data Retention Policy – Here we need to mention how long you need to retain this data in database level.I do not want to keep records older than 10 years.What is retention rule and why does it require?
Transformation Rules – I want to see the billing data in month level. Hence the transformation here is, roll up the aggregate or sum() bill data to month.What does transformation refer to ETL?What are the types of transformations?
Calculated Columns in fact table – Quantity and Total columns need to be calculated at month level.
ETL Tool Selection
One of the important stages in data warehouse database building, there are multiple tools with pros and concerns. Based on different factors the tool will be selected.
I would prefer, Informatica tool even the cost is high.
ETL Tools Job development
The job will be developed in ETL tool by the development team. The meanwhile development team will create mapping document for each dimension and fact table along with all transformations.
ETL Process Testing
With the help of mapping document or technical design document, test cases will be prepared along with SQL queries.
As an ETL tester, we need to ensure that the SQL queries are validated by reviewing and running the query.
The testing of ETL job data extraction, transformations and loading process will be tested in Integration and acceptance test environment.
Also, the performance of the ETL job will be evaluated, script or transformation commands will be modified in case of any performance issues.
ETL Tools Job schedule and Monitoring
The job will be scheduled in the production environment to run at regular intervals. The job status and data flow will be monitored periodically.
ETL BI Tools Reporting
The reporting tool will develop the report format with calculations in reporting tool. Testing team will validate the report result set against database data.