Introduction to ETL /Data warehouse Testing

Today let me introduce one of the much sought after skills in the testing industry. It’s ETL(Extract, Transform & Load) /Data warehouse testing.

I will be writing a series of posts which will give you an overall idea about the overall ETLtest process.

Before getting deep into the testing process, let’s understand the basic terminologies.

  • What is a Data warehouse?
  • Why is it required?
  • How is it different from a traditional database?

Consider ABC is a multinational insurance corporation with millions of customers worldwide. The business profits heavily depend on the understanding of the global data. A holistic view of data across the company can be used to identify key areas of improvement.

As a financial service provider, Company ABC needs clarity on the below posted questions in order to increase their revenue

  • How much risk involved with enrolling an applicant for coverage on a particular policy?
  • How much should each customer be charged? (Predictive modeling)
  • What type of policies is more popular?
  • What is the impact of newly introduced products on revenue?
  • What is the most effective distribution channel?

etl testing

In order to get clear insights on above, the company needs to collect its global data.

Suppose the data needs to be collected from three regions like Middle East, Japan & North America. Also in each of these regions data is getting stored in different formats like spreadsheets, flat files or database like SQL server /Oracle etc.

In order to analyze data and to derive business critical decisions,

  • First, we need to extract these heterogeneous data from different source locations
  • Second, transform data in some meaningful way by applying some business logic
  • Finally, load this transformed data into data warehouse
Data warehouse

The basic architecture of a data warehouse (Source:Wikipedia)

A data warehouse is a centralized repository that stores data from multiple information sources and transforms them into a common, multidimensional data model for efficient querying and analysis.

How data warehouse differ from databases?

Or, why can’t we just use databases instead of a data warehouse?

The basic differences between database and data warehouse show us a better picture why analytics requires a data warehouse.

The database which we are referring here is nothing but the operational database helps in the day-to-day processing of data.These systems are also called as OLTP (Online transaction processing) systems. They typically constrained to a single application.Or, in other words, the database is directly linked to the front end application.In the above example (ABC Insurance), the source systems at three different regions belong to OLTP category.

However, the data warehouse is of different type of databases called as OLAP(Online analytic processing). The data warehouse takes the data from all these source databases and creates a layer optimized for and dedicated to analytics.It accommodates data storage for any number of applications. Also, these are is separated from front end applications.Data is refreshed from source systems as needed (typically this refresh occurs every 24 hours). It serves historical trend analysis and business decisions.

An OLTP system is optimized for performing read-write operations of single point transactions.But, OLAP systems optimized for efficiently reading/retrieving large data sets and for aggregating data. A data warehouse is designed to handle large analytical queries.

One of the similarity is that both OLTP and OLAP systems store and manage data in the form of tables, columns, indexes, keys, and views. Both use SQL to query the data.

In order to avoid data redundancy, OLTP database structure features very complex tables and joins (Normalisation). However, In an OLAP database structure, data is organized specifically to facilitate reporting and analysis.The data is denormalized to enhance analytical query response times and provide ease of use for business users. Hence makes easier the reporting and analysis.