ETL is a set of Extraction, Transformation and Loading process, which is performed, in the data staging environment. It typically converts the irrelevant data extracted from several operational systems, to the relevant/useful information. This information thus be stored into the data warehouse environment. The warehouse cannot be implemented efficiently, if the data is not collected, cleansed and integrated in an appropriate way. Hence, the ETL process must be efficiently designed and implemented, for an optimized development of data warehouse.
The major steps involved in the ETL process are,
1.Identifying the data tables, that need to be loaded in the data warehouse.
2. Identifying operational systems from which, the data needs to be extracted.
3. Mapping the raw data from the operational systems to the processing data stored in the data warehouse.
4.Identifying rules/standards for extracting data.
5.Identifying rules for cleansing and transforming the data.
6.Designing the aggregate table.
7.Storing the data tables in the data staging environment.
8. Creating procedures/approaches for loading data.
9. Performing ETL for dimension tables.
10. Performing ETL for fact tables.
Data Extraction: The function of the data extraction is to find and extract the data from various relational and non-relational data sources for future use. Data extraction process includes determining the operational systems from which the data can be collected. Capture the appropriate data from data sources to make required data analysis. Identify the extraction frequency of data i.e., daily, weekly or monthly to keep the data warehouse up to date.
Extraction can be carried in two ways. They are,
(i) Full Extraction: Extract the whole data from the data sources systems every time, then compare the extracted data with the previous data to find the changes. This extraction process does not require the tracking of changes at source systems from the last extraction point This process does not require additional information at source systems.
(ii) Incremental Extraction: In this process, the data is extracted beginning at the last successful extraction from the data source. This method requires an additional mechanism like time scheduler or some triggering events. This process can also be referred as change data capture.
Data Transformation: Data transformation is a process in which the data collected from many dissimilar source systems are transformed in accordance to a standardized format. The main purpose of performing a data transformation is to enhance the quality of the data before providing it as input in the data warehouse. Before this transformation process, reliability, consistency and validity of the data are checked.
Data transformation is carried out by following the below steps.
Data elements should be identified that are to be moved to data warehouse.
Calculated and derived data values should be identified.
Query performance is improved by performing aggregation of data depending upon the granularity of data that is decided while design phase of data warehouse.
Next step is to perform data cleaning.
In this step, define the rules for transformation, data mapping criteria and meta data updation.
In this step, data should be restructured.
Data fields from various sources should be combined into one entity.
For the captured data, data values, data types and field lengths should be standardized.
Data integrity should be maintained within the data warehouse.
Change the data according to the structure and format of data warehouse.
Analysis of data warehouse can be made easy by simplifying and re-arranging the individual fields.
Some of the transformation processes are, aggregation, normalization, union, table load up, join, routing of data, sequence generators, enforcing keys etc.
Data Loading: Data loading is a process of loading the extracted and transformed data, into the data warehouse repository. Typically, it is described by using the following three phrases.
(i) Initial Load: This phrase, specifies the initial loading of all the tables in the data warehouse repository i.e., the tables loaded for the first time in the warehouse.
Initial load moves large data within 8-10 hours. The time required for moving the data depends on the amount of history data needed for effective analysis.
(ii) Incremental Load: This phrase describes the frequent or incremental data changes made in the warehouse. Incremental loading can be performed based on two ways. time stamp some FLAG column-pick up those rows which are marked 'Y'.
(iii) Full Refresh: This phrase, describes the elimination of contents of at least one table and the refreshment of table with new data.
The data loading usually, requires the data warehouse to be kept off-line such that, no users can access the warehouse at the time of loading. Thus, a specified time must be assigned for loading the data. This can be done by fragmenting the data loading process into smaller number of portions such that, only few tables can be loaded concurrently. Such fragmentation has the following advantages:
(a) It enables the implementation of smaller loads concurrently.
(b) It allows certain portions of data warehouse to be kept on-line, during the processing of data loads in the other portions.
Once the loading process has been completed, the data loads need to be tested, so as to ensure, whether these loads have been correctly done in the data warehouse repository. In addition to this, an agenda/plan is designed to test the quality of the loaded records and the data that could not be loaded at the time of the loading process, can be managed by the same procedures, which can be useful for data loading.This approach is rarely used only after loading trails which are carried out during testing and starting phases of the data warehouse.
There are many issues regarding data warehouse architecture that occur at every stage of data warehouse implementation i.e., from designing of a data warehouse to its implementation.
Issues in Designing a Data Warehouse
1.A business analysis framework is to be constructed for which very good understanding and analysis of business needs isrequired. The design of data warehouse is done with respect to four different views, the top-down, the data source, the data warehouse and the business query view.
2. Implementation and maintenance of a data warehouse is a complicated task as it requires business, technological and program management skills.
Business skills are necessary in order to understand how data is stored and managed by the systems, how migration software can be built so as to transfer data from databases and legacy systems to data warehouse, how to build customized software meant for keeping the data warehouse updated. Making use of data warehouse requires understanding the importance of the data stored and also converting business requirements into database queries which can be run on a data warehouse.
Technology skills require the data analyst to understand the analysis of statistical information and history in order to deduce conclusions and facts and take decisions to improve business processes.
Program management skills include technology integration, communication with vendors and end users so that business is time specific and profitable.
3. Design of a data warehouse architecture can be done from either of two approaches or combination of both. The two approaches used are top-down and bottom-up approach. There are a set of pros and cons of each of these approaches. The top-down approach provides a systematic solution and reduces technology integration issues. However, it is expensive, consumes lot of time and doesn't offer flexibility.
The bottom-up approach provides flexibility, cost effectiveness and quick returns investment but also causes problems during integration of various data marts which are in different formats to implement a consistent enterprise of data warehouse. Issues in Selecting a Data Warehouse Architecture Model: There are three data warehouse models,
1. Enterprise warehouse: An enterprise warehouse consists of information which is well organized and summarized. The amount of data is usually in large volume and can range from few Gigabytes to many terabytes. The biggest issue with enterprise warehouse is that it involves extensive business modelling and consume a lot of time in its implementation.
2. Data Mart: A data mart is meant for few people belonging to a specific group. It is a subset of enterprise data which will be of interest to only a particular/selected group of employees. It is implemented using low-cost departmental servers and the time taken is also measured in weeks. But the problem arises when data marts of different types are to be integrated and they are not designed and planned with respect to an enterprise wide data.
3. Virtual Warehouse: A virtual warehouse does not involve implementation of a new data warehouse instead it is implemented with the help of a set of views over existing databases. It is easy to build a virtual warehouse but it increases the processing on existing database servers which in turn requires the database servers to be faster and capable enough to run the views.
It is clear from the above discussion that, construction of a data warehouse is complex and time consuming task, its goal should be specific and clearly defined, the first implementation of a data warehouse should have specific, achievable and measurable goals. These goals in essence are determination of budget and time allocation, part of the enterprise to be modelled, number of departments for which the data warehouse is implemented and the number of databases needed. A data warehouse implementation requires all the responsibilities of a database administrator like performance tuning, data updation, recovery management, security and access control, managing data backups, managing increase in data, etc. Scope management should also be considered wherein the size of the data warehouse, the number of resources, examples, expenditure, etc., should be managed.
Data Warehouse: Data warehouse is a form of storage system (database) where large volume of data is stored in such a way that retrieving desirable information from the system is very easy and reliable. Data warehouse is stored in different location so that it doesn't collide with transactional database system which stores day-to-day information and answers the queries that are prerecorded in the database. Data warehouse system on the other hand provides solutions to sophisticated queries, which involves many computations to be performed at finer-level of granularity.
W.H.Inmon defined data warehouse as a subject (but not application) oriented, consolidated, time-dependent and non erasable collection of data which is mainly supported by decision management. From this definition, the following characteristic features of data warehouse can be defined as,
1. Subject-oriented (Not application-oriented)
3. Time-dependent data
4. Non-erasable data.
Subject-Oriented: Data warehouse focuses mostly on important subjects as operational databases like producer, consumer, manufacturer i.e., data warehouse in not application-oriented. The major role of data warehouse is to support decision makers in making strategic decisions. It does this by performing data analysis and by applying data modelling tools. Data warehouse abstract unnecessary data which is not required in decision making and provide a precise view of data organized around a specific subject. Data warehouse doesn't maintain information about day-to-day transactions of organization but concentrate on the subjects which are critical to the organization.
Consolidated Data: Data warehouse is capable of retrieving appropriate data from heterogeneous databases (like relational databases, flat files) in order to make efficient strategic decisions. Because of the heterogeneity, data is stored in inconsistent manner. To confirm consistency and reliability in naming rules, encoding methods, different techniques such as data cleaning and data integration needs to be applied. Therefore, it is necessary to perform data transformation. data consolidation before transferring data from the operational system into data warehouse.
Time-dependent Data: Data warehouse database not only stores current information but also stores historic information about a particular transaction. In operational databases, though historic information is stored, it generates only preset information as these databases are capable of supporting only day-to-day transactions.
Data in data warehouse are archived as snapshots over historic and present time periods. In these databases, every data structures are time dependent i.e., they are directly or indirectly elements of time. This data warehouse approach is non trivial for design as well as for implementation stages of data warehouse.
The advantages due to time-dependent features in data warehouses are,
It enables analyzing historical data.
It associates information of past data with present data.
It provides better future prediction.
Non-erasable Data: This feature confirms that once data enters the data warehouse it remains static until a particular event is triggered. Data warehouse contains data which is extracted transformed integrated from operational database. The data is transferred from the operational system to data warehouse at regular intervals of time depending on the specification of the business. There are only two operations executed by data warehouse for accessing data.
(a) Data loading method
(b) Data access method.
Because of the above features, data warehouse can be considered as a consistent storage area that provides support for decision making and for analytical reporting.
Advantages of Data Warehouse
Data warehouse is capable of storing and consolidating past information.
It provides support for sophisticated multidimensional queries.
It increases the performance of integrated database system as data from heterogeneous sources are extracted, preprocessed, cleaned, transformed into one unified data store.
It doesn't use query driven approach (as it requires difficult filtering and integration techniques) instead uses update-driven approach where the data collected from various sources are consolidated and stored in warehouse for performing explicit data analysis.
It has the ability of supporting management decision activities like managing relationships with potential customers, managing the cost of important assets, evaluating operation and searching for sources using which profit can be attained.
It is capable of understanding the current business trends and making better forecasting decisions.