Businesses and their environment are producing increasingly fast and large amounts of data. The resulting data has enormous economic potential, if properly processed and analyzed. A data warehouse system provides the ideal basis for this. In this Best Practice Report, we show how and with what open source solutions a powerful data warehouse can be built.
When it comes to making better business decisions based on data holdings, data warehousing remains the defining concept of data management in many companies. However, the increasing flood of data makes the integration of heterogeneous data sources into a coordinated set of data a technically demanding task.
In addition, the creation of a data warehouse is often associated with high licensing and hardware costs. For companies that prefer to invest in software adaptations rather than software licenses, tried and tested open source solutions therefore represent a real alternative. In the area of data management, open source software solutions (OSS) have reached a high degree of maturity, but is open source also suitable for the development of a high-performance data warehouse?
This best practice shows how a high-performance data warehouse can be implemented with open source. To this end, a prototypical example of architecture is designed and built. Pentaho and Infobright solutions are used.
Data warehouse with a 5-layer architecture
A data storage system consists of five levels:
Data source, data collection, data storage, data analysis and data presentation. The data warehouses are automatically merged and prepare the data through extraction, transformation and loading processes. If deductions have to be made in the data warehouse for special applications or business areas, data marts can also be established. Data marts are usually constructed in a multidimensional way and can therefore be optimally used by analytical applications. This concept has not changed much to date: Data warehouses are the central database for all analyses and reports with which the company can be controlled. The data storage system of our prototype has the structure shown in the figure below.
The data from the AdventureWorks2008R2(AWR2) test is used as a database. The ETL process at the data acquisition level is done with Pentaho Data Integration. The tool is part of the open source Pentaho Business Analytics Suite, which is also used for data analysis and presentation in this example. The actual data warehouse within the data warehouse is mapped to the Infobright analytical database management system.
Pentaho Mondrian is used as the OLAP server. The required XML schema is generated with the Pentaho Schema Workbench. Once these components have been applied, data can be prepared at the data presentation level using various Pentaho tools in the form of analysis, dashboards and reports.
Database: Multinational Enterprise
The data in the backup file represents a multinational company specializing in the production and distribution of bicycles. The data model consists of more than 70 tables, divided into five divisions. In its complexity, the model corresponds to the database structures of real companies and is therefore very suitable for demonstration purposes. The Microsoft SQL server is not open source software. However, the backup can be converted into CVS files and then uploaded as a bulk import into any open source database.
This blog entry is an excerpt from the white paper “Building a Data Warehouse with Pentaho”.