In times of Big Data, Business Analytics and Business Intelligence, data mining is becoming an increasingly important area of corporate IT. Data mining means “digging” through data to discover connections, that is, to gain new insights from it. The relevant information is stored in the data warehouse. With Big Data, Hadoop comes into play.
A data warehouse is a collection of related data that is consolidated and stored permanently. The data is held in structured form in denormalized relational databases.
The data warehouse is used to evaluate information from all departments and supports the decision-making process in an enterprise.
The data is provided independently of operational databases and application systems and is prepared for analytical purposes.
Data warehouse system
The operational systems feed the data warehouse system with information, which is collected in the staging area. ETL (extract, transform, load) processes clean, consolidate and aggregate the information and transfer it to the data warehouse. The data access tools can then work with the complete data warehouse or with individual views of it, which are represented by so-called data marts.
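The flow described above (extract from the source systems, clean, consolidate, aggregate, load, then expose a subset as a data mart) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source lists, their field names and the functions `clean`, `etl` and `data_mart` are invented for this example, and the "warehouse" is a plain in-memory dictionary rather than a real database.

```python
from collections import defaultdict

# Hypothetical raw records from two operational systems (names and fields are invented).
erp_orders = [
    {"customer": " Alice ", "region": "north", "amount": "120.50"},
    {"customer": "Bob", "region": "South", "amount": "80"},
    {"customer": "Bob", "region": "South", "amount": None},  # incomplete record
]
crm_orders = [
    {"customer": "alice", "region": "North", "amount": "30.00"},
]


def clean(record):
    """Normalize one raw record; return None if it cannot be used."""
    if record["amount"] is None:
        return None
    return {
        "customer": record["customer"].strip().title(),
        "region": record["region"].strip().title(),
        "amount": float(record["amount"]),
    }


def etl(*sources):
    """Extract from all sources, clean, consolidate and aggregate per customer."""
    warehouse = defaultdict(lambda: {"region": None, "total": 0.0, "orders": 0})
    for source in sources:                      # extract
        for raw in source:
            row = clean(raw)                    # transform: clean
            if row is None:
                continue
            entry = warehouse[row["customer"]]  # consolidate across sources
            entry["region"] = row["region"]
            entry["total"] += row["amount"]     # aggregate
            entry["orders"] += 1
    return dict(warehouse)                      # load (here: an in-memory dict)


def data_mart(warehouse, region):
    """A data mart as a topic-specific view: only customers of one region."""
    return {name: e for name, e in warehouse.items() if e["region"] == region}


warehouse = etl(erp_orders, crm_orders)
north_mart = data_mart(warehouse, "North")
print(warehouse)
print(north_mart)
```

In a real system the load step would write to a database and the data mart would typically be a view or a separate table; the dictionary filter here only mirrors the idea of a restricted subset.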
Data Warehouse and Data Marts
One analysis system is the so-called Enterprise Data Warehouse (EDW), which can also be described as a large analytical database holding business data drawn from different sources. The records in a data warehouse are not business transactions awaiting processing; they are static and of high long-term value.
On the source side, the data warehouse environment consists of the ERP (Enterprise Resource Planning) system, the CRM (Customer Relationship Management) system, various legacy systems, and third-party applications. On the usage side, there are reports, OLAP (Online Analytical Processing), ad hoc queries and modeling. OLAP is the evaluation of stored data for retrospective purposes (statistics) or prospective purposes (decision making, management data).
The data sources for the analysis are also known as operational systems. Instead of a single data warehouse, the data store can also be a collection of connected data marts, which are short-term, topic-specific data warehouses restricted to individual divisions, also called subsets. Business Analytics and Business Intelligence tools are also used to evaluate the data. Ad hoc queries, dashboards and data mining support management decisions.
Big Data refers to the processing of very large, complex and fast-moving data volumes that are partly semi-structured or unstructured. The collected data can come from a wide variety of sources, is mostly stored in raw form, and is used for visualization, analysis, data mining or machine learning.
The move to Big Data: Hadoop
Large companies sometimes invest millions in hardware platforms, databases, ETL (extract, transform and load) software, business intelligence dashboards, analysis tools, maintenance contracts, upgrades, middleware and storage systems with professional data storage environments.
Traditional data analysis makes it possible, for example, to better understand customer buying behavior, optimize sales processes and pricing, and strengthen one’s brand. This benefit is rather small in relation to the sums spent; analysis with Big Data tools can achieve much more. Tools such as Hadoop can be used to support business decisions. Hadoop is open source software for processing large data sets in parallel across multiple servers.
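Hadoop's classic programming model for this kind of parallel processing is MapReduce: each server works on its own partition of the data (map), and the partial results are then merged (reduce). The following sketch does not use Hadoop itself; it only imitates the two phases locally with Python threads. The log lines, the partitioning and the function names are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical web-log lines, split into partitions as if they lived on different servers.
partitions = [
    ["GET /home", "GET /cart", "GET /home"],
    ["GET /home", "POST /checkout"],
    ["GET /cart", "GET /home"],
]


def map_partition(lines):
    """Map step (with local pre-aggregation): count requested paths in one partition."""
    return Counter(line.split()[1] for line in lines)


def reduce_counts(partial_counts):
    """Reduce step: merge the partial counts from all workers."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Each partition is handled by its own worker, standing in for one server.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(map_partition, partitions))

page_hits = reduce_counts(partial_counts)
print(page_hits.most_common())
```

The point of the split is that the map step needs no communication between workers, so adding servers lets the same job handle more data.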
Hadoop sits in a Big Data environment with, for example, web logs, images and videos, social media, documents, and PDFs on the input side. On the output side are the Hadoop Distributed File System (HDFS), operational systems, data warehouses and data marts, and operational data stores (ODS), which are mostly heterogeneous and are often consolidated for reporting and business-critical decisions.
Data is stored in Hadoop and then loaded into a data warehouse or data mart for further analysis. The data is initially heterogeneous, partly semi-structured and partly unstructured. Production-oriented Big Data applications require fast and cost-effective methods to cope with the flood of data.
Hadoop can quickly capture, process, and store data. It also has a good price/performance ratio. Therefore, it is often used as a substitute for a data warehouse.
Data Mining – What is Data Mining?
Data Mining is the application of statistical methods to particularly large and complex data sets with the aim of identifying new patterns.
For example, data mining can be used to identify and evaluate the purchasing behavior of certain customer groups. A well-known example illustrates the purpose of data mining:
Data mining has significantly improved cross-selling. In retail, for example, it was found that young parents often buy beer when buying diapers. Retailers then analyzed the combination of diapers and beer and came to the following result.
The analysis suggested that young parents are under particular stress and therefore like to buy a crate of beer along with the diapers so that they can enjoy a bottle in the evening.
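A pattern such as "diapers and beer" is typically found with market basket analysis, which scores a rule like diapers -> beer by its support, confidence and lift. The sketch below computes these three measures on a handful of invented baskets; the data and the function names are assumptions for illustration and do not reproduce the retailers' actual analysis.

```python
# Hypothetical shopping baskets; the items and counts are invented for illustration.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "chips"},
    {"bread", "milk"},
    {"diapers", "milk"},
]


def support(itemset, baskets):
    """Share of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    both = support(antecedent | consequent, baskets)
    confidence = both / support(antecedent, baskets)
    lift = confidence / support(consequent, baskets)
    return both, confidence, lift


sup, conf, lift = rule_metrics({"diapers"}, {"beer"}, baskets)
print(f"support={sup:.3f} confidence={conf:.3f} lift={lift:.3f}")
```

A lift above 1 means that beer appears in diaper baskets more often than it does in baskets overall. Whether that is worth acting on is a business judgment that the numbers alone do not make.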
What is possible with data mining and what is not?
Data mining is a powerful technique that can uncover patterns and relationships within data. But data mining does not work by itself. It does not relieve companies of the need to understand and interpret their data correctly. Data mining can uncover information hidden in data, but it cannot tell you how valuable this information is.
To ensure meaningful results from data mining, a company must understand its data. Data mining algorithms are often very sensitive to certain properties of the data, such as:
- Outliers (data values that differ greatly from the typical values in the database)
- Irrelevant columns
- Columns that vary together (for example, age and date of birth)
- Data coding, and the choice of which data to include or exclude
Depending on the selected algorithm, data mining can perform a large part of the data preparation automatically. However, some of the methods are very specific and not suitable for all purposes. In any case, a company needs to understand its data in order to build models and interpret the results correctly.
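Two of the data properties listed above, outliers and columns that vary together, can be checked with simple statistics before a model is built. The following sketch is one possible approach, not a prescribed method: it flags outliers with a median-based modified z-score and finds redundant column pairs with the Pearson correlation. The table, the thresholds (3.5 and 0.95) and the function names are assumptions chosen for illustration.

```python
import math
import statistics
from itertools import combinations

# Hypothetical customer table, stored column-wise; the values are invented.
columns = {
    "age":        [34, 29, 41, 38, 31, 36, 120],  # 120 looks like a data-entry error
    "birth_year": [1990, 1995, 1983, 1986, 1993, 1988, 1904],
    "purchases":  [3, 5, 2, 4, 6, 3, 4],
}


def find_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long, non-constant columns."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, limit=0.95):
    """Column pairs that vary together so strongly that one of them is redundant."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > limit
    ]


outliers = find_outliers(columns["age"])
redundant = correlated_pairs(columns)
print(outliers)
print(redundant)
```

Whether a flagged value is an error or a genuine extreme, and which of two correlated columns to keep, still has to be decided by someone who knows the data.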
On the following pages we explain how the above examples work and present other data mining methods and their results.
Production and Planning: Product Data Integration
Especially in the manufacturing sector, a lot of data is generated in the form of sensor or actuator data. Its use opens up possibilities such as predictive maintenance. On a business level, it is possible to react to production data in real time. If, for example, data on quality, production quantities, deviations and faults is available, management can react with rescheduling, changes in logistics processes or customer discount offers. In the case of poor quality, for example, rework may be necessary.
The coupling of production data and operational data is called product data integration. Before the production data is passed on to plant management, it must of course be prepared and filtered.
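What "prepared and filtered" can mean in practice is sketched below: unusable sensor readings are discarded, the rest are averaged per machine, and machines above a tolerance are flagged for management. The machine names, the readings, the limit of 85.0 and the function `summarize` are all hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical sensor readings: (machine id, temperature in °C); None marks a failed reading.
readings = [
    ("press_1", 71.2), ("press_1", 70.8), ("press_1", None), ("press_1", 72.0),
    ("press_2", 88.5), ("press_2", 91.0), ("press_2", 90.5),
]

TEMPERATURE_LIMIT = 85.0  # assumed tolerance, not taken from the text


def summarize(readings, limit):
    """Drop failed readings, average per machine and flag machines above limit."""
    per_machine = defaultdict(list)
    for machine, value in readings:
        if value is not None:  # filter: discard unusable readings
            per_machine[machine].append(value)
    return {
        machine: {"mean": round(fmean(values), 1), "alert": fmean(values) > limit}
        for machine, values in per_machine.items()
    }


report = summarize(readings, TEMPERATURE_LIMIT)
print(report)
```

The condensed report, rather than the raw sensor stream, is what would be passed on to plant management in this sketch.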
Big Data, data warehouse and data mining technologies require specially trained and experienced specialists. They prepare and evaluate the “unearthed” data and make it available to management for decision making.