Rapidly increasing amounts of data and new requirements for data processing push conventional relational databases to their limits.
One of the biggest challenges currently facing information managers in companies and institutions is coping with the rapid growth of data from a wide variety of information sources. This is why one buzzword has come to dominate the field of information management: “Big Data”. The term refers to very large volumes of data, arising from corporate processes or collected by institutions, that conventional data management systems can handle only with difficulty, if at all.
More and more companies already have to efficiently process, store, protect, and make available large amounts of data, such as log files, transaction data, or production data in the multi-terabyte range, for various analyses and strategic scenarios. In addition, there are legal requirements for the long-term archiving of business-relevant information in digital form.
Data volumes are also growing significantly in research. Global research institutions depend on shared data sets, containing many billions of individual records, to meet the growing demand for information. The volume of global data is doubling every two years.
The targeted use of information has long been a success factor in daily work. It serves companies in the competition for customers and better services, but it also improves our quality of life, for example through the evaluation of global climate or genetic data, or the detection of fraud on the Internet and in the stock markets. However, the growing volume of data places ever higher demands on data processing, whether to deliver the necessary information or to uncover relevant correlations through analysis.
For companies, tackling Big Data one-sidedly, by constantly upgrading hardware and resources, means not only exponentially rising costs but also the loss of strategically important information, because data is deleted or offloaded prematurely to contain those costs and is then no longer available.
To be able to use Big Data in a targeted manner in the future, new strategies for information management must be found quickly, because traditional relational databases do not respond adequately to these challenges.
Relational Databases and Big Data Management
Relational database management systems (RDBMS) are still the norm in companies today. A relational database is, in simple terms, a collection of tables in which individual pieces of information are stored as records, or rows of data.
To avoid scanning every row of a table on each query, and to obtain acceptable response times, selected column values are enriched with an index. These indexes require additional storage space and cause processing and maintenance overhead. For this reason, normally only some of the values in an RDBMS are indexed; the unindexed ones, however, suffer from lower query performance.
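This indexing trade-off can be observed directly with Python's built-in sqlite3 module. The sketch below is purely illustrative (table name, index name, and data are invented): without an index, the planner must scan the whole table; with one, the same lookup becomes an index search, at the price of extra storage and write-time maintenance.

```python
import sqlite3

# In-memory toy table; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, f"cust{i % 100}", i * 1.5) for i in range(1000)],
)

# Without an index, the planner reports a full table scan.
plan_scan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'cust7'"
).fetchall()
print(plan_scan[0][3])

# After indexing the filtered column, the same query uses the index.
conn.execute("CREATE INDEX idx_customer ON orders(customer)")
plan_index = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'cust7'"
).fetchall()
print(plan_index[0][3])
```

The plan text is version-dependent, but it switches from a scan of `orders` to a search using `idx_customer`.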
By design and architecture, relational databases are efficient when used for frequent transactions at the record level or for scenarios with small to medium data volumes. However, relational database systems are insufficient for processing and analyzing data volumes in the multi-terabyte range.
When processing and analyzing Big Data, it is therefore necessary to rethink the data management system. A conventional relational database quickly reaches its limits as data volumes grow and causes ever higher costs.
Advantages of column-oriented databases in Big Data
The rows and columns of a table span a two-dimensional space that must be available for database access and searches. This applies, initially, to all databases. Row-oriented systems store data line by line, like a book: a lot of text may have to be read before the right information is found.
The term “column-oriented database” refers, first of all, to how the data is stored. To retrieve information, column-oriented databases do not need to read irrelevant data from entire rows; they look up the required values directly in the columns concerned.
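The difference between the two layouts can be sketched in a few lines of Python. This is a toy model, not any real engine: the row store is a list of records, the column store a dict of lists, and an aggregate over one attribute touches only one list in the columnar layout.

```python
# Toy row layout: each record is stored together.
rows = [
    {"id": 1, "region": "EU", "amount": 100.0},
    {"id": 2, "region": "US", "amount": 250.0},
    {"id": 3, "region": "EU", "amount": 75.0},
]

# Toy column layout: one contiguous list per attribute.
columns = {
    "id":     [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [100.0, 250.0, 75.0],
}

# Summing one attribute in the row layout walks every full record ...
total_rows = sum(r["amount"] for r in rows)

# ... while the column layout reads only the single list it needs.
total_cols = sum(columns["amount"])

print(total_rows, total_cols)  # → 425.0 425.0
```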
This structure also matches the SQL language, which already addresses columns explicitly in its SELECT and WHERE clauses. A column-oriented database architecture should therefore be seen rather as a collection of indexes that dispenses with unnecessary row-wise storage and row-wise access, significantly reducing the storage and access costs for writing and reading.
This alone allows column-oriented databases to achieve higher query speeds than row-based databases. Performance differences of a factor of 50 or more are possible.
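How SELECT and WHERE map onto a columnar layout can be illustrated with a small Python sketch (a hypothetical dict-of-lists “column store”; names and data are invented): the filter scans only the filtered column, the projection fetches only the projected one, and other columns are never touched.

```python
# Toy columnar table; the "note" column is never read by this query.
columns = {
    "region": ["EU", "US", "EU", "US", "EU"],
    "amount": [100.0, 250.0, 75.0, 40.0, 10.0],
    "note":   ["a", "b", "c", "d", "e"],
}

# WHERE region = 'EU': scan one column, collect matching row positions.
matches = [i for i, region in enumerate(columns["region"]) if region == "EU"]

# SELECT amount: fetch only the projected column at those positions.
result = [columns["amount"][i] for i in matches]
print(result)  # → [100.0, 75.0, 10.0]
```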
As a result, additional indexing is either unnecessary in current column-based databases or not offered at all, which saves considerable storage space and the administrative effort of index maintenance.
However, this alone is not enough to meet the challenges of Big Data scenarios; additional technologies for data management and scaling are indispensable. Technologies such as MapReduce, parallel processing, automatic compression, and partitioning make modern column-oriented databases ideal for use in Big Data environments.
The MapReduce function
Column-oriented database systems of the new generation have an integrated MapReduce function. It acts as a kind of internal multi-level index that allows queries to process only the relevant parts of the columns, guaranteeing high performance and scalability at all times, especially for large data volumes.
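The MapReduce pattern itself fits in a few lines of Python. This is a generic sketch of the map–shuffle–reduce phases over a hypothetical column of call durations, not the internal mechanism of any particular product.

```python
from collections import defaultdict

# Hypothetical input: (customer, call_seconds) pairs from one column pair.
records = [("alice", 120), ("bob", 30), ("alice", 45), ("carol", 300), ("bob", 15)]

# Map phase: emit (key, value) pairs.
mapped = [(customer, seconds) for customer, seconds in records]

# Shuffle phase: group all values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each group independently — in a real system,
# each group could be reduced in parallel on a different node.
totals = {key: sum(values) for key, values in groups.items()}
print(totals)  # → {'alice': 165, 'bob': 45, 'carol': 300}
```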
Horizontal scalability and parallel processing
A modern, column-based database can scale horizontally across multiple standard servers, distributing functions and load to additional machines to achieve a linear increase in overall performance. Optimal use of the available system resources, even for individual queries and load processes, is necessary to achieve good results in a Big Data environment.
Adding low-cost hardware to an existing configuration allows the database to increase processing power in a linear fashion, eliminating the need to purchase increasingly powerful and expensive high-end servers. This facilitates flexible adaptation to data volumes and requirements and enables significant performance increases.
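The scale-out idea can be sketched as hash partitioning plus parallel local aggregation. In this toy model the “nodes” are plain Python lists and the parallelism is a thread pool; a real cluster would route shards to separate servers, but the shape of the computation is the same: partition, aggregate locally, merge cheaply.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative setup: four "nodes" and one numeric column of 1000 values.
NUM_NODES = 4
amounts = list(range(1, 1001))  # 1 .. 1000

# Partition phase: route each value to a node by its row position.
shards = [[] for _ in range(NUM_NODES)]
for row_id, value in enumerate(amounts):
    shards[row_id % NUM_NODES].append(value)

# Parallel local aggregation on every node, then a cheap global merge.
with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
    partial_sums = list(pool.map(sum, shards))

total = sum(partial_sums)
print(total)  # → 500500
```

Adding a node here only means adding one more shard and one more worker, which is the sense in which the approach scales linearly.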
Another advantage of column-oriented databases is compression. Since similar or identical values are already merged logically and physically into columns, a column-oriented database typically operates with comparatively low storage requirements. Most systems also offer additional functions for compressing the column values. As a result, a column-oriented database can sometimes require considerably less than 50 percent of the storage space of the raw data from which it was loaded. This can be a decisive advantage, especially in view of the Big Data challenge.
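Why columns compress so well is easy to see with run-length encoding, one of the simplest columnar compression schemes: because identical values sit next to each other in a column, long runs collapse into (value, count) pairs. The column below is invented for illustration.

```python
from itertools import groupby

# A low-cardinality column as it might sit in a column store.
region_column = ["EU"] * 6 + ["US"] * 3 + ["EU"] * 1

# Run-length encode: each run of equal values becomes (value, count).
encoded = [(value, len(list(run))) for value, run in groupby(region_column)]
print(encoded)  # → [('EU', 6), ('US', 3), ('EU', 1)]

# Decoding restores the original column exactly — the compression is lossless.
decoded = [value for value, count in encoded for _ in range(count)]
assert decoded == region_column
```

In a row store the same values would be interleaved with other attributes, leaving far shorter runs to exploit.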
Scenarios for the use of column-oriented database technology
Every day, telecommunications and Internet providers process billions of pieces of detailed information, such as call detail records (CDRs) or customer e-mail usage logs, for billing purposes or to prevent abuse and fraud. This data is retained for a defined period of time and then deleted. A column-based database is well suited not only to real-time analyses but also to the immediate deletion of data after a defined period, without negatively affecting ongoing operations.
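One common way such retention is implemented, sketched here with invented names and dates, is partitioning by day: expiring old data then means dropping whole partitions rather than deleting row by row, which is why it barely disturbs running operations.

```python
from datetime import date, timedelta

# Illustrative per-day partitions of CDR-like records.
RETENTION_DAYS = 90
today = date(2024, 6, 1)

partitions = {
    date(2024, 2, 1):  ["cdr-a", "cdr-b"],   # older than the retention window
    date(2024, 5, 20): ["cdr-c"],
    date(2024, 5, 31): ["cdr-d", "cdr-e"],
}

# Expire: drop every partition older than the cutoff in one step each,
# with no row-level delete work at all.
cutoff = today - timedelta(days=RETENTION_DAYS)
expired = [day for day in partitions if day < cutoff]
for day in expired:
    del partitions[day]

print(sorted(partitions))  # → [datetime.date(2024, 5, 20), datetime.date(2024, 5, 31)]
```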
Researchers around the world access many billions of measurements of climate change in the air, soil, and sea to gain new insights into the development and improvement of our environment. The requirements placed on the analyses of a central column-based database are complex and divergent. A column-based database can support all of these analyses in real time and can adapt flexibly to constantly growing research data and usage scenarios, keeping maintenance effort to a minimum while keeping costs low and predictable.
Using the full potential of a column-oriented database
With today’s increasingly complex analytical requirements, growing amounts of data, and a growing number of application scenarios requiring access to information, businesses and organizations need scalable, flexible, and efficient data management solutions.
A modern column-oriented database can handle the most demanding tasks in business intelligence, data warehousing, and analytical environments, because the column-oriented architecture is currently the technology best suited to keeping pace with data growth without sacrificing flexibility in information retrieval. At the same time, it helps keep costs down. Some column-oriented databases are also offered as community editions, giving users with smaller data volumes or requirements free entry, or enabling cost-effective mixed operation alongside enterprise systems and RDBMS.