The traditional data warehouse is designed to process structured data, which it does efficiently and with high performance. Big data, however, largely consists of unstructured data and data streams that arrive in large volumes and at high velocity. For organizations that want to keep performance and efficiency while gaining flexibility, the Virtual Data Warehouse offers a promising alternative.
Within a traditional BI architecture, the data warehouse has so far played the central role: it provides a 360-degree view of the customer as well as consolidated data storage fed by various ETL processes. Since 2014, the Virtual Data Warehouse (VDW, also called the Logical Data Warehouse, LDW) has found more and more supporters and users.
The modern bimodal IT and BI architecture, described primarily by Gartner, enables both traditional BI deliverables such as reports and dashboards and the combination of BI applications with big data analytics, such as data mining, text mining, or machine learning.
As a result, departments within the company can integrate historical data from a Hadoop data lake or a warehouse, for example, with real-time data from sensors, cloud services, mobile devices, or manufacturing, and deliver timely, in-depth analyses. These analyses can be accelerated and individually tailored through machine learning.
Abstraction of data sources
The core of a virtual or logical data warehouse is the virtualization, or abstraction, of data sources on a corresponding server platform (usually in the cloud). Data source virtualization abstracts the many different, mostly distributed data sources behind an integrative layer of semantic and logical metadata.
From then on, all applications and services refer to this semantic layer, which is stored in a repository, as their common denominator. The data virtualization platform provides metadata and access to any data source for a variety of users, orchestrates these accesses, and at the same time optimizes the performance of queries and other operations on the virtualized data holdings.
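The idea of such a semantic layer can be illustrated with a minimal Python sketch. It is not any vendor's API; all class, source, and dataset names here are invented for illustration. Consumers ask for a logical dataset name, and the layer resolves it to a physical source behind the scenes:

```python
# Minimal sketch of a semantic layer: logical dataset names are mapped to
# physical sources, so consumers never reference a concrete system directly.
# All names (sources, datasets, classes) are illustrative, not from a product.

class SemanticLayer:
    def __init__(self):
        self._registry = {}  # logical name -> metadata and access routine

    def register(self, logical_name, source, fetch):
        """Store metadata and an access routine for a logical dataset."""
        self._registry[logical_name] = {"source": source, "fetch": fetch}

    def query(self, logical_name):
        """Consumers use the logical name; the layer resolves the source."""
        entry = self._registry[logical_name]
        return entry["fetch"]()

layer = SemanticLayer()
layer.register("customers", "crm_db", lambda: [{"id": 1, "name": "Acme"}])
layer.register("clickstream", "hadoop_lake", lambda: [{"id": 1, "page": "/"}])

print(layer.query("customers"))  # consumer never sees the physical source
```

Swapping the physical source of "customers" later only requires re-registering it; no consumer code changes, which is precisely the decoupling the article describes.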
The Logical or Virtual Data Warehouse serves self-service BI, the creation of data services that applications use to implement data-driven solutions, and also secure sandboxes such as those developers require. According to experts at Forrester, Gartner, and elsewhere, virtualizing data sources yields numerous economic advantages. It equally enables the agile development of applications for big data analytics and business analytics, and thus supports digital transformation with respect to AI and ML.
It should come as no surprise that every BI technology provider has its own variant of the virtual data warehouse. It is usually not called that at all, but marketed under entirely different labels. IBM calls the core technology “Federation”, Informatica focuses on the integration functions of its PowerCenter product, while Denodo Technologies in particular markets a clearly articulated VDW concept. All three variants nevertheless share several components: decoupling and abstraction, a semantic layer, data transfer, and security. A non-representative comparison sheds light on these aspects.
IBM InfoSphere Federation Server
The Federation Server has numerous features of a VDW. Built on the DB2 relational database, the server processes the SQL query language, so BI users can send distributed DB2 SQL scripts to source systems such as Microsoft SQL Server or Oracle. These pushdown scripts can contain selections, for example to compare selected time periods, as well as joins, which link tables and records, even individual fields, and execute as a single transaction. The joins are handled on the Federation Server by the DB2 database engine, whose query optimizer is critical to the performance of a SQL query against the source systems.
This makes the point of decoupling obvious: partial SQL statements sent to the respective data sources return only the necessary selected data, not complete tables. The closely related Big SQL product can be used to access Hadoop and similar systems, for example to query a data lake with SQL, which can also involve Spark SQL. No BI user needs to merge large data sources by hand.
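The pushdown principle can be simulated with two in-memory SQLite databases standing in for two source systems. This is a toy sketch, not IBM's federation technology; tables, columns, and values are invented. Each "source" receives only a partial SQL statement and returns just the selected rows, which the federation layer then joins locally:

```python
import sqlite3

# Two separate in-memory databases play the role of distributed sources.
src_a = sqlite3.connect(":memory:")  # stands in for e.g. SQL Server
src_a.execute("CREATE TABLE orders (id INTEGER, customer INTEGER, total REAL)")
src_a.executemany("INSERT INTO orders VALUES (?,?,?)",
                  [(1, 10, 99.0), (2, 11, 5.0), (3, 10, 42.0)])

src_b = sqlite3.connect(":memory:")  # stands in for e.g. Oracle
src_b.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
src_b.executemany("INSERT INTO customers VALUES (?,?)",
                  [(10, "Acme"), (11, "Globex")])

# Pushdown: only orders above a threshold ever leave the source system.
orders = src_a.execute(
    "SELECT customer, total FROM orders WHERE total > 40").fetchall()
customers = dict(src_b.execute("SELECT id, name FROM customers").fetchall())

# The federation layer joins the partial results locally.
joined = [(customers[c], t) for c, t in orders]
print(joined)  # [('Acme', 99.0), ('Acme', 42.0)]
```

Only two of the three order rows cross the "network", which is the economic point of sending selections to the sources rather than copying complete tables.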
At least in theory, because IBM expert Harald Gröger reports: “In a large German industrial company, a data lake based on Hadoop works with SAP data copies so that SAP data can be analyzed without putting additional load on the SAP systems.” The company accepts that this doubles the storage capacity, increases network load, and doubles the maintenance work. Other companies want to avoid exactly this with a VDW.
Big SQL enables SQL-based analysis of data in Hadoop, Hive, and Spark in a data lake. “For example, if you wanted to include real-time data from weather servers or social media, you could store it in JSON format on Hadoop and then make it accessible to Big SQL,” explains Gröger. Performance can be increased by caching or buffering in DB2. According to Gröger, keeping this buffer up to date is the critical point, for example with real-time data.
What greatly simplifies development work and BI queries are views. These views are based on the metadata managed in the DB2 catalog. If the underlying data changes, the data a view shows changes with it, so the views are always up to date. Every BI user can use their content as they wish, e.g. for dashboards. Views are defined by database administrators and cannot be changed by BI users. If additional applications need to be created, the IBM Information Services Director, a component of IBM InfoSphere Information Server, provides the appropriate APIs. It allows federation, ETL processes, and data quality cleansing to be encapsulated and exposed as web services.
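Why views are always up to date can be shown in a few lines with SQLite (the mechanism is the same in DB2 and other relational databases): a view is a stored query, not a copy of the data. Table and view names here are generic examples:

```python
import sqlite3

# A view is a stored query over metadata, so it reflects every change to the
# base table without any refresh step. Names are generic, not IBM-specific.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.execute("CREATE VIEW sales_eu AS SELECT * FROM sales WHERE region = 'EU'")

db.execute("INSERT INTO sales VALUES ('EU', 100.0)")
print(db.execute("SELECT COUNT(*) FROM sales_eu").fetchone())  # (1,)

# New base data appears in the view immediately.
db.execute("INSERT INTO sales VALUES ('EU', 50.0)")
print(db.execute("SELECT COUNT(*) FROM sales_eu").fetchone())  # (2,)
```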
Security and Privacy
Since May 2018, the EU General Data Protection Regulation (GDPR, in German: DSGVO) has made the technical ability to delete user data mandatory for all companies processing customer data. A VDW must therefore also be able to selectively delete or hide data. On the Federation Server this can be achieved through its security features, e.g. by limiting the visibility of (partial) data in tables through label-based access control. “Through user rights, access to data can be restricted at fine granularity, both on the Federation Server and in Big SQL for Hadoop data,” says Gröger.
Selective data deletion should not be a problem with SQL, but what about encryption of client data? Harald Gröger assures us that Hadoop Transparent Data Encryption is also compatible with Big SQL.
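Both GDPR-relevant operations, restricting what a BI user can see and selectively deleting one customer's data, are plain SQL. The following SQLite sketch uses an invented schema purely for illustration:

```python
import sqlite3

# Illustrative schema: a client table with a sensitive column (IBAN).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clients (id INTEGER, name TEXT, iban TEXT)")
db.executemany("INSERT INTO clients VALUES (?,?,?)",
               [(1, "Acme", "DE01"), (2, "Globex", "DE02")])

# Visibility restriction: the view hides the sensitive IBAN column, so users
# granted access only to the view never see it.
db.execute("CREATE VIEW clients_public AS SELECT id, name FROM clients")

# Selective deletion: remove one data subject's records on request.
db.execute("DELETE FROM clients WHERE id = ?", (2,))

print(db.execute("SELECT * FROM clients_public").fetchall())  # [(1, 'Acme')]
```

In a real deployment, the grants on such views (rather than on the base tables) are what enforce the fine-grained access Gröger describes.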
The Denodo platform
Software manufacturer Denodo Technologies was founded in 2002; version 6.0 of its VDW platform, current at the time of writing, was released in March 2016. The platform opens up a wide range of data sources and formats to BI and enterprise architects and supports numerous programming languages and programming interfaces (APIs).
Its internally supported extended relational data model is also designed to process non-relational data structures efficiently. Accordingly, the server's connectivity extends to big data sources such as Amazon Redshift, Cloudera Impala, and Apache Spark.
The platform supports complex data types such as XML, JSON, key value pairs, and even SAP BAPIs (Business Application Programming Interfaces), in the data model itself and in web service delivery. Denodo claims to offer the widest range of connectors and publishing methods on the market. In addition, Denodo 6.0 can also be used in Amazon AWS in the cloud, similar to IBM’s products.
The Dynamic Query Optimizer works on a cost basis, optimizing queries for both cost and performance. Using statistical methods, it calculates the most cost-effective and best-performing execution plan for each query, taking into account Big Data characteristics such as the number of processing units (processors) and storage partitions. While it can handle any number of concurrent queries, workload management can be further refined with a dedicated workload manager.
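The principle of cost-based plan selection can be sketched in a few lines. This is a deliberately toy cost model, not Denodo's actual optimizer; the statistics, plan shapes, and formula are invented to show only the mechanism of picking the cheapest estimated plan:

```python
# Toy cost-based plan selection: each candidate execution plan gets a cost
# estimate from simple statistics, and the cheapest plan wins.

def estimate_cost(plan, stats):
    # Invented model: cost grows with rows scanned and shrinks with available
    # partitions, mimicking how an optimizer weighs volume against parallelism.
    return stats[plan["source"]]["rows"] / plan["partitions"]

stats = {"warehouse": {"rows": 1_000_000}, "lake": {"rows": 50_000_000}}
plans = [
    {"source": "warehouse", "partitions": 4},   # cost 250_000
    {"source": "lake", "partitions": 64},       # cost 781_250
]

best = min(plans, key=lambda p: estimate_cost(p, stats))
print(best["source"])  # warehouse
```

A real optimizer uses far richer statistics (cardinalities, histograms, transfer costs between sources), but the selection step is the same: enumerate plans, estimate, take the minimum.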
A VDW created and operated with the platform helps data scientists and administrators generate logical and semantic business views, for example in the course of data discovery and data profiling. With regard to data protection, it also provides mechanisms for authorization and authentication, and compliance requirements are taken into account.
Informatica PowerCenter and the Smart Data Platform
Informatica provides more than 7,000 customers worldwide with a proven, robust integration platform: the Informatica Smart Data Platform. Using universal access technology, it can process virtually any type of data and tailor it to user needs with ETL processes. Native APIs ensure high-performance access, and the data volume can be scaled almost arbitrarily.
“The technology also provides functionality to virtualize data sources, but it can also work with external virtual data sources,” says Frank Waldenburger, Director of Sales Consulting, Informatica Central EMEA. “In the direction of the data consumers, there is also the so-called Data Integration Hub (DIH). This technology works on the publish-and-subscribe principle and can prepare and provide data as a kind of subscription for all connected consumers.” According to the manufacturer, this decoupling means the end of the point-to-point connections of traditional data integration, which is exactly the purpose of data virtualization.
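The publish-and-subscribe principle behind such a hub can be sketched minimally in Python. This is a generic pattern illustration, not Informatica's product; topic and dataset names are invented:

```python
# Minimal publish/subscribe sketch: producers publish datasets to a topic,
# consumers subscribe to it, and the hub decouples the two sides so there are
# no point-to-point links between individual systems.

class DataHub:
    def __init__(self):
        self._subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, dataset):
        # Every subscriber of the topic receives the prepared data.
        for cb in self._subscribers.get(topic, []):
            cb(dataset)

hub = DataHub()
received = []
hub.subscribe("sales", received.append)       # a consumer registers interest
hub.publish("sales", {"region": "EU", "total": 150.0})  # a producer publishes
print(received)  # [{'region': 'EU', 'total': 150.0}]
```

Adding a second consumer is another `subscribe` call; the producer is unchanged, which is the decoupling advantage over N-to-N point-to-point connections.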
Native APIs allow query and ETL functions to be executed directly in the Hadoop cluster, so that, if desired, only the results of these operations are delivered. “In general,” says Waldenburger, “there is always the option of customizing the generated SQL statements for extracting objects from the database, i.e. of optimizing them for performance.”
The Big Data Integration Hub variant supports Hadoop distributions such as Cloudera and Hortonworks. The hub abstracts the complexity of storing and managing raw and processed data in a Hadoop data lake. It indexes all data stored in Spark, Hadoop, or Hive, making the data accessible and searchable for analysis tools and other applications.
For use with large data volumes on Hadoop, Hive, Cloudera, and others, Informatica also provides the Enterprise Information Catalog (EIC). The catalog captures all types of data across the enterprise, supports semantic search, and reveals the origin of information and its relationships. Datasets can be enriched with business context and even tagged via crowdsourcing. The EIC is designed for big data deployments such as Hadoop clusters. According to the manufacturer, the parallel capture of metadata and fast distributed indexing keep catalog content up to date and increase search performance.
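A metadata catalog of this kind boils down to registering datasets with descriptive metadata and tags and then searching over them. The following is a toy sketch of that idea, with invented dataset names and fields; it does not reflect Informatica's actual data model:

```python
# Toy searchable metadata catalog: datasets are registered with technical
# metadata plus crowdsourced tags, then found via simple keyword search.

class Catalog:
    def __init__(self):
        self._entries = []

    def register(self, name, source, tags):
        self._entries.append({"name": name, "source": source,
                              "tags": {t.lower() for t in tags}})

    def search(self, keyword):
        kw = keyword.lower()
        # Match against the dataset name or any of its tags.
        return [e["name"] for e in self._entries
                if kw in e["name"].lower() or kw in e["tags"]]

cat = Catalog()
cat.register("orders_2023", "hadoop_lake", ["Sales", "Finance"])
cat.register("clickstream_raw", "hadoop_lake", ["Web"])
print(cat.search("sales"))  # ['orders_2023']
```

A production catalog adds distributed indexing, lineage graphs, and semantic matching, but the register-then-search core is the same.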
Extensive data transfers for analysis should thus be a thing of the past: ideally, data remains in the source systems, where appropriate scripts perform the necessary selections and transformations, rather than being dumped untreated into a data lake.
The growing number of customers using it, under one name or another, demonstrates that the Virtual Data Warehouse has a promising future as a platform for data integration, self-service BI, and modern BI applications; among them are Autodesk, Swiss Re, and Electronic Arts.
This small selection illustrates how diverse the feature sets (still) are, and which functions and performance characteristics a potential customer should pay attention to when evaluating such a solution. The focus must be on efficient, cost-effective, and lawful management of mass data.
Virtual Data Warehouse FAQs
What is a virtual warehouse?
A virtual warehouse is another term for a data warehouse. It collects and displays business data relating to a specific moment in time, creating a snapshot of the condition of the business at that moment. Virtual warehouses often collect data from a wide variety of sources.
What is virtual warehouse in Snowflake?
Snowflake and the Virtual Warehouse
Inside Snowflake, a virtual warehouse is a cluster of compute resources. It provides the resources, including CPU, memory, and temporary storage, needed to perform tasks such as executing SQL statements and DML operations.