Welcome, everyone. Today we are going to talk about Data Warehouse vs Data Lake vs Data Mart, their characteristics and benefits.
The consensus is clear: data is the oil of this age. But there are many ways to store and analyze information, and an organization that chooses poorly among the alternatives could face a very costly problem with no benefit for the business. The choice between data warehouse, data lake and data mart is one of the main ones, and the right answer depends on who uses the information and in what way.
In general terms, information arrives at these repositories from the systems that generate data (ERP or CRM, for example). It is then processed according to predefined rules and sent to a warehouse, a lake or another storage area. Once the information is centralized in a single source, either warehouse or lake, it is possible to run data analysis of all kinds to discover trends or insights that help in decision making.
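As a rough illustration of the flow just described, the Python sketch below routes records from source systems into a warehouse or a lake. The `Record` and `Repositories` types, the `route` function and the rule itself (structured payloads go to the warehouse, everything else to the lake) are assumptions made up for this example, not a standard design.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Record:
    source: str   # system that generated the data, e.g. "ERP" or "CRM"
    payload: Any  # a dict for structured data; text or bytes otherwise


@dataclass
class Repositories:
    warehouse: list = field(default_factory=list)
    lake: list = field(default_factory=list)


def route(record: Record, repos: Repositories) -> str:
    """Apply one hypothetical predefined rule and store the record."""
    if isinstance(record.payload, dict):
        # Structured data is ready to sit on the warehouse "shelves".
        repos.warehouse.append(record)
        return "warehouse"
    # Everything else is kept in the lake for possible later use.
    repos.lake.append(record)
    return "lake"


repos = Repositories()
route(Record("ERP", {"invoice": 1001, "amount": 250.0}), repos)
route(Record("CRM", "free-text note from a customer call"), repos)
print(len(repos.warehouse), len(repos.lake))
```

In a real deployment the rules would be far richer (validation, cleaning, schema mapping), but the shape is the same: one entry point, explicit rules, and a destination chosen per record.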
The lake: extensive and deep
A data lake is the place where all forms of data generated across the company are dumped. This includes structured data sources, conversation records, emails, images, audio and video. The protocols for collecting this information are usually very broad, resulting in a very large amount of accumulated information.
There are two key circumstances in which companies often need a data lake: when the organization performs so many functions and generates such varied data that there are many ways to cross-reference it and design analyses to find value; and when there is no specific plan to leverage the data yet, but its high potential value is known and the company intends to use it in the future.
The first approach is comparable to a fully functioning gold mine, the second to someone who is sitting on a gold mine and knows it, but has not yet begun to mine it.
It all sounds nice on paper, but data lakes handle an overwhelming amount of information. The volumes are so high that traditional databases can take days to execute a single request, so specialized hardware and heavy storage investments are inseparable from data lakes.
The warehouse: structured and efficient
Unlike the multiple streams of information and the dark depths of unstructured data present in a data lake, a data warehouse keeps its shelves clean and its data sorted so that value can be extracted in much less time. But it does not hold just any data: data warehouses usually store only information that has already been structured.
This storage strategy nevertheless makes it possible to serve a wider variety of users in a less complex way than a data lake. Rather than dealing with the requirements of each business unit separately, the warehouse analyzes all the types of data that are useful to its users, structures them and makes them easy to access and work with. The Finance department, for example, may require only revenue, cost and profit data to model its decisions. With a data warehouse you will not have to deal with information that is not useful to you, and if you need extra data, it is enough to add it to the warehouse.
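The Finance example can be sketched with Python's built-in `sqlite3` module standing in for a warehouse. The `sales` table, its columns and the sample figures are invented for illustration; a real warehouse would run on a dedicated platform with a much larger schema.

```python
import sqlite3


def build_warehouse() -> sqlite3.Connection:
    """Create an in-memory warehouse with one structured table (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (region TEXT, revenue REAL, cost REAL, campaign TEXT)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        [("north", 120.0, 80.0, "spring"), ("south", 200.0, 150.0, "summer")],
    )
    return conn


def finance_report(conn: sqlite3.Connection) -> list:
    """Return only what Finance needs: revenue, cost and profit per region."""
    query = (
        "SELECT region, revenue, cost, revenue - cost AS profit "
        "FROM sales ORDER BY region"
    )
    return conn.execute(query).fetchall()


for row in finance_report(build_warehouse()):
    print(row)
```

Because the data is already structured, Finance can ignore the `campaign` column entirely and get its answer with a single query.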
The mart: local and specialized
This third strategy could be considered a subsection of the data warehouse. Data marts are designed specifically for a particular business function, or for a specific departmental need.
Unlike a warehouse and a lake, where information is stored in a single, centralized repository, each data mart has its own separate, decentralized store of data. This arrangement allows a higher level of security for the organization in general, since the unit served by the data mart only has access to the data previously loaded into it, without visibility into the rest of the company. The same applies to efficiency: workloads in an isolated environment do not compromise analysis operations in other sectors or departments.
Data marts can be dependent (created from information that was previously stored in the warehouse); independent (built without ever coming into contact with warehouse data); or hybrid (integrating data from a warehouse with data unique to the operating unit).
In general terms, the data mart is the “smallest” of the three approaches, and is usually oriented towards short-term projects.
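To make the three kinds of data mart concrete, here is a minimal Python sketch. The names `MartKind`, `classify_mart` and `build_mart`, and the dictionary-based rows, are hypothetical and chosen only for this example.

```python
from enum import Enum


class MartKind(Enum):
    DEPENDENT = "dependent"      # fed only from the warehouse
    INDEPENDENT = "independent"  # fed only from the unit's own sources
    HYBRID = "hybrid"            # fed from both


def classify_mart(uses_warehouse: bool, uses_own_sources: bool) -> MartKind:
    """Name the kind of data mart from where its data comes from."""
    if uses_warehouse and uses_own_sources:
        return MartKind.HYBRID
    if uses_warehouse:
        return MartKind.DEPENDENT
    if uses_own_sources:
        return MartKind.INDEPENDENT
    raise ValueError("a data mart needs at least one data source")


def build_mart(department: str, warehouse_rows: list, own_rows: list = ()) -> list:
    """Copy only the department's warehouse rows, plus any data of its own."""
    from_warehouse = [r for r in warehouse_rows if r["department"] == department]
    return from_warehouse + list(own_rows)


warehouse_rows = [
    {"department": "finance", "metric": "revenue", "value": 320.0},
    {"department": "marketing", "metric": "leads", "value": 45},
]
mart = build_mart("finance", warehouse_rows)
print(classify_mart(True, False).value, len(mart))
```

The filter in `build_mart` is also what gives the isolation mentioned earlier: the finance mart never receives the marketing rows, so its users cannot see them.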
The right approach will therefore depend on the present and future needs of the business. Many organizations may never need a data mart to operate successfully, but several analysts recommend that the warehouse and the lake be deployed in parallel. Those large volumes of unstructured data may seem like junk today, but nothing says that they will not become the new source of business revenue, or the thing that saves an organization from falling into oblivion. However, because of the still prohibitive costs of data lakes, that is a luxury only some organizations can afford with confidence.