
Big Data – Learning Path for Beginners : Part 1

Big Data offers opportunities for businesses; that much we know. Companies have a great interest in being able to predict new economic and social trends in a timely manner, not only to seize opportunities but also to deal with risks. A great deal of information is required to carry out the necessary analyses. This is done by analyzing data from various sources, which can indicate the goods and services that customers will expect in the future and the market attitude (eager or reluctant to buy) that they will adopt in the near future.

The data are usually provided by the potential customers themselves. The interesting data to be analyzed are often found in social media networks, through requests for support in the company’s local ticketing system, emails or even digitized mail.

In addition to this data, current news reports, publications from market research institutes, but also live data from machine sensors, smart devices and IoT things are used. Don't forget to include the data you already have on file; it can be valuable for analysis.

In this article I will briefly describe the central challenges of Big Data scenarios and then offer a deep dive into the most important technologies and approaches to Big Data landscape architecture.

The challenges of Big Data

Basically, when designing a Big Data architecture, one must distinguish between five major challenges, which are described as the 5 V’s: Volume, Velocity, Variety, Veracity and Value.

For a detailed definition of these terms, see this article.

How do I store my data in the source systems?

Before asking how to access the data to be analyzed, an architect must ask how the data should be stored in the source systems, because the data must be accessible for analysis. The two most important criteria here are availability and performance. However, other requirements, such as the ability to audit the data, may also be important.

Data enrichment

By enriching the data in the source systems, we already try to increase the findability and value of the data at the source. A typical example is enriching the data with tags or tag clouds.
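
As a rough illustration of such enrichment, the following is a minimal sketch that derives simple tags from the most frequent words of a text. The stopword list and the example ticket text are purely illustrative assumptions, not part of the original article.

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "and", "or", "is", "to", "of", "in", "for"}

def extract_tags(text: str, max_tags: int = 5) -> list[str]:
    """Derive simple tags from the most frequent non-stopwords in a text."""
    words = re.findall(r"[a-zA-Z]{3,}", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(max_tags)]

ticket = "The delivery of my order is delayed and the delivery status page shows an error."
print(extract_tags(ticket))  # e.g. ['delivery', 'order', 'delayed', 'status', 'page']
```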

How do I get the data?

For example, publications from market research institutes or online news reports are retrieved via REST APIs or web crawlers. Social media data can be collected by web crawlers or by reading RSS feeds. Data from the enterprise support portal can often be extracted directly from the underlying SQL database. Companies often have a distributed corporate network, so data must be retrieved from different locations.
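
A minimal sketch of what such data acquisition could look like in Python, assuming a hypothetical JSON REST endpoint and a standard RSS 2.0 feed; the URLs and the "topic" parameter are placeholders, not real sources.

```python
import requests
import xml.etree.ElementTree as ET

# Hypothetical endpoints -- replace with the real sources you want to ingest.
NEWS_API = "https://example.com/api/v1/articles"
RSS_FEED = "https://example.com/blog/feed.xml"

def fetch_articles_via_rest(api_url: str) -> list[dict]:
    """Pull articles from a JSON REST API."""
    response = requests.get(api_url, params={"topic": "market-research"}, timeout=10)
    response.raise_for_status()
    return response.json()

def fetch_titles_via_rss(feed_url: str) -> list[str]:
    """Read item titles from a standard RSS 2.0 feed."""
    xml = requests.get(feed_url, timeout=10).text
    root = ET.fromstring(xml)
    return [item.findtext("title") for item in root.iter("item")]
```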

Emails often have to be extracted from databases, files or file systems. It is even more difficult with postal mail, as it must first be digitized; if the document must also be searchable, this is done by means of OCR. Other interesting text documents are, for example, records, SMS/MMS or text stored in blockchains or file systems. This text must also be analyzed.

But it doesn't stop there. Under certain circumstances, it can be interesting for a company to extract the speech in the audio track of videos as subtitles in text form, and then perform analysis on this data. This would make it possible, for example, to analyze how often a buzzword has recently been mentioned on a video portal.

With the consent of the other party, you could even extend this scenario to recorded calls in a call center or to voice search queries converted to text. As you can see, even I as a blogger cannot do without buzzwords, because I want to be found. So, classically, I would take advantage of SEO tools through an API and search engines like Google.

However, data also comes from barcode scanners, magnetic card readers or RFID receivers, for example in the warehouse or in time recording. Last but not least, there is the extraction of sensor data from devices in distributed networks or in the cloud. The most important buzzword in this area is Source Data Automation, which stands for digitizing and automatically processing data at the point where it is created.

It should never be necessary to manually re-enter data that has already been collected elsewhere. A typical example is the transcription of data that has already been printed on a sheet of paper. Analog data carriers, such as a sheet of paper, should always provide a quick and inexpensive way to scan the data again at another location, for example using a barcode.

How do I process the data before saving it?

In a Big Data scenario, a basic distinction is made between:

  • Inactive data (data at rest), which is stored in a persistent data store. This data is usually aggregated at regular intervals in batch processing, processed, and stored at the destination for analysis.
  • Moving data (data in motion), which is analyzed in a constant data flow, usually in real time. Of course, the data stream can also be persisted permanently in durable storage for historical analysis. In this context we often talk about stream ingestion or stream processing. A small sketch of both modes follows below.
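
The following is a minimal sketch contrasting the two modes: a batch aggregation over a complete set of readings versus record-by-record evaluation of an incoming stream. The function names, sample values and threshold are illustrative assumptions.

```python
from statistics import mean
from typing import Iterable, Iterator

def batch_aggregate(readings: list[float]) -> dict:
    """Data at rest: aggregate a complete batch, e.g. once per night."""
    return {"count": len(readings), "avg": mean(readings), "max": max(readings)}

def stream_alerts(readings: Iterable[float], threshold: float) -> Iterator[str]:
    """Data in motion: evaluate each record as soon as it arrives."""
    for value in readings:
        if value > threshold:
            yield f"ALERT: sensor value {value} above {threshold}"

print(batch_aggregate([20.1, 21.4, 35.0, 19.8]))
for alert in stream_alerts(iter([20.1, 21.4, 35.0, 19.8]), threshold=30.0):
    print(alert)
```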

How do I store data before analysis?

Firstly: not all data must necessarily be saved in a new target format before the actual analysis. Some data, for example, can be taken directly from the source system and fed straight into a calculation. After all, storing data that already exists elsewhere again in a central analytical store causes additional costs. However, there are reasons to store the data accordingly, for example:

  • The performance when loading the data is not good enough, it takes too long to get the data, or it is too expensive.
  • Data from the source systems is not kept sufficiently available. Therefore, the data is transferred to a highly available and distributed solution.
  • The data from the source systems cannot be used in its original form for data protection reasons. Instead, the data must be pseudonymized or anonymized, as it contains personal data.
  • The data is not available in the source system in a format suitable for analysis. Therefore, the data must be converted to a more suitable target format. For example, if the source system is based on an event log such as Apache Kafka, you may want to convert the data into a relational (SQL) format beforehand; a sketch of this follows after the list.
  • It is more convenient to obtain data for analysis from a central source. Caution: a central source does not mean that the source exists geographically in only one place. At this point, central only means that there is "one point of contact", so the data all comes, so to speak, "from the same source". In practice, this is often called a data lake or data warehouse. We will explain the difference between the two terms in the course of this article.
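
As a rough sketch of the Kafka-to-SQL conversion mentioned above: events are consumed from a topic with the kafka-python client and written into a relational table via sqlite3. The topic name "orders", the broker address and the message layout are assumptions for illustration only.

```python
import json
import sqlite3
from kafka import KafkaConsumer  # pip install kafka-python

# Assumed message layout per event: {"order_id": ..., "amount": ...}
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

db = sqlite3.connect("analytics.db")
db.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL)")

# Runs indefinitely: every event from the log becomes a relational row.
for message in consumer:
    event = message.value
    db.execute(
        "INSERT INTO orders (order_id, amount) VALUES (?, ?)",
        (event["order_id"], event["amount"]),
    )
    db.commit()
```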

However, the data lake or data warehouse can in fact be distributed across several locations. This also makes sense so that data is available quickly from anywhere in the world and is stored securely. And although there is only one point of contact for the data, it can be distributed across several different storage technologies. For example, some of the data can be stored in an in-memory database, some in a graph database like Neo4j, some in Apache Hadoop, and some in a columnar database that acts as near-line storage.

How can I read and process the data before analysis?

The data, which is now stored in a central data warehouse or data lake, must now be analyzed to gain business insight and therefore added value from this data. There are different techniques for this. The most famous of these are data mining and machine learning, but other techniques such as natural language processing (NLP) can also be used, for example, for video subtitling.

How do I analyze the data?

The way in which data is analyzed today has changed compared to the past. Today, it is no longer a question of finding the answer to a particular question; this is relatively easy with the tools available today. It is much more important to ask the right questions. In order to know which questions to ask, I have to change my idea of how I handle and analyze data.

But the analysis of the data also has a scientific character. Typically, the following disciplines are involved in the analysis of large amounts of data:

  • Decision Science
  • Pattern recognition and machine learning
  • Natural language processing

A distinction is made between:

  • Descriptive analysis. “What happened?”
  • Diagnostic analysis. “Why did it happen?”
  • Predictive analysis. “What will happen?”
  • Prescriptive analysis. “How do we make it happen?”

The most important technologies associated with Big Data

Classic relational databases

By this I mean the classic relational databases, as they have been on the market for decades. The best known representatives of this category are, for example, MySQL, Oracle, Sybase ASE and Sybase IQ, MS SQL Server, SAP MaxDB and many more. Classic databases are by no means obsolete. Traditional databases work according to the relational model; they can be represented classically in the form of related tables. There are two basic types of relational databases.

Row and column oriented databases – the difference

A row-oriented store saves the entries as you can see here:

1034691;Loibl;Andreas;21.08.89;single

1034692;Müller;Max;01.04.76;married

Why this is important will become clear when we look at column-oriented databases.

A column-oriented data store would store these two data records as follows:

1034691;1034692;

Loibl;Müller

Andreas; Max

21.08.89;01.04.76

single; married

Can you tell the difference? With the row-oriented database, a row in the table is also a row in the storage structure. In a column-oriented database, a column in the table corresponds to a row in the storage structure.

How many rows in the storage structure must be read if the entire data record of the employee Andreas Loibl is to be output? With the row-oriented database, only one row needs to be read, while with the column-oriented database all rows need to be read. So you can imagine that the query

SELECT * FROM employee WHERE name = 'Loibl'

is therefore much faster on the row-oriented database.

What if we want to output all the last names? In this case, in the row-oriented database, all rows must be read, while in the column-oriented database, only one row must be read. The query

SELECT name FROM employee

is therefore much faster in a column-oriented database.
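
A minimal sketch of the two storage layouts using plain Python lists, illustrating why a full-record lookup favors the row layout and a single-column scan favors the column layout. The column names are illustrative.

```python
# Row-oriented layout: one list entry per record
row_store = [
    ("1034691", "Loibl", "Andreas", "21.08.89", "single"),
    ("1034692", "Müller", "Max", "01.04.76", "married"),
]

# Column-oriented layout: one list entry per column
column_store = {
    "id":         ["1034691", "1034692"],
    "last_name":  ["Loibl", "Müller"],
    "first_name": ["Andreas", "Max"],
    "birthday":   ["21.08.89", "01.04.76"],
    "status":     ["single", "married"],
}

# "SELECT * ... WHERE last_name = 'Loibl'": one row read in the row store ...
full_record = next(r for r in row_store if r[1] == "Loibl")

# ... but every column list must be touched in the column store
idx = column_store["last_name"].index("Loibl")
full_record_columnar = [column_store[col][idx] for col in column_store]

# "SELECT last_name ...": one contiguous list in the column store
all_last_names = column_store["last_name"]
```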

What about compression? Let's assume there are 10 single employees and 20 married employees in the employee table. Let's say the storage structure looks like this:

Andreas;Martin;Mario;Max;Robert;Axel;Alex;Oliver;Jan;Sascha

….

single;single;married;married;married;single;married;married;single

then I could compress the last row as

singlex2;marriedx3;single;marriedx2;single

And this is just a very simple example of the fact that column-oriented databases are often easier to compress. That is, I can usually fit more data into a columnar database for the same amount of storage space.
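
The compression hinted at above is essentially run-length encoding. A minimal sketch, using the same illustrative status column:

```python
from itertools import groupby

def run_length_encode(values: list[str]) -> list[tuple[str, int]]:
    """Compress consecutive repetitions into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

status_column = ["single", "single", "married", "married", "married",
                 "single", "married", "married", "single"]
print(run_length_encode(status_column))
# [('single', 2), ('married', 3), ('single', 1), ('married', 2), ('single', 1)]
```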

However, strong compression has a disadvantage. Every time I want to insert a new record into the table or change an existing record, I first have to decompress the affected data, make the change, and then compress it again.


To put it briefly:

  • Row-oriented databases are faster when inserting new records into a table and for all kinds of SELECT * statements (queries on complete records).
  • Column-oriented databases are faster with SELECT <column> statements.
  • Column-oriented databases often require less storage space than row-oriented ones because they can be compressed better.

For this reason, row-oriented databases are preferred for applications where a lot of new data is entered and where complete data records are frequently displayed. To put it more simply, row-oriented databases are more suitable for systems that manage things that have a physical or monetary value. These systems are grouped together as Online Transaction Processing (OLTP) systems. The databases used for this type of data are also called Operational Data Stores (ODS). These include ERP systems, CRM systems, and most typical desktop software systems.

Column databases are often used for analytical applications such as business intelligence software. To put it more simply, columnar databases are more suitable for managing things of intellectual value: analysis, statistics, scientific texts, etc. This is because they very often use filtered queries that only query certain columns of a data set.

NoSQL databases

NoSQL means "Not Only SQL". There are different types. First of all, they all have in common that they usually do not need fixed structures. This means that they do not enforce any constraints on the data that is written to these databases. I have already described the concept in this article. So it is easier to store the data, because no constraints need to be applied; the data is simply put into a large structure such as a JSON document.

NoSQL databases store data in a semi-structured form. The way these databases do this determines the type of NoSQL database; we will get to know the different types later. Most NoSQL databases are also MPP (massively parallel processing) databases. The main advantage of NoSQL databases is that they are highly scalable.
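
A minimal sketch of the "just put the data into a large JSON-like structure" idea: two documents in the same collection with different fields, something a relational table with a fixed schema would reject. The field values are illustrative.

```python
import json

# A schema-free "collection": each document can have its own structure.
customers = [
    {"id": 1, "name": "Loibl", "email": "a.loibl@example.com"},
    {"id": 2, "name": "Müller", "phone": "+49 170 0000000",
     "tags": ["newsletter", "premium"]},  # extra fields, no ALTER TABLE needed
]

print(json.dumps(customers, indent=2, ensure_ascii=False))
```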

NoSQL databases are often chosen over traditional SQL databases when you need to read and write large amounts of data simultaneously. Traditional databases can handle one or the other well, but not both at the same time.

What kind of data is suitable for NoSQL databases?

  • Static, once-written transactional data that does not change frequently. NoSQL databases, with some exceptions, are not really suitable for data that changes over and over again because they are usually not ACID compliant. There are exceptions; for example, the Neo4j graph database is also suitable for frequently changing data.
  • Records that are written in large quantities (non-aggregated data). Because NoSQL databases do not use database locks (at least most of them) and do not have to enforce constraints on the data, they write new records much faster than classic SQL databases. In addition, many NoSQL databases do not have to completely rewrite and rebuild the index each time new records are inserted.
  • There are exceptions. For example, column-oriented NoSQL databases such as HBase are less suitable for transactional data, but more suitable for predefined queries on analytical data.
  • Environments where the data schema can change. In the classic SQL world, if you want to add an attribute to a table and your table has 1 million entries, it becomes a huge task. This is because the database has to go through every single data set in the table, add the new column to it, and possibly rewrite the database statistics and indexes at the end. NoSQL databases have no fixed schema, so adding a new attribute is very easy and the work can be split up: you don't have to execute a huge ALTER TABLE statement all at once as in SQL. You can start writing new records with the new attribute immediately and migrate the old records step by step by simply adding the attribute to them. You can also omit the attribute completely if you don't need it for the old records; nothing prevents you from doing this, as no constraints apply. A sketch of this incremental migration follows after this list.
  • NoSQL databases are easier to scale than classic SQL databases.
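
A minimal sketch of the step-by-step migration described above: new documents carry the new attribute immediately, while old documents are upgraded lazily when they are next touched. The collection, field names and default value are illustrative assumptions.

```python
from typing import Optional

# In-memory stand-in for a schema-free document collection.
employees = {
    1: {"name": "Loibl"},                       # old document, written before the change
    2: {"name": "Müller", "department": "IT"},  # new document already has the attribute
}

def get_employee(emp_id: int, default_department: Optional[str] = "unassigned") -> dict:
    """Read a document and migrate it lazily if the new attribute is missing."""
    doc = employees[emp_id]
    if "department" not in doc:
        doc["department"] = default_department  # migrate only when the record is touched
    return doc

print(get_employee(1))  # {'name': 'Loibl', 'department': 'unassigned'}
```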

When you switch from a traditional SQL database to a noSQL database, you usually give up:

  • Data integrity. As no constraints are enforced on the data to be stored, the structure of the data is no longer guaranteed. Each data record can have a different structure.
  • Consistency of the data. NoSQL databases are usually BASE compliant, not ACID compliant. We will learn what this means next. Of course, there are exceptions; for example, the Neo4j graph database is ACID compliant and therefore suitable for frequently changing data with high demands on data consistency.

The CAP Theorem

The introduction of NoSQL databases is a perfect occasion to address the CAP theorem. The CAP theorem states that for a database the following three characteristics are important:

  • Consistency
  • Availability
  • Partition Tolerance

But only two of these characteristics can be achieved by a single database at the same time; we can never have all three at once. That means there are three types of databases according to the CAP theorem:

  • CA (Consistency + Availability). These are traditional RDBMS such as Oracle, SAP Sybase ASE, etc. But some graph databases such as Neo4j and in-memory key-value stores such as Redis also belong to this category. CA systems are single-site clusters, that is, all nodes in a scale-out cluster must always be interconnected. If the connection between the nodes is broken, the system fails. CA systems are usually ACID compliant.
  • AP (Availability + Partition Tolerance). For example CouchDB, Cassandra, DynamoDB, Riak. Unlike CA systems, the system remains available even if some partitions fail. Disadvantage: some of the data returned may not be correct/consistent. AP systems are usually BASE compliant.
  • CP (Consistency + Partition Tolerance). This includes, for example, MongoDB and HBase. CP means that some data may not be accessible at the moment, but the data that is accessible remains consistent.

To understand when which type of database is the right choice, we first need to understand the three characteristics.

Consistency

Consistency with ACID

The highest degree of consistency is achieved when a database is ACID compliant.

One of the main weaknesses of most NoSQL databases is that they are not ACID compliant. They are based on the principle of "eventual consistency", or BASE compliance. This simply means that several people retrieving the same data at the same time may see different results. What does that mean?

An ACID compliant database has several features that ensure consistency of the data it contains.

  1. Atomicity

As an example, let's take a table of employees, a table of departments and a table of cost centers:

Employee: employee number / name / first name / Department ID

Department: ID / Department name

Cost Center: Account ID / Name / Department ID

So in our model, an employee is assigned to a department, and this department is linked to a cost center.

Suppose you make a change, for example by changing the ID of a department. This requires you to change the corresponding department ID in all three tables. Technically, these are three UPDATE statements. These three operations are now encapsulated in a single transaction. Atomicity refers to the concept that a transaction is only completed consistently when all three operations have been completed.

If, for example, only two of the three operations can be executed because a power failure stops the database midway, the transaction is rolled back when the database is restarted. This restores the database to its original state. Can you imagine the confusion that would result if the department ID were changed in the employees and departments tables while the cost center still referred to the old department ID?
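
A minimal sketch of the three UPDATE statements wrapped in one transaction, here with sqlite3 for illustration: either all three succeed together or none of them does. Table and column names are simplified assumptions based on the model above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee    (emp_no INTEGER, name TEXT, department_id INTEGER);
    CREATE TABLE department  (id INTEGER, name TEXT);
    CREATE TABLE cost_center (account_id INTEGER, name TEXT, department_id INTEGER);
    INSERT INTO employee    VALUES (1034691, 'Loibl', 10);
    INSERT INTO department  VALUES (10, 'Sales');
    INSERT INTO cost_center VALUES (500, 'Sales CC', 10);
""")

def change_department_id(old_id: int, new_id: int) -> None:
    """All three UPDATEs succeed together or not at all (atomicity)."""
    try:
        with conn:  # opens a transaction, commits on success, rolls back on error
            conn.execute("UPDATE department  SET id = ? WHERE id = ?", (new_id, old_id))
            conn.execute("UPDATE employee    SET department_id = ? WHERE department_id = ?", (new_id, old_id))
            conn.execute("UPDATE cost_center SET department_id = ? WHERE department_id = ?", (new_id, old_id))
    except sqlite3.Error:
        print("Change rolled back, all tables still reference the old ID")
        raise

change_department_id(10, 20)
```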

2. Consistency

Consistency represents the principle that the database enforces constraints on data storage that a programmer cannot break in his application code. For example, an application cannot insert text into a table column of an ACID database that has the data type "integer". It is also not possible for an application to bypass the rollback functionality of the database (which we discussed in point 1).

3. Isolation

The principle of isolation addresses the following problem: suppose two employees work in the sales department of a company. A customer writes an email and uses this communication channel to announce a change to an existing order. The email is sent to a shared address: info@tech1985.com. Now it happens by chance that both employees read the customer's request in the mailbox at the same time, without knowing that they are both about to process the same order.

Orders

Order ID | Customer ID | Article Number | Quantity
1        | 1           | 1              | 10
1        | 1           | 2              | 20
1        | 1           | 3              | 5

So currently we have in this table a single customer order with ID 1, which includes three items with different quantities. In his email, the customer wants to “reduce the quantity of item 2 by 5 and the quantity of item 3 by 2”.

What problems can arise if both employees work on the same order at the same time?

Dirty reading: The customer wants to reduce the quantity of article 2 by 5 and the quantity of article 3 by 2 in his email. Employee 1 sees the order as it is currently: quantities 10, 20 and 5. He changes the order to quantities 10, 15 and 3 as requested by the customer and commits the transaction. Employee 2 reads the current status of the order while the changes are being made.

By the time employee 2 looks at the order, the first change from 20 to 15 has already been made, but the second change from 5 to 3 has not yet been made. Employee 2 now processes the order again, reducing the quantity of item 2 from 15 to 10 and item 3 from 5 to 3. The data record for article 2 has already been changed, but not yet for article 3.

The problem that led to this misunderstanding is called "dirty reading": employee 2 has read data that had not yet been finally committed. If employee 2 had seen the same data as employee 1 while the transaction was not yet completed, he would have performed the same steps as employee 1. This problem is solved at the database level by allowing employee 2 to read the records only before or after employee 1's write transaction is completed, but not during the transaction.

This still does not solve the problem that both employees do not know that they are working on the same order, and employee 2 could theoretically reduce the order from 10, 15 and 3 to 10, 10 and 1.

Lost updates: for this problem we have to take another example. We have a warehouse management system that keeps track of which products are currently in stock and in what quantity:

Article Number | Name           | Items in Stock
1              | computer mouse | 50

Two transactions are now executed in parallel; these are two sales transactions. The following happens at different points in time within the transactions:

Transaction 1                                 | Transaction 2
Read how many units are available: 50         |
                                              | Read how many units are available: 50
Write: sell 2 units,                          | Write: sell 3 units,
ItemsInStock = ItemsInStock - 2               | ItemsInStock = ItemsInStock - 3
                                              | Commit: computer mouse still available: 47
Commit: computer mouse still available: 48    |

Do you see the problem? Although 5 units were sold in total, so only 45 units should remain, the database now shows that 48 units are still available; the sale from transaction 2 has been lost. How would you solve this problem? By not executing the transactions in parallel, but one after the other. In transaction 2 you would then read that only 48 units are available instead of 50, and subtract the three units sold from 48.
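
A minimal sketch that reproduces the lost update with two interleaved reads and writes on one counter, followed by the fix of letting the database apply the change to the current value instead of a stale read. sqlite3 is used purely for illustration; the table name is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (article TEXT PRIMARY KEY, items INTEGER)")
conn.execute("INSERT INTO stock VALUES ('computer mouse', 50)")

# Lost update: both transactions read 50 first, then write their own stale result.
read_t1 = conn.execute("SELECT items FROM stock").fetchone()[0]   # 50
read_t2 = conn.execute("SELECT items FROM stock").fetchone()[0]   # 50
conn.execute("UPDATE stock SET items = ?", (read_t2 - 3,))        # T2 writes 47
conn.execute("UPDATE stock SET items = ?", (read_t1 - 2,))        # T1 overwrites with 48
print(conn.execute("SELECT items FROM stock").fetchone()[0])      # 48 -- T2's sale is lost

# Fix: let the database subtract from the current value instead of a stale read.
conn.execute("UPDATE stock SET items = 50")
conn.execute("UPDATE stock SET items = items - 3 WHERE article = 'computer mouse'")
conn.execute("UPDATE stock SET items = items - 2 WHERE article = 'computer mouse'")
print(conn.execute("SELECT items FROM stock").fetchone()[0])      # 45
```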

Non-repeatable reads: we continue with our stock of 50 computer mice. This time we have two transactions. One makes a simple sale of 5 units and therefore removes those 5 units from the table. In the other transaction a sale is also made, but in the form of a calculation which states that at time X the customer has the right to buy half of the stock. In the transactions this looks as follows:

Transaction 1                        | Transaction 2
Read Items in Stock: 50              |
                                     | Read Items in Stock: 50
Write: sell 5 units,                 |
ItemsInStock = ItemsInStock - 5      |
Commit                               |
                                     | Write: sell 50% of the stock,
                                     | ItemsInStock = 45 * 0.5
                                     | Commit

The problem at this point is different from the previous example. Neither update has been lost here, as with the lost update problem. Instead, one of the two updates gives a wrong result, because the underlying data changes during the transaction's calculation. Instead of the 25 items that transaction 2 originally calculated for the customer, he now receives only about 23 items (half of 45). This problem could also be solved if the affected transactions were not performed in parallel but sequentially, one after the other.

Phantom reads

To explain phantom reads, we'll use another example. We have a table in which we store employee records.

ID | Last Name | First Name
1  | Loibl     | Andreas
3  | Deller    | Michael

What we are trying to do now is to run two transactions in parallel again. Transaction 1 executes a read once at the beginning and once at the end of the transaction. Between the two reads, which are for example 10 seconds apart, transaction 2 inserts a new record.

Transaction 1                       | Transaction 2
Read the records in the table:      |
2 employees are returned            |
                                    | Add an employee with ID 2
                                    | Commit
Read the records in the table:      |
3 employees are returned            |

It is important to understand that there is nothing wrong with the newly added employee showing up as soon as transaction 2 is completed.

When you insert a new record into a database, you want it to show up, right? The problem here is that transaction 1 is supposed to behave as if it were the only transaction running in the database. If two read accesses to the same table within one transaction return different results, that is the problem. Because how is transaction 1 supposed to do anything with the data it reads if it cannot trust that data?

Isolation levels

So the isolation component of ACID addresses four fundamental problems. There are different levels of isolation that can be applied to a transaction. The higher the isolation level, the fewer of these four known problems occur. The simplest solution – and at the same time the highest isolation level – is to completely prohibit parallel execution of transactions. This means that as long as a transaction is executed on a data set – regardless of whether it reads or writes – this data set is locked for other transactions.

A database enforces this technically by means of database locks. It is important to understand that only the data sets affected by a transaction are locked; data sets that are not affected by the transaction can still be modified by other transactions. Serialization does not allow parallel transactions on the same data set, which has a negative effect on performance: for a new transaction to be executed on a data set, the previous one must first be completed. The higher the isolation level of a database, the fewer transactions can run in parallel, the lower the performance of transaction execution, and the lower the risk of inconsistent data. The lower the isolation level, the higher the degree of parallelism and therefore the better the performance, but also the higher the possibility of data inconsistencies. A sketch of how a stricter isolation level can be requested follows after the table below.

Isolation Level  | Dirty Reads | Lost Updates | Non-repeatable Reads | Phantom Reads
Read Uncommitted | may occur   | may occur    | may occur            | may occur
Read Committed   | don't occur | may occur    | may occur            | may occur
Repeatable Read  | don't occur | don't occur  | don't occur          | may occur
Serializable     | don't occur | don't occur  | don't occur          | don't occur
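
As a hedged sketch of how a stricter isolation level could be requested: standard SQL executed over a PostgreSQL connection via psycopg2. The connection parameters and the stock table are placeholders carried over from the earlier illustration, not part of the original article.

```python
import psycopg2  # pip install psycopg2-binary

# Placeholder connection parameters.
conn = psycopg2.connect(host="localhost", dbname="shop", user="app", password="secret")

with conn:
    with conn.cursor() as cur:
        # Request the strictest level for this transaction: no dirty reads,
        # no lost updates, no non-repeatable reads, no phantom reads.
        cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
        cur.execute("SELECT items FROM stock WHERE article = %s", ("computer mouse",))
        items = cur.fetchone()[0]
        cur.execute("UPDATE stock SET items = %s WHERE article = %s",
                    (items - 2, "computer mouse"))
# Leaving the outer 'with' block commits the transaction (or rolls back on error).
```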

4. Durability

The last component of ACID, durability, states that transactions that manipulate data are only completed when the data is verifiably stored in non-volatile memory. This means that if data is only present in memory and has not yet been written to a hard disk, for example, all data manipulations of this transaction will be reversed if a power failure occurs and the database solution is restarted. The transaction is only completed when all data manipulation operations of the transaction have found their way onto a persistent storage medium.

BASE consistency

Now that we have discussed ACID consistency, the next topic is BASE consistency. Compared to ACID consistency, BASE consistency is a weaker form of guaranteed data consistency. BASE consistency usually means that "eventual consistency" of the data is achieved. This means that if everything goes well, the data is consistent, and if not, there are slight inconsistencies that have to be corrected on the fly.

BASE consistency is an "optimistic approach", while ACID consistency is a "pessimistic approach". But an application that "only" works under the dogma of BASE consistency is usually able to work more efficiently, as many operations can be processed in parallel. BASE consistency is often used in domains where the load on the database solution is high, but 100% data correctness is secondary.

For example, it is not important whether what is shown on Facebook is 100% correct or whether what has been posted in the last 5 minutes is not yet taken into account in a read statement. BASE consistency does not mean that data consistency is never achieved at some point. With BASE, however, consistency is not achieved immediately when the data is written. In return, this allows performance-optimized writing of data, for example through the use of queues.

BASE stands for:

1. Basically available

The database is mostly functional, and if a part of it fails, most of the database still works. The database data is always redundantly distributed over several nodes. If some nodes are down, the data is still available from the other nodes.

2. Soft state

However, some of the available data may be outdated and therefore inconsistent. It is possible that a failed node holds a more up-to-date version of the data (e.g. the number of Facebook likes for a post) than the node that is still available. This is the difference from ACID. You can imagine that this is fine for Facebook likes, but unacceptable for your bank account balance.

If you operate an ACID compliant database with replication, there is always a master and a slave node. A write action to the database is only completed when the data has been persisted on both the master and the slave. That is, if one of the two nodes fails, the other node has exactly the same data state, so the data remains consistent. It is not the same with BASE: here the slave may not have received an update from the master for a long time, but write actions are processed faster.

3. Eventual Consistency

Eventual consistency means that the database solution has implemented logic that allows the different data states, which can diverge due to the soft-state property of BASE, to "repair" themselves. This means that sooner or later, at some point, the data in a BASE compliant database will be consistent.

Let's stick with our "Facebook likes" example. A node that currently counts 500 likes has failed, and these 500 likes have not yet been replicated to the other nodes, which still assume 400 likes. When the node with 500 likes comes back online, the database may have implemented logic that lets the newer records automatically overwrite the older ones.

What if new likes have been added in the meantime? Let's say that while the node with 500 likes was unavailable, new likes were added, so the other nodes now have 450 likes. Well, in that case, if you simply take over the number of likes from the returning node, you would adopt the value 500, which in fact results in the loss of 50 likes: the 500 likes were already in the system before, and the 50 that were added in the meantime are no longer counted. This is also why it is called eventual consistency. At some point, the system will reach a technically consistent state, but there is no guarantee that every transaction is actually reflected consistently.

By the way, you can solve this problem by not storing the number of likes as a plain number, but by actually storing relationships. That means that if a user clicks "Like", a relationship to this user is stored. In this case, when the nodes see each other again later, they simply exchange all relationships with each other, and the database is consistent again.
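
A minimal sketch of this relationship-based approach: each node stores the set of users who liked a post, and reconciliation is a simple set union, so no like is lost. The user IDs and node contents are illustrative.

```python
# Each node stores which users liked the post, not just a counter.
likes_node_a = {"user_1", "user_2", "user_3"}             # node that went offline
likes_node_b = {"user_1", "user_2", "user_4", "user_5"}   # likes collected in the meantime

# Reconciliation: merge the relationships instead of overwriting a number.
merged_likes = likes_node_a | likes_node_b
print(len(merged_likes))  # 5 -- no like is lost, the count is derived from the union
```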

The Unit of Work pattern in the application code

If you are an application developer, you might say: "Who needs ACID compliance? I can take a BASE compliant database and make sure at the application level that the transactions are always executed cleanly one after the other. Better yet, at the application level I can prevent a user from processing an order that another employee is already working on. In this way, I prevent employees from unknowingly overwriting each other's changes." Therefore, the application of this design pattern is not only useful in environments with high demands on data consistency; in many cases it is even necessary.

Basically, this would be an idea, but at this point you have three problems, the last of which is the most important:

  • An application-level LUW (logical unit of work) usually consists of several database-level LUWs. These database-level LUWs can run in parallel and are therefore not sufficiently isolated if the corresponding logic is not implemented. For example, an SAP LUW may consist of three database LUWs that are generated in three consecutive input screens. At the end of the SAP LUW, the three database LUWs are released. Once released, the three database LUWs could run in parallel and therefore run into the problems described above at the database level.
  • In addition to the application for which you write the source code yourself, the database is also accessed by other client programs, such as command line clients or Excel macros. Your application-level logic would not cover these accesses.
  • Creating consistency at the application level requires constant communication between individual instances of the server. The end result is that, although you have a BASE database, you have the same disadvantages as if you were using an ACID database. This is because you still have to wait until transaction 1 is completed before you can start transaction 2. You no longer have the advantages of a BASE database if you implement this design pattern.
