When it comes to processing large amounts of data in a big data environment, companies cannot afford to neglect data security. Protecting and safeguarding the company's own data plays a central role here. We show what needs to be taken into account and which tools can help.
- 1. Taking security into account from the planning stage
- 2. Make data processing legally clear
- 4. Clearly define the purpose of data collection – also internationally
- 4. Competence data centres – assistance from universities
- 5. Implementing permissions with extensions – Apache Drill
- 6. Security with Apache Falcon
- 7. Apache Knox – Protecting Hadoop Clusters
- 8. Monitoring Hadoop with Apache Chukwa
- 9. Hadoop Cluster Security with Apache Ranger
- 10. Data security with professional tools
1. Taking security into account from the planning stage
When planning big data projects, companies should define precisely at the outset which employees and groups are to be granted access to which data. The data transfer processes into the big data solution – including possible security gaps – must also be taken into account.
Employees responsible for protecting individual data sets must be appointed, trained, and given the resources to ensure security. This can include tools for reporting and monitoring data transfers, but also participation in project groups concerned with data security. IT security and the prevention of data manipulation should likewise be planned and integrated into the infrastructure at an early stage. In principle, it must be precisely defined which group of people may access which information and evaluations.
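Such group-based access definitions can be captured in a simple access matrix before any tooling is chosen. A minimal sketch in Python – all group names, dataset names, and permission levels here are invented for illustration:

```python
# Minimal sketch of a group-based access matrix, as described above.
# Group names, dataset names, and permission levels are illustrative only.
ACCESS_MATRIX = {
    "marketing": {"web_clickstream": "read"},
    "finance":   {"sales_reports": "read", "raw_transactions": "read"},
    "data_team": {"web_clickstream": "write", "raw_transactions": "write"},
}

def is_allowed(group: str, dataset: str, action: str) -> bool:
    """Check whether a group may perform an action on a dataset."""
    granted = ACCESS_MATRIX.get(group, {}).get(dataset)
    # In this simple model, "write" permission implies "read".
    return granted == action or (granted == "write" and action == "read")

print(is_allowed("finance", "sales_reports", "read"))      # True
print(is_allowed("marketing", "raw_transactions", "read")) # False
```

Writing the matrix down this explicitly during planning makes it easy to review with the legal and security teams before it is translated into the policies of tools such as Ranger or Knox.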
2. Make data processing legally clear
The processing of large amounts of data can have legal consequences, especially when it comes to social media research or surveys. After planning the processes, companies should therefore have lawyers evaluate the data processing and ensure that the necessary security measures, processes, compliance guidelines, and all types of information processing are legally sound. In addition to legally compliant data processing, security and data protection also play a role here.
As soon as third-party data comes into play and personal data is collected and stored, great care must be taken, especially if the big data solution is hosted in the cloud. An advantage of cloud deployment, however, is that security solutions are often already in place and can be activated easily.
3. Clearly define the purpose of data collection – also internationally
Data collection without a clearly defined purpose is not permitted. Many big data functions, such as linking queries across various social networks, are simply not allowed in Europe. In the United States there are fewer obstacles, while other countries are similarly strict. Data controllers should therefore draw up a data protection plan for each country from which data is to be processed or in which reports are to be produced.
4. Competence data centres – assistance from universities
Competence centres have been founded in Europe to improve data processing in big data scenarios. Their focus is not only on high-performance processing, but also on IT security.
Business managers should examine the publications and information provided by these competence centres. Some of them deal with open source applications in this area; there are also competence centres for services and scalable data solutions.
5. Implementing permissions with extensions – Apache Drill
Apache Drill extends Hadoop and NoSQL database environments with the ability to run SQL queries. The tool also allows access rights to data to be assigned, which is currently one of the developers' main focus areas. If users create a virtual dataset (CREATE VIEW vd AS SELECT) to share information with other users, this should become easier to control in new versions.
Administrators should then be able to define who has access rights to the results of Apache Drill queries. In addition, Apache Drill allows virtual datasets to be defined in the data source that only certain users may access. These users can only run queries against this data and have no access to other information.
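A view like the one described can be created through Drill's REST API, which accepts SQL statements as a JSON body on the `/query.json` endpoint. A sketch – the workspace, view, and table names are illustrative placeholders:

```python
import json

# Sketch: submitting a CREATE VIEW statement through Drill's REST API
# (POST to http://<drillbit>:8047/query.json). The workspace, view,
# and file names below are illustrative placeholders.
def drill_query_payload(sql: str) -> str:
    """Build the JSON body Drill's /query.json endpoint expects."""
    return json.dumps({"queryType": "SQL", "query": sql})

# A virtual dataset exposing only two columns, so users of the view
# never touch the underlying raw table.
sql = ("CREATE VIEW dfs.tmp.customer_names AS "
       "SELECT first_name, last_name FROM dfs.tmp.`customers.json`")
payload = drill_query_payload(sql)
print(payload)
```

Granting users access to the view, but not to the raw file behind it, is exactly the kind of restriction the article describes.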
6. Security with Apache Falcon
Apache Falcon is another big data project that covers security aspects. Its focus is on the ability to define access rules and guidelines for data processing and management. Beyond security, the application can also be used to create rules for how processes and tools should behave when queries are aborted.
Falcon can thus integrate workflows into big data scenarios that provide not only better and faster data processing but also more security. Notifications can also be integrated, so that it is quickly apparent whether a workflow or action has been aborted, and by whom.
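Falcon describes datasets and their rules in XML entities. An illustrative sketch of a feed entity – all names, paths, and dates are placeholders – showing how an ACL ties a dataset to an owner and group, and how a retention rule governs automatic cleanup:

```xml
<!-- Illustrative Falcon feed entity; names, paths, and dates are
     placeholders, not a definitive configuration. -->
<feed name="exampleFeed" description="Hourly input data" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2020-01-01T00:00Z" end="2021-01-01T00:00Z"/>
      <!-- Retention: data older than 90 days is deleted automatically. -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/input/${YEAR}-${MONTH}-${DAY}"/>
  </locations>
  <!-- ACL: owner and group restrict who manages this dataset. -->
  <ACL owner="etl-user" group="analysts" permission="0x755"/>
  <schema location="/none" provider="none"/>
</feed>
```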
7. Apache Knox – Protecting Hadoop Clusters
Apache Knox is a REST API gateway for Hadoop clusters. The extension complements the Hadoop security model and integrates authentication and user roles for accessing data in the environment. This also enables access monitoring.
Knox can also connect to Active Directory and other LDAP directories to authenticate users, and integration with other authentication mechanisms is possible. Automation features can be added as well, and policies further increase security.
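The gateway principle is visible in the URLs clients use: instead of addressing the NameNode directly, a WebHDFS call goes to Knox, which authenticates the user (e.g. against LDAP) before forwarding the request. A small sketch – the host and topology names are illustrative:

```python
# Sketch: how clients address Hadoop services through a Knox gateway.
# The host name and topology ("default") are illustrative placeholders;
# the /gateway/<topology>/webhdfs/v1 path pattern is Knox's URL scheme.
def knox_webhdfs_url(gateway_host: str, topology: str, path: str, op: str) -> str:
    """Build a WebHDFS URL routed through the Knox REST gateway."""
    return (f"https://{gateway_host}:8443/gateway/{topology}"
            f"/webhdfs/v1{path}?op={op}")

url = knox_webhdfs_url("knox.example.com", "default", "/data", "LISTSTATUS")
print(url)
# https://knox.example.com:8443/gateway/default/webhdfs/v1/data?op=LISTSTATUS
```

Because every request passes through this single entry point, access can be authenticated, authorized, and logged centrally.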
8. Monitoring Hadoop with Apache Chukwa
To monitor a Hadoop infrastructure, administrators are best served by Apache Chukwa. The system is itself Hadoop-based and monitors access to HDFS data. The MapReduce framework can also be analyzed and monitored. Chukwa works together with Knox, so administrators can achieve both more security and optimal monitoring of the Hadoop cluster.
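One kind of record such a monitoring pipeline collects is the HDFS audit log written by the NameNode. A sketch of parsing a single audit line into a structured record – the sample line itself is invented for illustration:

```python
import re

# Sketch: parsing one HDFS NameNode audit log line into a record, the
# kind of data a Chukwa-based monitoring pipeline collects and analyzes.
# The sample line below is invented for illustration.
AUDIT_RE = re.compile(
    r"allowed=(?P<allowed>\w+)\s+ugi=(?P<user>\S+).*?"
    r"cmd=(?P<cmd>\S+)\s+src=(?P<src>\S+)"
)

def parse_audit_line(line: str) -> dict:
    """Extract user, command, path, and allow/deny from an audit line."""
    m = AUDIT_RE.search(line)
    return m.groupdict() if m else {}

sample = ("2020-05-01 12:00:00,000 INFO FSNamesystem.audit: "
          "allowed=false ugi=alice (auth:SIMPLE) ip=/10.0.0.5 "
          "cmd=open src=/secure/salaries.csv dst=null perm=null")
record = parse_audit_line(sample)
print(record)
# {'allowed': 'false', 'user': 'alice', 'cmd': 'open', 'src': '/secure/salaries.csv'}
```

Aggregating records like this one makes denied accesses and unusual access patterns visible to administrators.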
9. Hadoop Cluster Security with Apache Ranger
With Apache Ranger, companies can improve security, especially in Hadoop clusters. The environment provides security policies and can monitor their enforcement. Especially for workloads, queries, and batch tasks, Ranger makes it possible to add security through centrally defined and enforced policies.
The advantage of Ranger is that the solution also works closely with YARN, Solr, Falcon, and Apache Kafka and can be integrated into these solutions as well.
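Ranger policies are JSON documents that can be created through its REST API. A sketch of building such a policy for read-only HDFS access – the service name, path, and user names are illustrative placeholders:

```python
import json

# Sketch: the JSON shape of an Apache Ranger HDFS policy, such as one
# submitted to Ranger's REST API. Service, path, and user names are
# illustrative placeholders.
def hdfs_read_policy(service: str, path: str, users: list) -> dict:
    """Build a policy granting the given users read-only access to a path."""
    return {
        "service": service,
        "name": f"read-only access to {path}",
        "resources": {
            "path": {"values": [path], "isRecursive": True},
        },
        "policyItems": [{
            "users": users,
            "groups": [],
            "accesses": [{"type": "read", "isAllowed": True}],
        }],
    }

policy = hdfs_read_policy("hadoopdev", "/data/reports", ["alice", "bob"])
print(json.dumps(policy, indent=2))
```

Because policies are plain data, they can be versioned and reviewed like code, which fits the planning-first approach recommended at the start of this article.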
10. Data security with professional tools
The best security is currently achieved with commercial applications such as Oracle Audit Vault and Database Firewall. The product is used not only in big data scenarios with Oracle databases, but also in other projects. The solution can collect and analyze log files from different systems. In addition to Oracle databases, Microsoft SQL Server, IBM DB2, SAP, and MySQL can be evaluated; Hadoop support is also included.
With InfoSphere Guardium Data Activity Monitor, IBM also has such a product in its portfolio. The application prevents unauthorized access to databases and applications. This solution also works closely with Hadoop.