Machine Learning Takes on Data Classification and Security

The traditional way to approach data security often involves blunt force. Malek Ben Salem, Senior Principal at Accenture Labs, argues that organizations must adopt a more sophisticated and nuanced approach to managing data. Amidst a growing stream of documents, it’s critical to adopt a classification scheme. Accenture is addressing this challenge with software and systems designed to improve data protections and controls. Security Roundtable recently spoke to her about the challenges surrounding data security and how businesses can ratchet up protection.

Security Roundtable: What fundamental problems do organizations face regarding all the documents, files, and data they generate and use?

Malek Ben Salem: The growth rate for data is somewhere between 40 percent and 50 percent annually at most companies. A lot of this data is unstructured, so it can't be stored in a relational database. It's business documents, emails, and data collected from social media. Classifying this data, and identifying what is sensitive, is difficult. Yet some of it may contain sensitive information, including intellectual property, that puts the business at risk.

SR: What does this mean for organizations and the way they approach data security?

MBS: A perimeter-based approach simply doesn’t work in today’s world. An enterprise must embed security controls in the data or make sure they are tied to the data. However, there aren’t unlimited resources to address cybersecurity risks. So it’s critical to understand what’s sensitive. This requires classification tools. We’ve developed a method that handles scalable data classification using machine learning. By automating processes, classifying documents and other types of unstructured data, and establishing restrictions—such as what’s for internal use and what can be shared with the rest of the world—it’s possible to approach cybersecurity more effectively.
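
To make the idea of tying restrictions to classification labels concrete, here is a minimal sketch in Python. The labels and handling rules are illustrative assumptions, not Accenture's actual scheme:

```python
from enum import Enum

class Sensitivity(Enum):
    PUBLIC = 1               # may be shared externally
    INTERNAL = 2             # internal use only
    CONFIDENTIAL = 3         # restricted distribution
    HIGHLY_CONFIDENTIAL = 4  # need-to-know

# Illustrative handling restrictions keyed by classification label.
HANDLING_RULES = {
    Sensitivity.PUBLIC: {"external_sharing": True, "encrypt": False, "mfa": False},
    Sensitivity.INTERNAL: {"external_sharing": False, "encrypt": False, "mfa": False},
    Sensitivity.CONFIDENTIAL: {"external_sharing": False, "encrypt": True, "mfa": False},
    Sensitivity.HIGHLY_CONFIDENTIAL: {"external_sharing": False, "encrypt": True, "mfa": True},
}

def controls_for(label: Sensitivity) -> dict:
    """Return the handling restrictions tied to a document's label."""
    return HANDLING_RULES[label]

print(controls_for(Sensitivity.CONFIDENTIAL))
```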

SR: How does this process work?

MBS: The first step is to develop a data-classification policy. An organization must understand what data it's collecting and how that data should be classified. This requires analyzing the data and understanding its value and risk. The next step is to discover where the data is located and stored within the organization. With a classification policy in place, it's possible to add the necessary protections.
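
As a rough illustration of the discover-then-protect sequence, the sketch below walks a file share and applies a toy keyword-based policy. The policy rules, labels, and path are hypothetical, and a real classifier would be far more sophisticated:

```python
import os

# Hypothetical policy: keyword triggers mapped to classification labels.
POLICY = {
    "acquisition": "highly_confidential",
    "contract": "confidential",
    "salary": "confidential",
}

def classify(text: str) -> str:
    """Apply the toy classification policy to a document's text."""
    lowered = text.lower()
    for keyword, label in POLICY.items():
        if keyword in lowered:
            return label
    return "internal"

def discover(root: str):
    """Walk a file share and report where classified data lives."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    yield path, classify(f.read())
            except OSError:
                continue  # unreadable file; skip it

for path, label in discover("/shared/docs"):  # hypothetical share
    print(label, path)
```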

SR: How is Accenture approaching data classification and protection?

MBS: We have developed a tool called SCAML [Scalable Classification through Machine Learning], which automatically determines and labels the sensitivity of documents. It allows an organization to apply appropriate data protections and controls. The classification technology is advanced enough to identify different types of data. Today, we already have systems that can identify a Social Security number or a credit card number, but we need tools to identify intellectual property, contracts, emails about acquisition plans, and other sensitive or highly confidential documents. SCAML addresses this challenge.
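
The pattern matching she refers to for structured identifiers can be sketched with simple regular expressions (the patterns below are simplified; real detectors add checksum and context validation, such as the Luhn check for card numbers). What such patterns cannot do, and what SCAML targets, is recognize unstructured content such as intellectual property or deal discussions:

```python
import re

# Simplified detectors for structured identifiers.
DETECTORS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def find_identifiers(text: str) -> dict:
    """Return any structured identifiers found in a document."""
    return {name: rx.findall(text)
            for name, rx in DETECTORS.items() if rx.search(text)}

print(find_identifiers("Employee SSN 123-45-6789 on file."))
```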

SR: Classifying data is one piece of the puzzle. How does an organization actually protect the data?

MBS: The other important piece of this framework is data access. Who is allowed to view or handle the data? What rights and privileges does a person have? The idea is to establish a cognitive footprint. The obvious tool, and the one traditionally used, is user ID and password. But there are other, far more sophisticated ways to authenticate users. We can examine their behavior, particularly their cognitive behavior. This doesn't refer to physical biometrics or behavioral biometrics, such as the way you use a mouse or keyboard. It's a more sophisticated method that looks at the way a person processes information: how they organize information within a file system, how they retrieve it, and how they search for information.
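
One way such a cognitive footprint might be scored, assuming hypothetical features such as directory depth and search-query habits, is to compare a session against a per-user baseline. This is a simplified sketch, not the actual method:

```python
from statistics import mean, stdev

# Hypothetical cognitive-footprint features per session: average directory
# depth of files opened, number of searches, and mean search-query length.
baseline_sessions = [
    {"dir_depth": 3.1, "searches": 4, "query_len": 11.0},
    {"dir_depth": 2.9, "searches": 5, "query_len": 12.5},
    {"dir_depth": 3.3, "searches": 3, "query_len": 10.2},
]

def feature_stats(sessions, feature):
    values = [s[feature] for s in sessions]
    return mean(values), stdev(values)

def anomaly_score(session, baseline):
    """Sum of absolute z-scores across features; higher = less like the user."""
    score = 0.0
    for feature in session:
        mu, sigma = feature_stats(baseline, feature)
        score += abs(session[feature] - mu) / (sigma or 1.0)
    return score

# A masquerader who searches very differently stands out.
print(anomaly_score({"dir_depth": 1.0, "searches": 40, "query_len": 3.0},
                    baseline_sessions))
```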

SR: How does this approach improve on today’s security frameworks?

MBS: The problem with today's security controls is that they are binary. With a password you can access the data; without a password you cannot. A classification framework using a cognitive footprint allows an organization to achieve a higher level of confidence about the identity of the person accessing data. The system can be used for intrusion detection and for identifying masqueraders and anomalous insider behavior. In addition, most data is now classified manually. It's a time-intensive, error-prone process that doesn't scale. Oftentimes, organizations classify the same data in different ways yet apply identical classification labels, so a label doesn't mean the same thing everywhere. Machine learning and automation can greatly improve these processes.
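
The contrast between a binary check and a confidence-based decision can be sketched as follows; the thresholds and step-up actions are illustrative assumptions:

```python
def binary_access(password_ok: bool) -> bool:
    # Today's model: one bit decides everything.
    return password_ok

def graded_access(password_ok: bool, behavior_confidence: float) -> str:
    """Combine credentials with a 0-1 cognitive-footprint confidence score."""
    if not password_ok:
        return "deny"
    if behavior_confidence >= 0.8:
        return "allow"
    if behavior_confidence >= 0.5:
        return "allow_with_mfa"   # step-up authentication
    return "deny_and_alert"       # possible masquerader with a stolen password

print(graded_access(True, 0.6))   # -> allow_with_mfa
```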

SR: How do automation and machine learning work within SCAML?

MBS: To automate the classification process and make it scalable, it's necessary to give the software classifier samples of sensitive data and samples of non-sensitive data. Machine learning allows the classifier to learn from those samples and extract rules about what makes a document sensitive or not. Once the classifier develops models for sensitive and highly confidential documents, it can handle the classification process autonomously. The next step is to put the appropriate security controls in place. This includes layering additional security, such as active authentication, on top of existing controls. Active authentication uses the cognitive footprint, along with other behavioral methods, to identify anomalies. It's also possible to use the classification framework to automate other tasks. For example, the system can automatically encrypt appropriate data or require additional authentication, such as multifactor authentication, for certain types of documents.
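
A minimal sketch of this train-from-samples approach, using scikit-learn as a stand-in (the tiny corpus, labels, and model choice are assumptions, not SCAML's actual pipeline):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in labeled samples; a real deployment would use thousands of documents.
documents = [
    "Quarterly acquisition strategy and target valuation figures",
    "Patent draft for our new compression algorithm",
    "Cafeteria menu for the week of March 3",
    "Reminder: building elevator maintenance on Friday",
]
labels = ["sensitive", "sensitive", "non_sensitive", "non_sensitive"]

# TF-IDF features plus a linear model let the classifier learn which
# terms signal sensitivity directly from the labeled samples.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(documents, labels)

new_doc = ["Draft term sheet for the pending acquisition"]
print(classifier.predict(new_doc))        # predicted label
print(classifier.predict_proba(new_doc))  # confidence per class
```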

SR: Any final thoughts on this subject?

MBS: The beauty of the cognitive approach is that it's not visible and it scales to big-data environments. So, ultimately, it's much harder for an attacker to mimic the behavior of a user. An attacker might be able to observe an employee's typing or mouse movements, but it's much harder to observe how a person searches a file system or how they organize their data. All the modeling can be performed locally at the endpoint, so the detection model is in place, but the specific user data isn't visible to anyone. In this way, you protect the user's privacy, and no one can upload or download the underlying behavioral data. It introduces a level of security that's crucial for today's data environment.