Clearing the Air: How Machine Learning Can Enable Data-based Cyber Security

16 Sep

in Blog, Perspectives, Technology

by Madhusudana Shashanka

There’s been some great excitement in the cyber security industry around machine learning, especially with anomaly detection and behavioral analytics. Despite the buzz, I have noticed more than a healthy dose of skepticism amongst security experts.

As a machine learning researcher and practitioner, I was baffled by this skepticism when I first started in the security industry. Familiar with the successes of machine learning in fields such as retail and marketing, I wondered why people, mostly security experts and practitioners, stubbornly refused to embrace it. When I looked into it, I found a wealth of academic literature going back all the way to the eighties touting great results for machine learning on security applications. However, once I started digging, I found very few examples of successfully deployed real-world machine learning systems. There have been attempts to understand this startling discrepancy, and they also offer lessons for machine learning practitioners.

Despite the challenges, machine learning can indeed make a big impact in security, if done right. Unfortunately, over the years, security vendors have spread a lot of misinformation and built up false expectations around machine learning.

So first, let’s define what machine learning is, and what it is not.

Of all the descriptions and introductions of machine learning I have come across, I like John Platt’s version best. I attended a Deep Learning meetup last year where he was presenting, and he started talking about how people often conflate machine learning, data mining, and artificial intelligence. He said:

“Machine learning is actually a software method, it’s a way to generate software… so, it uses statistics… it uses data to produce programs, and those programs are eventually put into production…”

When you can explicitly codify all the input-to-output mappings, standard software development can be used for automation. But when the problem is unbounded, attempting any sort of comprehensive input-to-output mapping is essentially impossible. With machine learning, you do not codify the mappings yourself; instead, you provide some observed inputs and the corresponding outputs. The machine learning algorithm does the rest, automatically figuring out the mappings. Machine learning techniques can be used toward artificial intelligence (emulating human intelligence) or for data mining (extracting specific insights from data). This is machine learning: nothing more, nothing less, and no magic involved.
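
To make the contrast concrete, here is a minimal sketch in Python. It is only an illustration: the feature (failed logins per hour), the hand-picked threshold, the example data, and the function names are all hypothetical, and a single learned threshold stands in for the far richer models used in practice.

```python
from typing import Callable, List, Tuple


def handwritten_rule(failed_logins: int) -> bool:
    """Standard software: a person codifies the mapping explicitly."""
    return failed_logins > 10  # threshold picked by hand (hypothetical)


def learn_threshold_rule(
    examples: List[Tuple[int, bool]],
) -> Callable[[int], bool]:
    """Machine learning in miniature: pick the threshold that makes the
    fewest mistakes on observed (input, output) pairs."""
    values = sorted({x for x, _ in examples})
    # Candidate thresholds: one below the smallest value, plus each observed value.
    candidates = [values[0] - 1] + values

    def errors(threshold: int) -> int:
        # Count examples where "x > threshold" disagrees with the label.
        return sum((x > threshold) != label for x, label in examples)

    best = min(candidates, key=errors)
    # The returned function is the "program" produced from data.
    return lambda x: x > best


# Hypothetical labeled observations: (failed logins per hour, was it malicious?)
examples = [
    (1, False), (2, False), (3, False), (4, False),
    (25, True), (30, True), (40, True),
]

learned_rule = learn_threshold_rule(examples)
```

Calling `learned_rule(35)` flags the event and `learned_rule(3)` does not, yet nobody typed the threshold in. If attackers change their behavior, you retrain on fresh examples instead of rewriting the rule by hand.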

This is especially useful in security where the problem is unbounded, as cybercriminals are constantly changing their tactics to stay undetected.

However, these sophisticated algorithms are necessary but not sufficient for building an effective system. To build one, keep the following considerations front and center:

  • For starters, one has to begin with reliable, good-quality data and specific, meaningful, and measurable properties called features. Designing, testing, and generating such features requires deep domain expertise and iterative experimentation.
  • Second, extreme care should be taken when automating security decisions based on the results of machine learning models. In applications such as social media or movie recommendations, where the cost of making mistakes tends to be low, automating the decision process could be a no-brainer (though mistakes there can be expensive as well; see the recent Google gorilla fiasco). But in a domain like security, you should think carefully about the risks of automation, especially when research shows that people often trust human judgment over algorithms. Designing a good human-in-the-loop system requires knowledge of the domain.
  • Finally, it is very important to provide context or evidence to support the results of these automated models. This again requires domain expertise. Algorithms can surface important correlations, but it takes an expert to weed out spurious effects. After all, correlation doesn’t imply causation. Take a look at Tyler Vigen’s website for amusing and absurd examples of such spurious correlations.
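
The last two considerations can be sketched in a few lines of Python. This is an assumption-laden illustration rather than a description of any particular product: the `Alert` fields, the review threshold, and the feature names are hypothetical. The model's score never triggers an automatic block; it only decides whether an analyst sees the alert, and the analyst gets the top contributing features as supporting evidence.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Alert:
    entity: str
    score: float  # model output in [0, 1] (hypothetical scale)
    # Feature name -> how strongly it contributed to the score (hypothetical).
    evidence: Dict[str, float] = field(default_factory=dict)


def triage(alert: Alert, review_threshold: float = 0.5) -> str:
    """Route an alert; the final security decision stays with a human."""
    if alert.score < review_threshold:
        return "log_only"
    return "analyst_review"


def explain(alert: Alert, top_n: int = 2) -> List[str]:
    """Return the strongest contributing features as context for the analyst."""
    ranked = sorted(alert.evidence.items(), key=lambda kv: kv[1], reverse=True)
    return [f"{name}: {value:.2f}" for name, value in ranked[:top_n]]


suspicious = Alert(
    entity="host-17",
    score=0.92,
    evidence={
        "bytes_out_zscore": 4.1,
        "new_destination_count": 2.5,
        "off_hours_logins": 0.3,
    },
)
routine = Alert(entity="host-03", score=0.12)
```

Here `triage(suspicious)` returns `"analyst_review"` and `explain(suspicious)` lists the two largest contributions, while `triage(routine)` returns `"log_only"`. Whether those contributions reflect a real attack or a spurious correlation is exactly the call that still needs a domain expert.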

To summarize, domain expertise is key when designing a machine learning-based product for security. And, despite the risk of being labeled a shameless self-promoter, let me tell you why I, as a machine learning practitioner, am really excited about Niara’s security analytics platform. I believe we have assembled all the ingredients necessary to make machine learning effective. We have built a scalable platform that provides detailed visibility into your data, all the way down to the packets, and enables analysts to quickly pull the context of an alert. We have done the undifferentiated heavy lifting for you so that your team can focus and follow through on their intuitions and leads.

And needless to say, we have best-in-class, state-of-the-art machine learning algorithms as part of our solution. Your team can make decisions based on data and not on the HiPPO (Highest Paid Person’s Opinion). Don’t just take my word for it; reach out and we will show you how.
