Better Understanding Distributed Malware via Graph Analytics and Machine Learning

11 Feb

in Blog, Perspectives

by Dr. Ankur Teredesai

If you read the Mandiant APT1 report or other similar high-profile analyses of long-running, targeted and highly damaging attacks, a trend becomes very clear: threat actors target multiple organizations, in multiple verticals, over multiple years, from multiple hop-off points (sometimes from one victim to another to another!).

A compromise in cyberspace is not like a missile whose trajectory follows the laws of physics and can be traced back to where it was launched. Malicious flows can come from anywhere, at any time, and look like just another HTTP connection in a sea of HTTP flows. On the hopeful side, almost all of these reports today include indicators of compromise (IOCs) and Observables, which are time-based facts that can be used to separate friend from foe, benign behavior from malicious.

There are many problems with heuristic-based detection of malware or network flows that limit its usefulness in detecting compromise. Too many malicious activities can be made to look benign, and signatures can be adapted frequently, making it impossible to catch all malware and malicious system use. What is proving more effective is using machine learning techniques to classify and cluster activity within large, multi-dimensional event data streams. Combined with high-quality and timely IOC and Observable streams, classification and clustering can be enriched and refined, not only to improve detection in the future but also to go back in time and correct past false positives from older heuristic-based detection mechanisms that are still in use today.
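The "go back in time" step can be illustrated with a minimal sketch that re-checks stored events against a newly received IOC set. The `Event` fields, the `relabel_with_iocs` helper and the sample addresses are all hypothetical names chosen for illustration, not part of any particular product or feed format. The sketch only covers the simplest case, where an event the older detector left as benign now matches an IOC; overturning a past false positive would need additional evidence that is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: int   # seconds since epoch (hypothetical field)
    src_ip: str      # source address seen in the flow
    file_hash: str   # hash of any file involved, or "" if none
    verdict: str     # label assigned by the older heuristic detector


def relabel_with_iocs(events, ioc_ips, ioc_hashes):
    """Return (event, new_verdict) pairs where the stored verdict changes."""
    corrections = []
    for ev in events:
        # An event matches if either of its observables appears in the IOC set.
        matches_ioc = ev.src_ip in ioc_ips or ev.file_hash in ioc_hashes
        new_verdict = "malicious" if matches_ioc else ev.verdict
        if new_verdict != ev.verdict:
            corrections.append((ev, new_verdict))
    return corrections


# Illustrative history using documentation-range addresses.
history = [
    Event(1, "203.0.113.5", "aaa", "benign"),
    Event(2, "198.51.100.7", "bbb", "benign"),
    Event(3, "203.0.113.5", "ccc", "malicious"),
]
fixes = relabel_with_iocs(history, ioc_ips={"203.0.113.5"}, ioc_hashes=set())
```

In practice the corrected labels would probably be fed back as training data for the classifier rather than simply rewritten in place.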

Machine learning helps to extract insights (“actionable intelligence,” if you will) from large volumes of data. What type of data? Security-related event data, most commonly server logs, firewall logs, security detective device logs, sandbox analysis output and “IP reputation” data streams. Observables include cryptographic hashes of malicious programs, Windows Registry keys, the IP addresses, ports and protocols involved in network flows, and the sequences of processes initiated on a computer in response to a network connection from inside or outside of the targeted network. In short, machine learning helps identify the signal in an increasingly large sea of noisy event data.

However, it’s not completely straightforward to apply machine learning to cybersecurity challenges. Part of the difficulty in applying machine learning to mine large volumes of data involves selecting the most useful features and validating the results of using these features for effective clustering and classification. Another challenge is distinguishing “normal” from “anomalous.” Both the feature choices and the definitions of anomalous behavior can be refined and validated using threat intelligence from IOC bundles like those accompanying the Mandiant APT1 report, or from threat intelligence feeds available in the computer security research and operations communities and from private-sector service providers. Yet another problem is simply visualizing the topology of large network graphs and deriving the actionable insights mentioned earlier.


Graph analytics, the study of these large network graphs, is a great way to track distributed malware. For example, in the figure above, the central image (labeled “Present”) represents the present state of a network, and each red dot indicates a potentially compromised device. Machine learning techniques, when applied to these graphs, can automatically and efficiently reveal an additional layer of information about distributed malware that other techniques cannot provide. Using machine learning in conjunction with graph analytics makes it possible to estimate what this network with potentially infected devices looked like in the past (the image on the left, labeled “Past”). Users can also extrapolate how the infection would spread across the network in the future (the image on the right, labeled “Future”). This approach to attack pattern extraction is a great way to use the past to help prevent attacks that could happen in the future. And while there currently aren’t any commercial solutions in production, the possibilities of this approach make it an interesting field of study.
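To make the "Present" to "Future" idea concrete, here is a deliberately naive sketch. It is not the learned model described above: it simply assumes that every device adjacent to a compromised device becomes compromised at the next time step, which gives a worst-case reachability estimate. The adjacency map and the function names are hypothetical.

```python
def spread_one_step(adjacency, infected):
    """Return the infected set after one step of worst-case spreading."""
    next_infected = set(infected)
    for node in infected:
        # Every device this node can reach directly is assumed to be hit.
        next_infected.update(adjacency.get(node, ()))
    return next_infected


def extrapolate(adjacency, infected, steps):
    """Return a list of infected sets: the present state, then each future step."""
    states = [set(infected)]
    for _ in range(steps):
        states.append(spread_one_step(adjacency, states[-1]))
    return states


# Hypothetical four-device network: a can reach b, b can reach c and d.
network = {"a": ["b"], "b": ["c", "d"], "c": [], "d": []}
timeline = extrapolate(network, infected={"a"}, steps=2)
```

A realistic model would replace the "always spreads" assumption with probabilities estimated from observed data, which is where the machine learning component would come in.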

For example, a student at the University of Washington Tacoma helped write a network crawler for one of the first successful peer-to-peer botnets. The data collected from his crawler showed the botnet had a randomly distributed graph topology with uniform (and limited) in-degree (the number of connections that link to you) and out-degree (the number of connections that you link to). This meant it had no central nodes that could be taken out to disrupt the botnet, and also that almost no existing graph tools could visualize the network. To get around this limitation, a real-time animation of the act of crawling the botnet to discover the graph was used to help illuminate the actual internal topology.
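The in-degree and out-degree measures mentioned above are simple to compute from a list of directed links. The sketch below assumes the crawler output is available as `(source, destination)` pairs, which is an assumption made for illustration; the peer names are made up.

```python
from collections import Counter


def degree_counts(edges):
    """Count in-degree and out-degree for each node in a directed edge list."""
    in_degree = Counter()
    out_degree = Counter()
    for src, dst in edges:
        out_degree[src] += 1   # src links to one more peer
        in_degree[dst] += 1    # dst is linked to by one more peer
    return in_degree, out_degree


# Hypothetical crawl result: each pair means "first peer links to second peer".
links = [("p1", "p2"), ("p2", "p3"), ("p3", "p1"), ("p1", "p3")]
in_deg, out_deg = degree_counts(links)

# A large gap between the maximum and typical degree would suggest a hub;
# a uniform, limited degree (as reported for the botnet) would not.
max_in_degree = max(in_deg.values())
```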

Extraction of such graph-based features is becoming critical to threat analytics, and we at the Center for Data Science work with industry collaborators, such as Niara, to advance the state of the art by extracting egonet features and developing graph-based anomaly detection algorithms that help address the needs of the cybersecurity industry. Machine learning and graph analytics help us better understand the nature of the complex distributed malware being used to compromise networks every day. And with that better understanding, we can improve defenses and maintain the integrity, availability and confidentiality of our information and information systems. More advanced research is needed to bridge the cybersecurity and machine learning communities, particularly as the complexity of addressable problems and the need for scalable solutions increase in both fields.
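As a rough illustration of what an egonet feature looks like, the sketch below counts the nodes and edges in the egonet of a given device, meaning the device itself, its direct neighbors, and every link among them. It treats the graph as undirected, and the function name and sample graph are hypothetical; the features used in the Center's actual work are not specified here.

```python
def egonet_features(edges, ego):
    """Return node and edge counts for the egonet of `ego` (undirected view)."""
    neighbors = set()
    for u, v in edges:
        if u == ego and v != ego:
            neighbors.add(v)
        elif v == ego and u != ego:
            neighbors.add(u)
    members = neighbors | {ego}
    # Keep each undirected link once, ignoring self-loops.
    ego_edges = {
        frozenset((u, v))
        for u, v in edges
        if u in members and v in members and u != v
    }
    return {"nodes": len(members), "edges": len(ego_edges)}


# Hypothetical graph: a, b and c form a triangle, and d hangs off c.
graph = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
features_a = egonet_features(graph, "a")
features_d = egonet_features(graph, "d")
```

Devices whose egonet counts deviate sharply from those of most other devices are natural candidates for closer inspection by an anomaly detector.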


Dr. Ankur Teredesai is a Professor of Computer Science and Systems at the University of Washington and heads up the Center for Data Science. The content for this blog was created with contributions from Dave Dittrich and Anderson Nascimento, both from the University of Washington.
