We’ve published a number of blog posts on the application of machine learning to security (3 Reasons Why Machine Learning is Not a Cybersecurity Pipe Dream; Your Levenshtein distance module told you something was amiss. Now what?; Machine Learning: Providing a Much Needed Assist for Cyber Security; and Clearing the Air: How Machine Learning Can Enable Data-based Cyber Security). Machine learning models learn from example inputs in order to make data-driven predictions. The model chosen and the available training data clearly affect the results produced. But one more ingredient is key to the success of applied machine learning and rarely gets the limelight: features.
What are Features?
For any observation or set of observations, a feature is an attribute or variable that is useful or meaningful to the problem that you are trying to model. For example, assume you are working on natural language processing. In this case, a document, a tweet, or an email are some examples of observations. A phrase or word could be a feature.
It’s important to note that while there could be many attributes, not all are features. The font used in a document may be an attribute, but it wouldn’t be considered a feature because it isn’t useful for the natural language processing problem being modeled. The data may contain many attributes that are not pertinent to the problem you are modeling. Feature selection is the process of choosing the relevant predictors that will be used in your machine learning model and ignoring the ones that add no value on top of what’s already been selected. Reducing the number of features to the minimum needed to describe the problem is important because it shortens training times, makes the model more general by reducing overfitting, and makes the model simpler to understand and explain. An Introduction to Variable and Feature Selection provides a nice checklist that can help with feature selection.
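The idea of dropping predictors that add no value on top of what has already been selected can be sketched with a simple greedy filter. This is only one heuristic among many, and it is not the checklist from the paper mentioned above. The column names, the toy data, the label vector, and both thresholds (min_relevance and max_redundancy) are assumptions made up for the illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def select_features(columns, target, min_relevance=0.1, max_redundancy=0.9):
    """Greedy filter: keep attributes that correlate with the target and
    are not near-duplicates of an attribute that was already kept."""
    relevance = {name: abs(pearson(values, target))
                 for name, values in columns.items()}
    selected = []
    # Visit the most relevant attributes first.
    for name in sorted(relevance, key=relevance.get, reverse=True):
        if relevance[name] < min_relevance:
            continue  # not useful for this problem
        redundant = any(
            abs(pearson(columns[name], columns[kept])) > max_redundancy
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected

# Hypothetical per-document attributes and labels (1 = document of interest).
TARGET = [0, 0, 0, 1, 1, 0]
COLUMNS = {
    "keyword_hits":  [0, 1, 0, 7, 9, 1],
    "keyword_chars": [0, 5, 0, 30, 52, 6],      # nearly duplicates keyword_hits
    "font_size":     [12, 10, 11, 11, 11, 11],  # unrelated to the labels
}

print(select_features(COLUMNS, TARGET))
```

With this toy data the filter keeps keyword_hits, skips keyword_chars as redundant, and skips font_size as irrelevant. A linear correlation check like this misses non-linear relationships and interactions between features, so it is a starting point rather than a complete selection method.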
Another source of features is feature generation, a term used interchangeably with feature extraction. In feature generation, new features are built from existing attributes with the goal of reducing the resources needed to describe a large data set. Analyzing complex data described by a large number of attributes generally requires a large amount of memory and computation power. Feature generation attempts to get around this problem while still enabling models that produce accurate results.
Let’s assume that I want to build a user behavior analytics (UBA) module that finds anomalies in daily behavior. To do so, I would aggregate events by day, for example, how many times a day a user logged in. To generate this new feature, I would start with the raw logs, which contain the timestamps of logins, and determine the number of logins per day, per user. This count is not directly available in the raw log data, so I have to derive it. It is worth deriving because it reveals something about the behavior of users.
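The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the log format (pairs of user name and ISO 8601 timestamp), the sample records, and the function name logins_per_user_per_day are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw login events: (user, ISO 8601 timestamp).
# Real logs would first need to be parsed into this shape.
RAW_LOGINS = [
    ("alice", "2016-03-01T08:02:11"),
    ("alice", "2016-03-01T13:45:30"),
    ("bob",   "2016-03-01T09:10:00"),
    ("alice", "2016-03-02T08:05:42"),
]

def logins_per_user_per_day(events):
    """Count logins for each (user, calendar day) pair."""
    counts = Counter()
    for user, timestamp in events:
        day = datetime.fromisoformat(timestamp).date()
        counts[(user, day)] += 1
    return counts

daily_counts = logins_per_user_per_day(RAW_LOGINS)
for (user, day), n in sorted(daily_counts.items()):
    print(user, day.isoformat(), n)
```

Each (user, day) count is the generated feature. An anomaly detector would then compare a given day's count against that user's usual range.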
The Importance of Domain Knowledge
It’s important to choose your features carefully, because choosing poorly will result in an unreliable model. And while there are feature selection algorithms and off-the-shelf tools to automate the process of choosing the right features, there is no substitute for domain knowledge, both for identifying the right features to build the model on and for generating new ones.
For example, in a multi-stage attack there is two-way communication between an infected computer within the corporate network and the external command and control (C&C) server that will control the behavior of the malware on the computer. Attackers will not use static domains for these servers because once the C&C domain is detected – and it will be if there is continuous suspicious communication to it – security teams can simply blacklist the domain to put an end to these communications.
Instead, attackers resort to domain generation algorithms (DGA) to avoid detection. Legitimate domains are generally created to be human readable (e.g., darkreading.com, prediction.io). Using DGA, attackers automatically generate and register available domains, which are often made up of a random sequence of characters. Malware on an infected computer generates the same list of domains that the C&C operators generated and registered, then cycles through the list to find an active server from which it receives updates and commands. Because these domains come online and go offline quickly, blacklisting a domain once it has been detected, the traditional cyber defense, is no longer effective.
Because data scientists focusing on security are aware of how sophisticated attackers operate, they know that whether a domain is made up of a random sequence of characters is a useful feature on which to train machine learning-based models. Being able to start with a hunch is key to quickly training models; otherwise, the model may end up being trained on irrelevant features (e.g., the length of the domain or the suffix used). And while there is a small probability that you may stumble upon a successful model through trial and error, a far more efficient way is to start with a hunch based on domain expertise.
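One possible way to turn the hunch about random-looking domains into a number is the Shannon entropy of the characters in the domain name. This is an assumption chosen for illustration, not a method described in this post; real detectors often rely on other signals as well, such as character n-gram statistics. The function names and the second sample domain are made up for the sketch.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def randomness_feature(domain):
    """Entropy of the leftmost label, e.g. 'darkreading' in 'darkreading.com'."""
    label = domain.lower().split(".")[0]
    return shannon_entropy(label)

# 'darkreading.com' is taken from the text above; the second name is a
# made-up stand-in for a machine-generated domain.
for name in ("darkreading.com", "xk7qz9vbw3ftl.com"):
    print(name, round(randomness_feature(name), 2))
```

The random-looking name scores higher because its characters are spread more evenly. Entropy alone is a rough proxy: very short labels score low regardless of how they were produced, so it would normally be one feature among several.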
Data scientists must also account for the fact that certain well-known, legitimate services use automatically generated domains for various reasons. For example, Akamai, a CDN, uses machine-generated domains for traffic management and load balancing. Features must therefore be selected so that the model not only identifies the bad domains but also recognizes when machine-generated domains are good.
Modern IT systems produce huge volumes of data, which promise to help businesses make smarter, more informed security decisions. However, at the end of the day, it’s not about data but about humans working with that data to make meaningful connections. When used alone, the mathematical models (e.g., cryptography to ensure secrecy) that previously aided security are no longer sufficient to protect against sophisticated attacks. Machine learning-based cyber security is the way of the future, but it relies on meaningful features being chosen. Data scientists with intimate knowledge of the various fields that comprise cyber security (e.g., network security, information security, and application security) are best positioned to select the relevant features that describe patterns inherent in data, thereby enabling reliable models that produce accurate results and improve with experience.