Machine Learning Defined
TOM FIELD: I find that when I'm out talking with security leaders now, the concept of machine learning comes up much more frequently. What do you find in your own conversations that people tend to misunderstand about machine learning?
STEPHEN NEWMAN: Actually, in my experience most really don't understand machine learning beyond the general concept that a computer predicts some outcome based on data. Perhaps a quick analogy would help. The best machine learning system in the world is really the human brain. We take in data rapidly and automatically process it into an emotion, a decision or a course of action. For example, as you walk down a city street, you notice your surroundings, right? It's a sunny day, the sidewalk is clear of litter, people walking next to you are dressed in business attire, shops and businesses are open. Well, your brain decodes all this information and tells you that you can continue to walk down the street safely. You may not be consciously thinking about this; it's really a subconscious process.
However, if you turn the corner and walk maybe half a block down and suddenly the street is darker due to the surrounding buildings, or there's litter on the ground, or you smell trash and the windows of the businesses have bars on them, and there's a group of men in hoodies at the end of the street all gathered closely together ... well, your brain automatically takes in this data, perceives danger and guides you to turn around quickly rather than proceed.
What's interesting about the two scenarios is no single piece of data in either example means you're either safe or in danger, but as you put all those pieces together, your brain guides your actions.
Well, software-based machine learning attempts to emulate the way the brain works in processing data. First, in software-based machine learning we gather data; in the brain, as in the example above, we gather information about our surroundings. Second, our brain creates features that allow us to describe the differences in that information. So in the example above, the features would be the presence of litter, what people are wearing, light versus dark on the street, and how open the shops look for business. And then third, we identify algorithms to help guide our decision-making process based on the data collected and the feature descriptions. So, for example, we might weight the presence of litter, what people are wearing, the lighting and whether shops are open differently, and then combine all these features together into an algorithm. Now we have a statistical model, let's call it a classifier, in our brain, so that if we ever come across these features again, we can rapidly decide to proceed or to turn around and walk away.
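To make the weighting idea concrete, here is a minimal sketch in Python of the kind of weighted-feature classifier Newman describes. The feature names, weights and bias are hypothetical, chosen only to illustrate how individually inconclusive signals combine into a single decision.

```python
# A minimal sketch of the idea described above: hand-picked features,
# per-feature weights, and a simple classifier that combines them.
# Feature names and weights are hypothetical, for illustration only.
import math

# Features observed on a street, encoded as 0.0 (absent) to 1.0 (present).
FEATURE_WEIGHTS = {
    "litter_present": 1.5,      # litter on the ground
    "street_is_dark": 1.2,      # light versus dark
    "bars_on_windows": 1.0,     # how open the shops look
    "business_attire": -1.3,    # people dressed for work suggests safety
    "shops_open": -1.5,         # open businesses suggest safety
}
BIAS = -0.5  # baseline tendency before any evidence is considered

def danger_score(features: dict[str, float]) -> float:
    """Combine weighted features into a probability-like danger score."""
    z = BIAS + sum(FEATURE_WEIGHTS[name] * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic squashing to (0, 1)

# Scenario 1: the sunny, clean street from the interview.
safe_street = {"litter_present": 0, "street_is_dark": 0, "bars_on_windows": 0,
               "business_attire": 1, "shops_open": 1}
# Scenario 2: the darker street around the corner.
risky_street = {"litter_present": 1, "street_is_dark": 1, "bars_on_windows": 1,
                "business_attire": 0, "shops_open": 0}

print(f"safe street danger:  {danger_score(safe_street):.2f}")   # low score
print(f"risky street danger: {danger_score(risky_street):.2f}")  # high score
```

Note that no single feature decides the outcome; only the combined, weighted evidence pushes the score toward "safe" or "danger," which is the point of the analogy.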
Advanced Threat Protection
FIELD: Stephen, that's a useful definition. Now how do you define machine learning in the context of advanced threat protection?
NEWMAN: Well, let's take the analogy we just created and apply it to advanced threats. Damballa's Failsafe Solution for Enterprises has behavioral statistical models, or what we call "classifiers," that are built from our machine learning systems. Those classifiers consider a lot of different features that describe the behavior of any device in an enterprise network, and the communications from that device over time, to determine if that device is compromised.
So let's say, for example, that Tom, your laptop communicates to a website that's legitimate. Let's call it maybebad.com, and we labeled it maybebad.com because let's say we know that it's been hacked by a threat actor and it also acts as a command-and-control proxy that compromised machines use to talk to that threat actor. Well, like the litter example above, the fact that your laptop, Tom, is talking to this website maybebad.com doesn't really mean you're infected, but it is an interesting thing to consider given other context clues. So let's say Damballa's Domain Reputation System, also built from machine learning systems, assigns a gray reputation to this website maybebad.com. This means that it's not necessarily malicious, but it also isn't necessarily benign.

So then let's study the communication behavior of your laptop, Tom, to maybebad.com. In studying that communication activity over time, we see that over the last three hours your laptop has visited maybebad.com on average every 19 minutes with a standard deviation of 38 seconds. Well, what Damballa Failsafe for Enterprise does is this: The machine learning classifiers within the system look at those communications from your device over time and determine if they are statistically more like those of a piece of software rather than a user. Well, now that we know Tom's device is talking to a possibly malicious website in an automated fashion, armed with this information we can quickly conclude that Tom's device is compromised. That's machine learning at work, taking in those different feature sets and bringing in context clues to make those decisions.
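To illustrate the timing analysis Newman describes, here is a simplified Python sketch of how regular, low-variance visit intervals can be separated from human browsing. It is not Damballa's actual classifier; the threshold, function name and timestamps are hypothetical.

```python
# A simplified illustration of the "software vs. user" distinction:
# regular, low-variance visit intervals look like automated beaconing,
# while human browsing is far more irregular. This is only a sketch of
# the underlying intuition, not any vendor's production classifier.
import statistics

def looks_automated(visit_times_s: list[float],
                    max_rel_deviation: float = 0.05) -> bool:
    """Return True if inter-visit intervals are suspiciously regular.

    visit_times_s: timestamps (seconds) of visits to one domain, in order.
    max_rel_deviation: hypothetical threshold on stdev/mean of the intervals.
    """
    intervals = [b - a for a, b in zip(visit_times_s, visit_times_s[1:])]
    if len(intervals) < 3:
        return False  # not enough evidence either way
    mean = statistics.mean(intervals)
    stdev = statistics.stdev(intervals)
    return stdev / mean < max_rel_deviation

# Laptop hitting maybebad.com roughly every 19 minutes (1,140 s),
# with only tens of seconds of jitter -- much more like software than a user.
beacon = [0, 1100, 2280, 3390, 4560, 5650, 6840, 7950, 9120]
human = [0, 300, 2100, 2200, 6500, 6600, 12000, 13000, 20000]

print(looks_automated(beacon))  # True  -> consistent with automated beaconing
print(looks_automated(human))   # False -> irregular, more like a person
```

In practice this timing signal would be just one feature among many; combined with the gray domain reputation, it is the kind of context clue the classifier weighs before concluding a device is compromised.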
Getting the Most from Machine Learning
FIELD: So, Stephen, within an enterprise today, what do you find is necessary to make machine learning most efficient?
NEWMAN: You know, for machine learning itself, whether in an enterprise or anywhere else, there are some things that are critical that you must have. You need a large dataset of unbiased data. When I talk about unbiased data, imagine the example of the street. If we only ever trained a machine learning system on dangerous streets full of trash and litter and people in hoodies, then you're only ever going to train on the bad set and never be able to differentiate it from the good. So you need a large dataset of unbiased data, both good and bad.
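As a small illustration of the unbiased-data point, the sketch below checks whether a labeled dataset actually contains both classes, and in reasonable proportions, before any training happens. The labels and the minimum-share threshold are hypothetical.

```python
# A small sketch of the "unbiased data" point: if every training example is
# labeled bad, the best a model can do is predict bad for everything.
# The labels and the balance threshold are illustrative only.
from collections import Counter

def check_label_balance(labels: list[str], min_fraction: float = 0.2) -> None:
    """Warn if any class falls below a (hypothetical) minimum share of the data."""
    counts = Counter(labels)
    total = sum(counts.values())
    for label, count in counts.items():
        share = count / total
        print(f"{label}: {count} examples ({share:.0%})")
        if share < min_fraction:
            print(f"  warning: '{label}' is underrepresented; "
                  f"the model may struggle to recognize it")
    if len(counts) < 2:
        print("warning: only one class present -- the model cannot learn "
              "to differentiate good from bad at all")

# Biased set: only "dangerous streets" were ever collected.
check_label_balance(["dangerous"] * 1000)
# Balanced set: both outcomes are represented.
check_label_balance(["dangerous"] * 500 + ["safe"] * 500)
```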
What you also need are data scientists. This is a critical element. Machine learning doesn't happen without expertise, right, so you actually need two sets of expertise. You need the data scientists who are going to build the system and have deep understanding of the different types of machine learning models you can apply, but you also need subject matter experts who can label the data they're seeing and help guide the data scientists in which features they actually pick for the machine learning model.
Then the last piece is that you need a plan for how you're going to actually apply the results from the machine learning. You can't just assume that you're going to build a machine learning system and have it solve the problem on its own. You actually need a plan for how you take the results from the machine learning system and apply them to your enterprise network.
Data Scientists Needed
FIELD: I want to come back to this topic of data scientists. Within an organization, how should we rank the importance of having distinct data scientists versus training our existing staff to work with the machine learning tools?
NEWMAN: Well, frankly, it's critical. Let's consider some of the challenges. Often I hear of big data systems where a vendor will sell an enterprise a specialized Hadoop cluster to collect all the enterprise data, and then your employees can use it to perform machine learning. Well, it usually takes a data scientist and a subject matter expert to get the most out of these systems, and those are employees that most companies don't have. So the problem is really on two fronts: How many specialized resources must you hire, and do you actually have a way of applying the results of the findings?
Practical Tips
FIELD: Well, this has been great context, but if we get practical with this now, what are some ways that you recommend organizations apply machine learning specifically for advanced threat protection?
NEWMAN: Well, our model is not to deploy a big Hadoop cluster within the enterprise, but instead to do the machine learning in the cloud with effectively infinite computing horsepower, and then apply the results, using those statistical models, those classifiers, against network traffic. Those statistical models and classifiers are actually fairly lightweight. This is efficient and works within the computing constraints, resource constraints and time constraints that exist in the enterprise, plus it doesn't require the customer to hire additional resources.
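As a rough sketch of this train-in-the-cloud, score-on-premises pattern, the example below ships a small set of classifier weights and applies them to traffic features with a simple dot product. The feature names, weights and JSON format are hypothetical and are not Damballa's actual architecture.

```python
# A rough sketch of the pattern described above: heavy training happens in
# the cloud, and only the resulting lightweight classifier (a small set of
# weights) is shipped to the enterprise to score network traffic.
# Feature names, weights, and the JSON format are hypothetical.
import json
import math

# What the cloud side might ship down after training: just coefficients.
SHIPPED_MODEL = json.loads("""{
    "bias": -2.0,
    "weights": {
        "gray_reputation_domain": 1.6,
        "beacon_like_timing": 2.4,
        "rare_user_agent": 0.8
    }
}""")

def score_connection(features: dict[str, float], model: dict) -> float:
    """Score one device/domain pair with the shipped classifier (a dot product)."""
    z = model["bias"] + sum(model["weights"].get(name, 0.0) * value
                            for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Scoring is cheap enough to run inline against live traffic summaries,
# which is why the on-premises side needs no Hadoop cluster.
suspect = {"gray_reputation_domain": 1.0, "beacon_like_timing": 1.0,
           "rare_user_agent": 0.0}
print(f"compromise likelihood: {score_connection(suspect, SHIPPED_MODEL):.2f}")
```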
Lessons Learned from Customers
FIELD: Stephen, let's talk a little bit about your customer experience. What have some of Damballa's customers learned from their own adventures with machine learning?
NEWMAN: We do have some customers that have their own machine learning capabilities. They use our technology as a ground truth or build additional techniques on top of our findings. However, most customers simply don't have the capacity to do this, so they use our technology to do the dirty work and then build their incident response workflow around our findings around advanced threats.
Where to Start?
FIELD: So, the final question I have for you, and it might be the most important one: Where do you begin? Understanding what it is that we want to achieve with machine learning and advanced threat protection, what's the starting point?
NEWMAN: Well, as I mentioned, the vast majority of enterprises will never have the resources to build machine learning systems internally. The question of where to begin is really a matter of the results they are trying to achieve with their security program. If an enterprise is evaluating advanced threat detection solutions, knowledge of the role machine learning plays in the underlying technology is helpful. For example, if you're comparing solutions based on machine learning, here are some questions you should really ask those vendors. First, what volume of data do you monitor daily? Because remember: We need a large dataset of unbiased data. Where does the data come from? A single enterprise, a handful of enterprises, ISPs, etc.? How globally dispersed are those sources? Does the vendor train on live network traffic? Does the vendor actually publish their machine learning research? One of the things that's unique about Damballa is that we actually publish our machine learning research so people can read and learn from those systems. And then lastly, are your machine learning systems simply anomaly-based detection techniques that are going to generate a lot of alerts you don't really have much confidence in, or are they built on statistical behavioral models that produce certainty with their findings? Naturally, what we focus on is building machine learning systems that have very clear statistical models for how a compromised device will communicate.