Machine learning is not a feature customers are asking for. No one ever asks for data science, but they are asking for features that will utilise that technology.
Machine learning is an enabler. With it you can have confidence in the alerts sent for rogue Wi-Fi hotspots, and you can look for Shadow IT without having to trawl through logs or set policies in advance.
At Wandera we have invested heavily in this area for a year or so, and it is coming to fruition with malware detection. With so many variants and so many authors, it’s hard to stay ahead of malware, especially if you are focusing on signatures.
Most companies will say they are doing some type of machine learning. We have, in the past, been quite secretive about how we do it. However, at our Level event in May, we asked the experts, our data science team, to give us a little more insight into the inner workings of our machine learning system, MI:RIAM.
What are we doing with machine learning?
Malware detection is the perfect example of a supervised learning problem. By taking various examples of known malware, along with the knowledge of what good apps look like (both from third parties and from our own gateway), you can start to train an algorithm to distinguish between them.
This not only took the weight off our security team, but also helped in finding new malware. Polymorphic malware is on the rise. At the moment it is mostly seen on desktop, but it is only a matter of time before it moves to mobile. It is important to learn the patterns that link known samples and to extrapolate from them to find new malware.
How do you differentiate between ‘goodware’ and malware?
There are various data sources our data science team use. One is Koodous, an open source database where anyone can upload goodware or malware. It holds in the region of 20 million apps, which makes it a great source of information. Another well-known dataset is VirusTotal.
There are three ways to distinguish between good and bad apps on a device.
The first is feature extraction: what our Wandera app can find out about other apps on the device. The second is static analysis: here you take an app, download it and take it to the back end for a more detailed analysis, such as decompiling the binaries and executables in the app. The third is running the app in a sandbox or virtual device to look at its file usage and network traffic.
What algorithms are you using and do these change day to day?
The team have experimented with a lot of different algorithms. The two currently used in production are support vector machines and logistic regression. The advantage of logistic regression is that its output is very interpretable.
It gives you the probability that a particular sample is malware. That probability gives you the freedom to be more sure that something is malware before raising the alarm, which helps to bring down the number of false positives.
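To make the idea of a probability and an alert threshold concrete, here is a minimal sketch in plain Python. It is not our production model: the three features, the toy training data and the 0.9 threshold are all made up for illustration, and the training loop is a bare-bones gradient descent.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1), in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(samples, labels, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the score
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def malware_probability(weights, bias, x):
    """Probability, according to the fitted model, that sample x is malware."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def should_alert(probability, threshold=0.9):
    """Only raise the alarm when the model is sufficiently sure (threshold is an assumption)."""
    return probability >= threshold

# Hypothetical binary features: [requests_sms_permission, uses_obfuscation, from_official_store]
X = [
    [1, 1, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0],  # labelled malware
    [0, 0, 1], [0, 0, 1], [1, 0, 1], [0, 1, 1],  # labelled goodware
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train_logistic(X, y)
p_bad = malware_probability(weights, bias, [1, 1, 0])
p_good = malware_probability(weights, bias, [0, 0, 1])
print(f"suspicious sample: {p_bad:.3f}, alert={should_alert(p_bad)}")
print(f"benign sample:     {p_good:.3f}, alert={should_alert(p_good)}")
```

Raising the threshold trades missed detections for fewer false alarms, which is exactly the freedom the probability output provides.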
What do you do once you’ve identified malware?
The first level of malware detection is ‘Do we think it’s bad?’. Then the question is ‘How bad?’ or ‘What kind of bad?’. By working on multiclass classification it is possible to know what kind of behaviour to expect from each type of malware, i.e. is it ransomware, or is it just going to spam you with adverts? One is clearly worse than the other and needs to be handled differently.
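As a rough sketch of that second step, the snippet below turns per-family scores into probabilities with a softmax and maps the most likely family to a response. The family names other than ransomware and adware, the scores and the response policy are all hypothetical, chosen only to illustrate the idea.

```python
import math

def softmax(scores):
    """Turn per-class scores into probabilities that sum to 1."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {family: math.exp(s - top) for family, s in scores.items()}
    total = sum(exps.values())
    return {family: e / total for family, e in exps.items()}

# Hypothetical response policy: different families are handled differently.
RESPONSES = {
    "ransomware": "quarantine device and alert admin",
    "adware": "notify user and recommend uninstall",
}

def classify_family(scores):
    """Return the most likely malware family and the model's confidence in it."""
    probs = softmax(scores)
    family = max(probs, key=probs.get)
    return family, probs[family]

def respond(scores):
    """Pick a response for the most likely family, with a fallback for unknown ones."""
    family, _confidence = classify_family(scores)
    return family, RESPONSES.get(family, "send for manual review")

# Made-up classifier scores for one sample.
sample_scores = {"ransomware": 3.0, "adware": 0.5, "spyware": 0.1}
print(respond(sample_scores))
```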
What kind of infrastructure is needed to keep this running?
The main challenge specifically for malware detection is dealing with a huge amount of data. A normal training set for these algorithms might be 8 million features wide and 2-4 million samples deep. That is a huge table, with somewhere in the region of 30 trillion data points at the upper end.
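The arithmetic behind that figure is easy to check. The sketch below multiplies features by samples; the storage estimate assumes a dense table with 4 bytes per value, which is an assumption for illustration and not a description of how the data is actually stored.

```python
def table_cells(features, samples):
    """Number of data points in a features-by-samples table."""
    return features * samples

def dense_terabytes(cells, bytes_per_cell=4):
    """Size of a dense table in terabytes (4-byte values are an assumption)."""
    return cells * bytes_per_cell / 1e12

FEATURES = 8_000_000
for samples in (2_000_000, 4_000_000):
    cells = table_cells(FEATURES, samples)
    print(f"{samples:,} samples: {cells:,} data points, "
          f"about {dense_terabytes(cells):.0f} TB if stored densely")
```

So the range quoted works out at 16 to 32 trillion data points, which is why a single machine is not an option.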
You then have, realistically speaking, a thousand different candidate models. Because of the changing nature of malware, with typically 20,000-50,000 new malware samples a day, the models need to be trained in parallel so the best one can be picked for the needs at that point. Otherwise the model is out of date from the moment it goes into production.
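The pattern of evaluating many candidates in parallel and keeping the best can be sketched on a very small scale. In this toy version each "model" is just a decision threshold, the validation data is made up, and a local thread pool stands in for a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up validation set: (suspicion_score, is_malware)
VALIDATION = [
    (0.1, 0), (0.2, 0), (0.35, 0), (0.4, 1),
    (0.6, 1), (0.7, 0), (0.8, 1), (0.9, 1),
]

def evaluate_candidate(threshold):
    """Score one candidate: the fraction of validation samples it classifies correctly."""
    correct = sum(
        1 for score, label in VALIDATION
        if (score >= threshold) == bool(label)
    )
    return threshold, correct / len(VALIDATION)

def pick_best(candidates, max_workers=4):
    """Evaluate all candidates concurrently and return the (candidate, accuracy) pair that wins."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(evaluate_candidate, candidates))
    return max(results, key=lambda result: result[1])

candidates = [0.2, 0.4, 0.6, 0.8]
best_threshold, best_accuracy = pick_best(candidates)
print(f"best candidate: threshold={best_threshold}, accuracy={best_accuracy}")
```

Re-running this selection as new samples arrive is what keeps the deployed model from going stale.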
In order to make our services accurate, we need to run all those processes in real time and keep all the models we deploy to production constantly updated. To achieve that, we use a big data computing architecture.
Apache Spark is a top player in the big data field right now. Cloud computing means it is possible to use Apache Spark and deploy a cluster of 1,000 computing nodes, with each node matching a normal laptop’s computing power, in order to achieve that task in real time.
What are we doing with anomaly detection?
There are three general types of anomaly that we look for. The first is a point anomaly: for example, someone’s app is using a lot of data that wasn’t expected in that time frame. The second is a contextual anomaly: say someone is roaming and their device does something that is not quite right within that context.
Lastly, there is the collective anomaly, which is more about the sequence of the data. This is where machine learning can see all the trends and seasonalities, pick out one thing and say that it is bad. 99% of the data seen is normal, so you are looking for a needle in a haystack.
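The three kinds of anomaly can be illustrated with simple z-score checks. This is a deliberately simplified sketch and not how MI:RIAM works: the usage numbers, the limit of 3 standard deviations and the window size are all assumptions.

```python
from statistics import mean, stdev

def is_point_anomaly(history, value, z_limit=3.0):
    """A single reading that is far from the historical mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_limit

def is_contextual_anomaly(history_by_context, context, value, z_limit=3.0):
    """The same reading judged against the baseline for its context (e.g. roaming)."""
    return is_point_anomaly(history_by_context[context], value, z_limit)

def has_collective_anomaly(series, baseline, window=3, z_limit=3.0):
    """A run of readings that is unusual together, even if no single one is.

    The mean of `window` independent readings has standard deviation
    sigma / sqrt(window), so a sustained small shift becomes detectable.
    """
    mu, sigma = mean(baseline), stdev(baseline)
    limit = z_limit * sigma / window ** 0.5
    return any(
        abs(mean(series[i:i + window]) - mu) > limit
        for i in range(len(series) - window + 1)
    )

# Made-up daily data usage in MB for one app.
home = [50, 52, 48, 51, 49, 50, 53, 47]
roaming = [5, 6, 4, 5, 5, 6, 4, 5]
usage = {"home": home, "roaming": roaming}

print(is_point_anomaly(home, 90))                        # far above normal usage
print(is_contextual_anomaly(usage, "home", 50))          # normal at home
print(is_contextual_anomaly(usage, "roaming", 50))       # same value, unusual when roaming
print(has_collective_anomaly([50, 54, 55, 54, 50], home))  # sustained mild elevation
```

In the last case no single reading would trip the point check, but three elevated days in a row do, which is the essence of a collective anomaly.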
The main driver behind anomaly detection is understanding what is going on right now. In our case that means at the app level, at the device level and in general per customer. Understanding what is normal is where the power comes from to identify what is bad, or might be bad, so that we can score it and give it a certain confidence.
How are we going to operate anomaly detection at scale?
The real challenge is doing everything in real time. That means real-time monitoring of the applications, the domains, the user in general, a group of users, the customer as a group, and then the global level. On top of that, we need analytics, alerting and reporting, again, all in real time.
To achieve that, the team moved from Apache Spark to Apache Flink, a more dedicated framework built specifically for real-time streaming event processing. That really boosted the ability to compute all these metrics in a massively parallel way and gave us confidence in the real-time detection and alerting.
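To give a feel for what per-key, real-time scoring involves, here is a plain Python sketch that keeps a running baseline for each key and scores every event as it arrives. It does not use Flink's API. It only mimics the idea of keeping a small piece of state per key, and the key layout, limits and event values are assumptions.

```python
from collections import defaultdict

class RunningStats:
    """Welford's online algorithm: running mean and variance in constant memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stdev(self):
        if self.count < 2:
            return 0.0
        return (self._m2 / (self.count - 1)) ** 0.5

def process_stream(events, z_limit=3.0, min_history=5):
    """Score each event against the baseline for its key, then fold it into that baseline."""
    stats = defaultdict(RunningStats)
    alerts = []
    for key, value in events:
        baseline = stats[key]
        if baseline.count >= min_history and baseline.stdev > 0:
            z = abs(value - baseline.mean) / baseline.stdev
            if z > z_limit:
                alerts.append((key, value))
        baseline.update(value)
    return alerts

# Hypothetical keys: (customer, device, app); values are MB used per interval.
key_a = ("customer-1", "device-7", "mail")
key_b = ("customer-2", "device-3", "maps")
events = [(key_a, v) for v in (10, 11, 9, 10, 12, 10)]
events += [(key_b, 40), (key_a, 100), (key_b, 42)]

print(process_stream(events))
```

Because each key only needs three numbers of state, the work can be split across many machines by key, which is the kind of parallelism a streaming framework provides.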
We’ve come a long way in 2017 but as we fine-tune the technology there’s still a lot of room for development, which is very exciting.