Machine Learning is the new buzzword. But what does it actually mean and is it important? Miranda Mowbray has worked in Machine Learning and cybersecurity for many years. At our conference LEVEL, she explained what Machine Learning is and what the implications are of its use in cybersecurity.

Understanding Machine Learning

Historically, software engineering meant fixed software: it takes an input, already knows how to handle that input, and produces an output. Machine Learning is more like a recipe that says ‘add salt to taste’. You have a program, but not all of its values are filled in, so it learns the best values for itself from the results of previous experience.

Supervised and unsupervised Machine Learning

There are two types of Machine Learning: supervised and unsupervised. In supervised Machine Learning, you give the software a training set, which is a set of data for which you know what the answers should be. The machine works out the ‘amount of salt’ that gets its output nearest to those known answers. Once you have that, you have the best version of the software and you can run it on other data sets.
In unsupervised Machine Learning, you don’t know what the answer is going to be. There are several ways to go about this; one is an anomaly detection program. In this case, you give the machine data and it finds the points that just look weird, meaning any values that are very different from the other points in the set.
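As a rough illustration of the difference, here is a minimal Python sketch. It is not taken from the talk: the function names, the single-threshold model and the z-score cutoff are all assumptions chosen to keep the example small.

```python
from statistics import mean, stdev

def fit_threshold(values, labels):
    """Supervised: pick the threshold (the 'amount of salt') that
    classifies the labelled training set most accurately."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(set(values)):
        # Predict "bad" (True) when a value is at or above the candidate.
        correct = sum((v >= candidate) == label
                      for v, label in zip(values, labels))
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold

def find_anomalies(values, z_cutoff=2.0):
    """Unsupervised: flag points far from the rest, using no labels."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > z_cutoff * sigma]

# Hypothetical training set where the right answers are known.
train_values = [1, 2, 3, 4, 10, 11, 12]
train_labels = [False, False, False, False, True, True, True]
threshold = fit_threshold(train_values, train_labels)

# Hypothetical unlabelled data containing one point that looks weird.
unlabelled = [5, 6, 5, 7, 6, 5, 6, 40]
weird = find_anomalies(unlabelled)
```

With this toy data the learned threshold is 10 and the only anomaly is 40. Real systems learn many values at once, but the split is the same: the first function needs known answers, the second does not.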

Miranda Mowbray

Why use Machine Learning for cybersecurity?

Symantec found 350 million new variants of malware last year. They could not all have been handcrafted. There had to be malware that was automatically creating new variants, which is exactly what was happening. The only way to detect them was at a higher level, by looking at the patterns that are common to all the variants. This is where Machine Learning comes in.

Challenge 1. Identifying what is actually bad

The main challenge with using Machine Learning is knowing whether or not what you’re looking at is actually bad. There are millions of emails, files, websites, etc., so being able to differentiate what is bad from what is just odd is key.
In one example, Miranda’s team did some Machine Learning on the suffixes of the websites people were accessing. They found there was a group of employees who had a wide distribution of suffixes and who were visiting a lot of websites ending in ‘.ee’. On further inspection, they realized that ‘.ee’ is the suffix for Estonia: the employees were not infected, they were Estonian.
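The talk does not describe how the suffixes were extracted or analyzed, so the following is only a sketch of the first step under simple assumptions: a hypothetical list of visited URLs for one employee, with the suffix taken to be the last label of the hostname.

```python
from collections import Counter
from urllib.parse import urlsplit

def suffix_counts(urls):
    """Count the final hostname label (e.g. 'com', 'ee') across URLs."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host:
            counts[host.rsplit(".", 1)[-1]] += 1
    return counts

# Hypothetical browsing log for one employee.
visited = [
    "https://example.com/news",
    "https://example.ee/",
    "https://shop.example.ee/cart",
    "https://example.org/",
]
counts = suffix_counts(visited)
```

Here `counts["ee"]` is 2. An unusual distribution like this is only a prompt to investigate; as the Estonian example shows, it says nothing by itself about whether the traffic is bad.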

Rare doesn’t always mean bad. Miranda Mowbray, Data Scientist

Challenge 2. False alarm problem

The false alarm problem has led to companies switching off their very expensive security software. Suppose, for example, that you expect one bad event per million communications and the detection method has a 0.1% false alarm rate. That means that one time in 1,000 it will look at an event which is good and think it is bad. Even if it catches every bad event, this leads to nearly 1,000 false alarms per true alarm.
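The arithmetic behind that ratio can be checked with a few lines of Python. The function name is made up for this sketch, and the `detection_rate` default of 1.0 (every bad event is caught) is an assumption; the article gives only the base rate and the false alarm rate.

```python
def false_alarms_per_true_alarm(bad_rate, false_alarm_rate, detection_rate=1.0):
    """Ratio of false alarms to true alarms for a given base rate.

    detection_rate=1.0 assumes every bad event is detected, which is an
    assumption of this sketch rather than a figure from the talk.
    """
    good_rate = 1.0 - bad_rate
    false_alarms = good_rate * false_alarm_rate   # good events flagged as bad
    true_alarms = bad_rate * detection_rate       # bad events flagged as bad
    return false_alarms / true_alarms

# One bad event per million, 0.1% false alarm rate.
ratio = false_alarms_per_true_alarm(bad_rate=1e-6, false_alarm_rate=0.001)
```

The result is just under 1,000. A detector that misses some bad events would make the ratio worse, not better.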

If you had a burglary alarm that gave you 1,000 false alarms per true alarm, you would switch it off. Miranda Mowbray, Data Scientist

Challenge 3. The true alarm problem

When dealing with big data, you also have a true alarm problem. Security teams often look at DNS events, which occur any time a machine in the corporate network tries to access a website. If you are analyzing 18 billion events a day and one in a million is connected to an attack, that works out to 12.5 true alarms a minute. Again, if you had a burglar alarm that went off that often, you would switch it off, even if you were being burgled.
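The same kind of check works for the true alarm rate. This sketch uses hypothetical names and assumes the events are spread evenly over the day.

```python
MINUTES_PER_DAY = 24 * 60

def true_alarms_per_minute(events_per_day, attack_one_in):
    """Average attack-related events per minute, assuming an even spread."""
    attacks_per_day = events_per_day / attack_one_in
    return attacks_per_day / MINUTES_PER_DAY

# 18 billion DNS events a day, one in a million connected to an attack.
rate = true_alarms_per_minute(events_per_day=18_000_000_000,
                              attack_one_in=1_000_000)
```

That is 18,000 attack-related events a day, or 12.5 a minute, before a single false alarm is counted.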

How to manage the challenges

With so many alarms to assess, it is important to prioritize alarms for high-severity threats: for example, those that could make the company go bust ahead of those that would just send employees more ads.
You also need to be able to cluster alarms so you only get an alarm once for the same underlying problem. If you’re really sure what is happening in the data, you can bypass the security team and send a message directly to the owner of the infected machine and tell them how to fix it.
You do not want to look for every rare thing, only rare events that are consistent with a known attack methodology. Another useful tool is an allow list: put your security department on it, or your Estonians.
It’s important to remember that in a cyber attack there is an adversary: somebody who is actively out to fool you. This makes things harder for Machine Learning, because your threat landscape keeps shifting as your adversary changes tactics to try to get around you.
So you have to keep retraining, and you need a source of information you can rely on about what really is an attack and what isn’t in today’s threat landscape.
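The prioritizing, clustering and allow-listing ideas from this section can be sketched together in Python. Everything here is illustrative: the `Alarm` fields, the host names and the rule that two alarms share an underlying problem when they have the same host and cause are assumptions, not details from the talk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alarm:
    host: str      # machine that raised the alarm
    cause: str     # suspected underlying problem
    severity: int  # higher means more damaging to the company

def triage(alarms, allow_list):
    """Drop allow-listed hosts, keep one alarm per (host, cause),
    and return the rest with the most severe first."""
    clustered = {}
    for alarm in alarms:
        if alarm.host in allow_list:
            continue
        key = (alarm.host, alarm.cause)
        # Keep only the most severe alarm for each underlying problem.
        if key not in clustered or alarm.severity > clustered[key].severity:
            clustered[key] = alarm
    return sorted(clustered.values(), key=lambda a: a.severity, reverse=True)

# Hypothetical alarms from one day.
alarms = [
    Alarm("laptop-17", "adware", 2),
    Alarm("laptop-17", "adware", 2),
    Alarm("server-03", "ransomware", 9),
    Alarm("sec-team-01", "port-scan", 5),
]
queue = triage(alarms, allow_list={"sec-team-01"})
```

Four raw alarms become two: the ransomware alarm first, then a single adware alarm. Real clustering would likely need a fuzzier notion of "same underlying problem" than an exact key match.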

An example of blacklist-evading ransomware

Traditionally, if hackers wanted to infect a device with malware, a mobile device for example, they would need to set up a communication channel between an evil domain and that device. The evil domain would need to be outside the corporate network.
However, if they did this, the security manager in the organization would notice the evil domain and block it. From then on, all communication with that domain would be blocked, cutting off the malware and stopping the attack.
A few years ago, malware authors found an ingenious method in which they never use the same domain twice. Instead of a fixed domain, the malware contains an algorithm. Every day the malware generates a large number of domains and the mobile device tries to connect to them all. Only one of those domains is the evil one, which will encrypt your data. If that one gets blocked, the attackers can use another the next day.
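To make the idea concrete for defenders, here is a deliberately harmless toy generator in Python. It is not the algorithm of any real malware: the seed, the hash-based scheme and the label length are invented for illustration, and the reserved `.example` suffix means none of these names can resolve.

```python
import hashlib
from datetime import date

def toy_domains_for_day(day, count=5, seed="demo-seed", suffix=".example"):
    """Derive a repeatable list of pseudo-random names from the date.

    Anyone who knows the seed and the date can compute the same list,
    which is how both ends of such a scheme could agree on a domain
    without it ever being hard-coded.
    """
    domains = []
    for i in range(count):
        text = f"{seed}:{day.isoformat()}:{i}"
        digest = hashlib.sha256(text.encode()).hexdigest()
        domains.append(digest[:12] + suffix)
    return domains

monday = toy_domains_for_day(date(2017, 1, 2))
tuesday = toy_domains_for_day(date(2017, 1, 3))
```

The two lists share no names, so blocking Monday's domains does nothing on Tuesday. Every name does have the same shape, though, and that kind of regularity is what a detector can latch onto.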

Very cool, evil but very cool. Miranda Mowbray, Data Scientist

When this is mapped out visually, it looks very strange: one mobile device trying to access lots of domains at the same time. For this piece of malware, this happens before the device’s data is encrypted, so you can stop the attack before it happens.
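One simple way to spot that pattern in DNS logs is to count distinct domains per device. The sketch below assumes the events have already been limited to one time window and arrive as hypothetical `(device, domain)` pairs; the cutoff of 50 is arbitrary.

```python
from collections import defaultdict

def devices_with_many_domains(dns_events, max_distinct=50):
    """Return {device: distinct_domain_count} for devices over the cutoff.

    dns_events is an iterable of (device, domain) pairs, assumed to be
    pre-filtered to a single time window.
    """
    seen = defaultdict(set)
    for device, domain in dns_events:
        seen[device].add(domain)
    return {device: len(domains)
            for device, domains in seen.items()
            if len(domains) > max_distinct}

# Hypothetical window: one ordinary phone, one querying 200 different names.
events = [("phone-a", "example.com"), ("phone-a", "example.org")]
events += [("phone-b", f"host{i}.example") for i in range(200)]
suspects = devices_with_many_domains(events)
```

Only `phone-b` is returned. A fixed cutoff like this would also flag busy but innocent machines, which is where prioritizing and allow lists come back in.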

Miranda Mowbray puts Machine Learning into practice

The domains used in that attack don’t look normal and they are different every day. While a human might be able to notice they don’t look normal, how would a machine know? Machine Learning can detect them due to regularities in the domain generator.
If you have examples of the malicious domains, you can use supervised Machine Learning to build a detector. If you don’t have examples, you need to use unsupervised Machine Learning. Miranda’s team took a single machine and looked at five days of data from quite a few networks. They analyzed the lengths of the domains and found 19 variants of this malware, nine of which they had not seen before.
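The article says only that the lengths of the domains were analyzed, not how. As one possible sketch, assuming a generator that emits names of a fixed length, the following Python measures how many of a device's domains share the most common first-label length. The function name and sample domains are hypothetical.

```python
from collections import Counter

def modal_length_share(domains):
    """Fraction of domains whose first label has the most common length."""
    lengths = [len(d.split(".", 1)[0]) for d in domains]
    if not lengths:
        return 0.0
    _, top_count = Counter(lengths).most_common(1)[0]
    return top_count / len(lengths)

# Hypothetical lookups from an ordinary device and from an infected one.
human = ["mail.example.com", "intranet.example.com",
         "www.example.org", "news.example.net"]
generated = ["k3j9x0q2ab1z.example", "p0o9i8u7y6t5.example",
             "zz81kq0v3m4n.example", "a1b2c3d4e5f6.example"]

human_share = modal_length_share(human)
generated_share = modal_length_share(generated)
```

The ordinary device scores 0.5 and the generated list scores 1.0. No labelled examples are needed, which makes this an unsupervised check, though a real detector would combine several such signals.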

We will need to use Machine Learning to protect ourselves. Miranda Mowbray, Data Scientist

Securing the Internet of Things

The Internet of Things is not going anywhere. As the number of objects that can connect to the internet increases, so does the number of products which are completely insecure. If harnessed by attackers, they could cause havoc. To make sense of it all, we will need to use Machine Learning. It is necessary to use it to protect ourselves.
Watch the full presentation by Miranda Mowbray at LEVEL 2017 here.