How to catch malware with artificial intelligence

February 13, 2020 by Emmanuel Tsukerman

In my first week in cybersecurity, I was in for a surprise. As a data scientist fresh out of school, I was recruited by an Internet-of-Things (IoT) security startup, and part of my job was to conduct customer interviews with large hospital information security managers to get a sense of market needs. 

A clear pattern emerged: every IT security manager expressed the same fear, the fear of ransomware.

I wondered at the time, “Since ransomware is a virus, why doesn’t antivirus stop ransomware?” Digging deeper to gain clarity, I learned that antivirus (AV) relied on:

  1. Signatures: snippets of code or derivatives thereof that indicate a sample is malicious — such signatures are extracted and catalogued when past history of malicious behavior is known
  2. Heuristics: rule-based indicators that the sample is malicious, such as calls to dangerous functions, verdicts from execution in a lab environment and similarity scores with respect to other malicious samples
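
To make the two mechanisms concrete, here is a minimal sketch of a signature check and a rule-based heuristic side by side. It is a toy illustration, not how any real AV engine is implemented: the signature bytes, the list of "dangerous" API names, and the threshold are all hypothetical values chosen for the example.

```python
# Toy illustration only: the signature, API names, and threshold are hypothetical.
SIGNATURES = {
    "demo-ransom-note": b"YOUR FILES HAVE BEEN ENCRYPTED",
}

# Heuristic rule: references to functions that are often abused by malware.
SUSPICIOUS_CALLS = (b"CryptEncrypt", b"DeleteFileW", b"CreateRemoteThread")
HEURISTIC_THRESHOLD = 2  # assumed cutoff: flag when 2+ suspicious calls appear


def signature_hits(sample: bytes) -> list[str]:
    """Return the names of all catalogued signatures found in the sample."""
    return [name for name, pattern in SIGNATURES.items() if pattern in sample]


def heuristic_score(sample: bytes) -> int:
    """Count how many distinct suspicious API names the sample references."""
    return sum(1 for call in SUSPICIOUS_CALLS if call in sample)


def scan(sample: bytes) -> str:
    """Check signatures first, then fall back to the heuristic rule."""
    if signature_hits(sample):
        return "malicious (signature)"
    if heuristic_score(sample) >= HEURISTIC_THRESHOLD:
        return "suspicious (heuristic)"
    return "no detection"


known = b"MZ...YOUR FILES HAVE BEEN ENCRYPTED..."
uncatalogued = b"MZ...CryptEncrypt...DeleteFileW..."
clean = b"MZ...hello world..."

print(scan(known))         # malicious (signature)
print(scan(uncatalogued))  # suspicious (heuristic)
print(scan(clean))         # no detection
```

The second sample shows why heuristics supplement signatures: it matches no catalogued pattern, yet the rule still flags it.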

In 2017, the WannaCry outbreak made it evident that AVs weren’t cutting it, leading to headlines like "Government under pressure after NHS crippled in global cyberattack as weekend of chaos looms" and "Unprecedented cyberattack hits 200,000 in at least 150 countries, and the threat is escalating."

If a shred of a doubt remained about the ineffectiveness of AVs against ransomware, the NotPetya attack a month later, assessed at more than 10 billion dollars in damage, left none.


How ransomware beats signatures and heuristics


These outbreaks succeeded because attackers had found ways to defeat AVs, leaving signature and heuristic methods in the dust.

Signatures have an important place in AVs because they reliably block known viruses, but nearly a million new malware samples were being released each day. There was no way signatures alone could keep enterprises, customers and ourselves secure. In addition, obfuscation techniques rendered signatures even less effective.
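
As a rough illustration of why obfuscation hurts signatures, the sketch below applies a single-byte XOR to a payload. The signature bytes and the key are hypothetical, and real obfuscation (packers, polymorphic engines) is far more elaborate, but the effect on a byte-pattern match is the same in spirit.

```python
# Toy illustration only: the signature and XOR key are hypothetical.
SIGNATURE = b"YOUR FILES HAVE BEEN ENCRYPTED"
KEY = 0x5A


def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with a one-byte key; applying it twice restores the input."""
    return bytes(b ^ key for b in data)


payload = b"...YOUR FILES HAVE BEEN ENCRYPTED..."
obfuscated = xor_bytes(payload, KEY)

print(SIGNATURE in payload)                   # True: the plain payload matches
print(SIGNATURE in obfuscated)                # False: the same content no longer matches
print(xor_bytes(obfuscated, KEY) == payload)  # True: nothing was lost
```

The obfuscated bytes carry exactly the same content, recoverable at run time, yet the catalogued pattern never appears in the file as stored.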

Heuristics, too, do not suffice. Though these offer a great and important supplement to signatures by being able to catch malware that has not been catalogued, heuristics encounter significant challenges against more sophisticated strains. For instance, strains that look nothing like previously seen samples, that contain anti-virtualization techniques or that behave differently in different environments are able to evade most heuristic methods.

Through a careful study of the chinks in the armor of AVs at the time, it became evident to us that a machine learning approach was necessary to keep our way of life secure. That led to us pivoting to engineer and launch an award-winning anti-ransomware product.


Fighting ransomware through machine learning


The most critical advantages machine learning offers in malware detection are:

  • Ability to detect zero-day samples
  • Speed of detection

With so many new samples being released each day, the need for AVs to be able to catch zero-day samples transformed from being of secondary importance to becoming a central requirement. 

Compared to certain heuristic approaches, such as bare metal analysis or execution in a virtual machine, machine learning allows us to analyze a sample quickly and in real time. A signature-based approach, meanwhile, suffers from a “patient zero” problem: someone must be infected, and recognized as such, before a malicious signature is created and propagated.

When architecting the cloud anti-malware machine learning service at Palo Alto Networks (PAN), I observed first-hand the need for effective and accurate machine learning for malware detection. Despite the impressive bare metal and virtual machine infrastructure at PAN, terabytes of samples flowing through customer firewalls meant that no infrastructure could possibly handle the large queue of samples for analysis that builds up from the file traffic of over 30,000 enterprise customers. 

The need for a machine learning system stemmed directly from customers’ need to be safe and secure from new malware — and to have a verdict in real time. 


The adoption of machine learning


As a result of the advantages mentioned above, every commercial AV now relies on machine learning. 

For those interested in entering the profession, the skills of setting up a malware analysis lab and performing basic static and dynamic analyses are fundamental, and are covered in a friendly and engaging manner in my Cybersecurity Data Science Learning Path for Infosec Skills.

There are many variations on the machine learning and data science techniques being utilized, and these account for much of the AV’s effectiveness in dealing with new samples:

  • Some employ machine learning-based static analysis, while others employ machine learning-based dynamic analysis or a hybrid approach
  • Some utilize tree-based models, while others rely on neural networks and deep learning
  • Some utilize hand-engineered features, while others extract features automatically
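
One combination from the list above, static analysis with hand-engineered features, can be sketched in a few lines. Everything here is an assumption made for illustration: the two features (byte entropy and printable-character ratio), the synthetic "text-like" and "packed-like" training samples, and the nearest-centroid model standing in for the tree-based or neural models a real product would use.

```python
import math
import random
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())


def printable_ratio(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII characters."""
    if not data:
        return 0.0
    return sum(32 <= b < 127 for b in data) / len(data)


def extract_features(data: bytes) -> tuple[float, float]:
    """Hand-engineered static features, both scaled to the range 0..1."""
    return (byte_entropy(data) / 8.0, printable_ratio(data))


def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def predict(features, centroids) -> str:
    """Return the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))


# Synthetic stand-ins for a labeled corpus (NOT real malware or real data).
rng = random.Random(0)
text_like = [(b"config=%d; mode=normal; " % i) * 20 for i in range(5)]
packed_like = [bytes(rng.randrange(256) for _ in range(480)) for _ in range(5)]

centroids = {
    "benign": centroid([extract_features(s) for s in text_like]),
    "malicious": centroid([extract_features(s) for s in packed_like]),
}

new_sample = bytes(rng.randrange(256) for _ in range(480))
print(predict(extract_features(new_sample), centroids))  # malicious
```

Because the model scores features rather than matching exact bytes, it can label a sample it has never seen before, which is the zero-day property discussed earlier. Treating high entropy as malicious is a deliberate oversimplification; plenty of benign files are compressed or encrypted.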

The method for measuring the effectiveness of the machine learning malware detection component also varies from organization to organization. Knowing how to select the correct model, features and metrics makes or breaks a machine learning system. 
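
A small worked example shows why the choice of metric matters so much. The numbers below are invented for illustration (1,000 samples, 10 of them malicious) and do not come from any real evaluation; the point is only that accuracy can look excellent while the detector is useless.

```python
def metrics(y_true, y_pred):
    """Compute basic detection metrics; label 1 = malicious, 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # malware caught
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # malware missed
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # benign files flagged
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # benign files passed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical, heavily imbalanced test set: 10 malicious, 990 benign.
y_true = [1] * 10 + [0] * 990

# Detector A never flags anything.
always_benign = [0] * 1000

# Detector B catches 8 of 10 malicious samples and raises 20 false alarms.
detector_b = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

for name, preds in [("A", always_benign), ("B", detector_b)]:
    m = metrics(y_true, preds)
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Detector A scores 99% accuracy while catching nothing, whereas detector B has slightly lower accuracy (97.8%) but 80% recall. In malware detection, recall and false positive rate usually say far more than accuracy alone.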

Consequently, an important part of the Cybersecurity Data Science Learning Path is teaching security professionals not only how to implement machine learning to catch malware, but also why we do it the way we do. That way you’ll be equipped to make your own informed decisions in the field. 


What’s next for machine learning and cybersecurity


Cybersecurity is ever-changing, and as new and evolving methods of attack emerge, infosec professionals are entrusted to observe and combat those trends.

One trend I and other security professionals have been observing is the use of adversarial attacks on machine learning. These are attacks that specifically target the machine learning component of the AV and attempt to trick it into labeling a malicious sample as benign. Such attacks are still an emerging trend as of early 2020, but they are expected to become more important — and are already part of large enterprises' penetration tests.
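
The idea can be sketched against a deliberately naive model. The classifier below is hypothetical: it flags a sample as malicious purely on byte entropy, with an assumed threshold of 7.0 bits per byte. The "attack" simply appends low-entropy padding, leaving the original bytes untouched. Real detectors use many more features and real evasion is correspondingly harder; this only illustrates the principle.

```python
import math
import random
from collections import Counter

ENTROPY_THRESHOLD = 7.0  # hypothetical decision boundary, bits per byte


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())


def toy_classifier(sample: bytes) -> str:
    """Naive single-feature model: high entropy is treated as malicious."""
    return "malicious" if byte_entropy(sample) > ENTROPY_THRESHOLD else "benign"


# Random bytes stand in for a packed sample (synthetic, not real malware).
rng = random.Random(1)
packed = bytes(rng.randrange(256) for _ in range(4096))

# Adversarial modification: append padding that drags the feature down.
padded = packed + b"\x00" * 4096

print(round(byte_entropy(packed), 2), toy_classifier(packed))
print(round(byte_entropy(padded), 2), toy_classifier(padded))
```

The first sample is labeled malicious and the padded one benign, even though the padded file still begins with exactly the same bytes. Defending against this kind of manipulation is one reason feature and model selection matter so much.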

To aid industry-wide preparedness for this new development, I’ve designed a new learning path for Infosec Skills, Machine Learning for Red Team Hackers, to teach security professionals how attacks on machine learning are executed. Some of the topics include:

  • Attacking machine learning systems
  • Constructing malware to evade classifiers
  • Performing adversarial attacks to trick neural networks
  • Attacking commercially available image recognition systems
  • Poisoning, backdoor attacks and model stealing on machine learning systems

You can explore the new courses below.


Machine Learning for Red Team Hackers

Emmanuel Tsukerman

Dr. Tsukerman graduated from Stanford University and UC Berkeley. In 2017, his machine-learning-based anti-ransomware product was named one of PC Magazine's Top 10 Ransomware Products. In 2018, he designed a machine-learning-based malware detection system for Palo Alto Networks' WildFire service (over 30,000 customers).

In 2019, Dr. Tsukerman authored the Machine Learning for Cybersecurity Cookbook and launched the Infosec Skills Cybersecurity Data Science Learning Path.