Malware Detectors Stumble When Evaluated on Different Datasets

A new study shows that malware detection models often fail when faced with types of malware they were not trained on. This gap between benchmark performance and real-world effectiveness poses risks for organizations that rely on these models, and understanding it is crucial for improving endpoint security and adapting to evolving threats.
What Happened
Recent research from the Polytechnic of Porto highlights a critical issue with malware detection models. These models are often trained on specific datasets and evaluated on similar data. However, the malware that enterprises encounter can be vastly different. This discrepancy raises concerns about the effectiveness of static detectors, which are commonly used as a first line of defense against malware attacks.
The European Union Agency for Cybersecurity (ENISA) has identified public administration as the sector most frequently targeted by malware, with ransomware and data intrusions the primary threats. Many of the tools used in these intrusions employ obfuscation techniques to evade detection. The study aimed to test how well current machine-learning-based static detectors perform when faced with malware that does not match their training data.
How the Models Were Tested
The research evaluated detection pipelines using a standardized feature format across six public Windows PE datasets. Two configurations were tested: one used a combination of the EMBER and BODMAS datasets, while the other added ERMDS, a dataset designed to challenge detectors with obfuscated samples. The models were assessed not only on their training data but also on external datasets such as TRITIUM and INFERNO, which contain naturally occurring threat samples and red team malware, respectively.
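The core of this setup is training on one distribution and scoring on another. The sketch below illustrates the idea with a toy centroid-based detector on synthetic feature vectors; the "datasets," features, and detector are all placeholders and bear no relation to the study's actual EMBER/BODMAS pipelines or models.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(n, malware_shift):
    """Synthetic PE-style feature vectors: benign ~ N(0,1), malware shifted."""
    benign = rng.normal(0.0, 1.0, size=(n, 4))
    malware = rng.normal(malware_shift, 1.0, size=(n, 4))
    X = np.vstack([benign, malware])
    y = np.array([0] * n + [1] * n)
    return X, y

def train_centroid_detector(X, y):
    """Toy detector: score = distance to benign centroid minus distance
    to malware centroid, so higher scores mean more malware-like."""
    c_ben = X[y == 0].mean(axis=0)
    c_mal = X[y == 1].mean(axis=0)
    def score(X):
        d_ben = np.linalg.norm(X - c_ben, axis=1)
        d_mal = np.linalg.norm(X - c_mal, axis=1)
        return d_ben - d_mal
    return score

def accuracy(score_fn, X, y):
    return float(((score_fn(X) > 0).astype(int) == y).mean())

# Train on one distribution, then evaluate on a drifted "external" dataset
# whose malware family sits closer to benign in feature space.
X_train, y_train = make_dataset(2000, malware_shift=2.0)
X_ext,   y_ext   = make_dataset(2000, malware_shift=0.7)

score = train_centroid_detector(X_train, y_train)
print(f"in-distribution accuracy: {accuracy(score, X_train, y_train):.2f}")
print(f"external-dataset accuracy: {accuracy(score, X_ext, y_ext):.2f}")
```

Even in this minimal example, the detector looks strong on its own distribution while missing most of the shifted malware, mirroring the kind of gap the study measures.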
The results revealed that while models performed well on their training distribution, their effectiveness dropped significantly when tested on external datasets. This finding is crucial for organizations that rely on these models for endpoint security, as it indicates that their defenses may not be as robust as they appear.
What the Results Showed
When evaluated on their training data, the best-performing models achieved high AUC and F1 scores, indicating strong detection capabilities. However, performance varied drastically on external datasets. For instance, while the models transferred well to TRITIUM, their detection rates dropped considerably on INFERNO and SOREL-20M, the latter being the largest and most temporally diverse dataset. Some configurations even fell to levels where their practical utility was limited.
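For readers unfamiliar with the metrics, AUC measures how well a detector's scores rank malware above benign samples across all thresholds, while F1 balances precision and recall at one fixed threshold. A minimal numpy implementation of both, on made-up scores unrelated to the study:

```python
import numpy as np

def auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def f1(scores, labels, threshold=0.5):
    """F1 = harmonic mean of precision and recall at a fixed cutoff."""
    pred = scores >= threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    return 2 * tp / (2 * tp + fp + fn)

# Toy scores: one malware sample (0.35) is ranked below a benign one (0.4)
scores = np.array([0.9, 0.35, 0.3, 0.6, 0.1, 0.4])
labels = np.array([1,   1,    0,   1,   0,   0])
print(f"AUC: {auc(scores, labels):.2f}, F1: {f1(scores, labels):.2f}")
# → AUC: 0.89, F1: 0.80
```

A detector can keep a respectable AUC while its F1 collapses on a new dataset, because F1 depends on a threshold tuned to the old score distribution.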
The study also highlighted the obfuscation problem. Attempts to improve detection on obfuscated samples within the ERMDS dataset inadvertently reduced the models' generalization capabilities on broader datasets. This suggests that training to recognize one type of malware can create blind spots for others, complicating the detection landscape.
What This Means for Defenders
Static detectors are appealing due to their computational efficiency and quick verdicts. However, this study underscores a significant limitation: the benchmark performance of a detector is only relevant if the test data reflects the actual threat landscape. Organizations must consider the diversity of malware they might encounter, including packed malware and temporally shifted samples, which can degrade model performance.
The researchers plan to extend their evaluation to deep learning architectures, focusing on how training data composition impacts detection capabilities at low false positive rates. For organizations, this means continuously reassessing their malware detection strategies and ensuring they are equipped to handle the evolving threat landscape effectively.
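In practice, endpoint detectors are operated at a low false-positive rate: the alert threshold is calibrated on benign traffic, and detection rate is measured at that fixed operating point. The sketch below shows the idea on synthetic score distributions; the numbers are illustrative assumptions, not results from the study.

```python
import numpy as np

def threshold_at_fpr(benign_scores, target_fpr):
    """Pick the score cutoff so at most target_fpr of benign samples alert."""
    return float(np.quantile(benign_scores, 1.0 - target_fpr))

def detection_rate(malware_scores, threshold):
    return float((malware_scores >= threshold).mean())

rng = np.random.default_rng(1)
benign = rng.normal(0.0, 1.0, 100_000)
malware_seen = rng.normal(3.0, 1.0, 10_000)     # resembles the training distribution
malware_shifted = rng.normal(1.5, 1.0, 10_000)  # drifted / obfuscated family

t = threshold_at_fpr(benign, target_fpr=0.001)  # 0.1% FPR operating point
print(f"threshold: {t:.2f}")
print(f"detection rate (in-distribution): {detection_rate(malware_seen, t):.2f}")
print(f"detection rate (shifted):         {detection_rate(malware_shifted, t):.2f}")
```

The strict threshold needed to keep benign alerts rare is exactly where distribution shift hurts most: scores for drifted malware cluster just below the cutoff, so the detection rate falls far faster than aggregate metrics like AUC would suggest.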