A Comparative Analysis of Machine Learning Techniques for Malware Detection in Downloaded Files
Abstract
The widespread availability of malware presents a significant and growing threat to contemporary societies. The internet hosts numerous types of viruses and other dangerous programs, and research indicates that the prevalence of malware has increased substantially over the past decade, causing substantial financial losses for many organizations. Malware is software that can seriously harm a user's computer system, and it can do so in various ways. To address this issue, the proposed solution employs a range of machine learning methods to determine whether a file downloaded from the internet contains malware. The primary objective is to analyze attributes of the downloaded file, such as the MD5 hash, the size of the Optional Header, and the Load Configuration Size, and to classify the file as malicious or non-malicious on that basis. The models were trained on these features and then evaluated and compared using multiple criteria on the validation and testing datasets to determine the model with the highest accuracy. The methodology allows the identification of several types of malware, including Adware, Backdoors, Trojan, Unknown, Rbot, Multidrop, Spam, and Ransomware. The dataset contained 100,000 samples and 35 features drawn from PE headers and sections.
Correlation analysis was used to select important features, reducing the feature set to only the necessary ones. XGBoost and LightGBM were optimized and compared using an 80/20 split for training and testing. Data visualization tools were used to analyze feature relationships and distributions. Benchmarking showed that XGBoost and LightGBM were the most accurate models for malware detection (99% accuracy). This data-science-based procedure enhances malware detection and shows how machine learning can be used to detect malicious code in executables.
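The correlation-based feature selection and the 80/20 split described above can be illustrated with a minimal, dependency-free sketch. It is not the study's implementation: the abstract does not state the correlation threshold, the correlation target, or the exact column names, so the threshold of 0.1, the use of Pearson correlation against the class label, the label encoding (1 = malicious), the toy values, and the names `select_features`, `split_80_20`, and `UninformativeField` are all assumptions made for illustration. Model fitting with XGBoost or LightGBM would follow on the training portion and is omitted here.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(rows, labels, names, threshold=0.1):
    """Keep features whose absolute correlation with the label meets the threshold.

    The threshold value is an assumption; the abstract does not give one.
    """
    selected = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        if abs(pearson(column, labels)) >= threshold:
            selected.append(name)
    return selected


def split_80_20(rows, labels, seed=0):
    """Shuffle sample indices and hold out 20% of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * 0.2)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([rows[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([rows[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# Toy data (made up for illustration): 5 benign (0) and 5 malicious (1) samples.
names = ["SizeOfOptionalHeader", "LoadConfigurationSize", "UninformativeField"]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
rows = [
    [224, 0, 0], [224, 0, 1], [224, 72, 0], [224, 0, 1], [224, 0, 0],
    [240, 72, 0], [240, 72, 1], [240, 0, 0], [240, 72, 1], [240, 72, 0],
]

selected = select_features(rows, labels, names)
(train_rows, train_labels), (test_rows, test_labels) = split_80_20(rows, labels)

print("selected features:", selected)
print("train size:", len(train_rows), "test size:", len(test_rows))
```

In this toy data the third column has no linear relationship with the label, so it is the one dropped. On a real dataset the threshold would need tuning, and highly inter-correlated features might also be pruned against each other.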
How to Cite This Article
Ghaith Mousa Hamzah Amlak (2025). A Comparative Analysis of Machine Learning Techniques for Malware Detection in Downloaded Files. Journal of Frontiers in Multidisciplinary Research (JFMR), 6(2), 457-463.