In the most recent GCI, published in 2020, Honduras
ranked last in the Americas region and 178th out of 182
countries, with a score of 2.2 out of 100 (International
Communication Union, 2020). This suggests that most
internet users in the country are unaware of the risks
posed by the ever-increasing volume of malicious software.
Malware is defined as malicious software that is
intentionally placed or inserted into a system to cause harm
(Stallings, 2006), and it has been known for some time as
one of the strongest threats on the internet. State-of-the-
art software for virus detection and prevention (antivirus)
has been quite successful.
However, antivirus providers face problems due to the
large number of variations of malware that are produced
daily. Citing the 2018 McAfee Labs threats report, Ciampa
(2021) notes that more than 20 million new malware samples
are released every month and that the total malware in
existence is approaching 900 million instances. In 2019,
four out of every five organizations experienced at least
one successful cyberattack, and over one-third suffered
six or more successful attacks (Cyberedge Group, 2020).
Organizations responsible for dealing with this type
of threat increasingly require better techniques for the
automated classification of malware samples.
Malware classification is a process that was traditionally
done manually (Tian et al., 2009; Gheorghescu, 2005).
This became inefficient over time because of the large
number of malware samples emerging daily as a product of
the polymorphism, metamorphism, and obfuscation
techniques used in modern malware. This inefficiency
creates the need to automate and standardize the process.
Malicious network traffic samples should be identified
with the least possible margin of error by this automated
process.
One of the most popular approaches for malware
classification is based on content. This approach checks
the content of files and compares it with signatures from a
database, looking for matches with previously identified
malware samples. Some research works (Tian et al., 2009)
concentrate on Malicious Executable Classification
Systems (MECS) that distinguish between benign and
malicious executables. However, this approach cannot
recognize new variants of already known families without
an existing sample of them.
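The limitation of content-based matching can be illustrated with a
minimal sketch. It assumes, purely for illustration, that a signature is
a SHA-256 digest of the whole file; the sample payloads, family names,
and the function classify_by_signature are hypothetical and do not come
from the cited works.

```python
import hashlib
from typing import Optional

# Hypothetical "previously identified" samples and their family labels.
# Payloads and family names are invented for illustration only.
KNOWN_SAMPLES = {
    b"fake-malware-payload-v1": "FamilyA",
    b"another-fake-payload": "FamilyB",
}

# Signature database: SHA-256 digest of the content -> family label.
SIGNATURE_DB = {
    hashlib.sha256(content).hexdigest(): family
    for content, family in KNOWN_SAMPLES.items()
}


def classify_by_signature(content: bytes) -> Optional[str]:
    """Return the family label on an exact signature match, else None."""
    digest = hashlib.sha256(content).hexdigest()
    return SIGNATURE_DB.get(digest)


# An exact copy of a known sample is recognized.
print(classify_by_signature(b"fake-malware-payload-v1"))
# A slightly modified variant produces a different digest and is missed.
print(classify_by_signature(b"fake-malware-payload-v2"))
```

Under this assumption, any change to the file content, however small,
yields a different digest, which is why a new variant goes undetected
until a sample of it is added to the database.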
Another approach for malware classification is based
on behavior, and it is subdivided into two types: one based
on the Central Processing Unit (CPU) and one based on data
network traffic. The first analyzes and monitors the
behavior of programs on the computer. The second
analyzes and monitors incoming and outgoing data
packets, connections to hosts, and other network
activity. Even though monitoring and processing system
calls can be a resource-intensive task (Nari & Ghorbani,
2013), most works using CPU-based classification rely on
system calls to abstract and represent malware behavior.
Nari & Ghorbani (2013) proposed the behavior-based approach
via data network traffic under the assumption that when a
new variant of malware emerges, it will show similar
behavior to its predecessor regardless of the obfuscation,
polymorphism, or metamorphism used to create it. Today
we can find numerous investigations (Hock & Kortis,
2015; Chockwanich & Visoottiviseth, 2019; Jabez &
Muthukumar Dr., 2015; Yin et al., 2017) that treat the
behavior of malware in data network traffic as an
essential component.
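The assumption that a new variant behaves like its predecessor on the
network can be illustrated with a small sketch. It assumes, for
illustration only, that a behavior profile is the set of (protocol,
destination port) pairs observed for a sample, and it uses the Jaccard
index as a stand-in similarity measure. Neither choice is taken from the
cited works, and all profiles below are invented.

```python
from typing import Set, Tuple

Behavior = Tuple[str, int]  # (protocol, destination port)


def jaccard(a: Set[Behavior], b: Set[Behavior]) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # two empty profiles are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical behavior profiles, invented for illustration.
known_family = {("tcp", 80), ("tcp", 443), ("udp", 53), ("tcp", 6667)}
new_variant = {("tcp", 80), ("udp", 53), ("tcp", 6667)}
benign_app = {("tcp", 443), ("udp", 123)}

print(jaccard(known_family, new_variant))  # 3 shared of 4 total -> 0.75
print(jaccard(known_family, benign_app))   # 1 shared of 5 total -> 0.2
```

In this toy setting the variant remains close to its family even though
its file content may have changed completely, which is the intuition
behind classifying by network behavior rather than by content.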
Identifying malware by its network traffic is much the
same as network-based intrusion detection. An intrusion
detection/prevention system (IDS/IPS) is a security tool
that can detect malicious activity and take preventive
measures to protect both the host and the network against
potential threats that would normally pass through a
traditional firewall
(Ambati & Vidyarthi, 2013; Kolokotronis & Shiales,
2021). IDS/IPS are divided into two categories: Host
Intrusion Detection/Prevention System (HIDS/HIPS) and
Network Intrusion Detection/Prevention System
(NIDS/NIPS). HIDS/HIPS are host-based IDS/IPS used to
analyze and monitor activity on a particular machine.
NIDS/NIPS detect and prevent
intrusion threats by continuously monitoring data network
traffic, looking for malicious and unauthorized entries that
attempt to harm the basic security of the data network.
These systems take automatic action to stop the intrusion
by sending alerts to the administrator, dropping or
blocking malicious traffic from the source address, or
terminating the connection (Kolokotronis & Shiales,
2021).
Shipulin (2018) explains the technology behind
NIDS/NIPS. These systems work at layer 4 of the OSI
(Open Systems Interconnection) model (Purser, 2004),
that is, with transport layer protocols such as TCP
(Transmission Control Protocol), UDP (User Datagram
Protocol), and others. The goal is to identify malicious
packets in data network traffic that represent attack
attempts. Incoming traffic is separated by protocol,
then decoded, decompressed, normalized, and finally
compared with a set of signatures.
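The pipeline described above can be sketched roughly as follows. This
is a simplified illustration and not the design of any particular
NIDS/NIPS: the per-protocol signature lists, the choice of gzip and
percent-decoding as the decompression and decoding steps, and the
function names normalize and inspect are all assumptions made for the
example.

```python
import gzip
import zlib
from typing import List
from urllib.parse import unquote_to_bytes

# Hypothetical byte-pattern signatures, grouped by protocol.
# Patterns are lowercase because payloads are lowercased when normalized.
SIGNATURES = {
    "tcp": [b"cmd.exe", b"/etc/passwd"],
    "udp": [b"evil-beacon"],
}


def normalize(payload: bytes) -> bytes:
    """Decompress, decode, and case-fold a payload before matching."""
    if payload[:2] == b"\x1f\x8b":  # gzip magic bytes
        try:
            payload = gzip.decompress(payload)
        except (OSError, EOFError, zlib.error):
            pass  # not valid gzip after all; inspect the raw bytes
    payload = unquote_to_bytes(payload)  # decode %XX escapes
    return payload.lower()


def inspect(protocol: str, payload: bytes) -> List[bytes]:
    """Return the signatures for this protocol found in the payload."""
    data = normalize(payload)
    return [sig for sig in SIGNATURES.get(protocol, []) if sig in data]


# An encoded, mixed-case request still matches after normalization.
print(inspect("tcp", b"GET /%2E%2E/%2E%2E/etc/PASSWD HTTP/1.1"))
# A compressed payload is decompressed before comparison.
print(inspect("tcp", gzip.compress(b"run CMD.EXE now")))
```

Normalizing before comparison matters because the same malicious
content can be encoded or compressed in many ways, while the signature
set only has to describe one canonical form.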
This research work is based on the premise that any
new variant of malware behaves similarly to its
predecessor (Tian et al., 2009), together with the fact that
most malware communicates with external hosts (Nari &
Ghorbani, 2013). The proposed model bases its operation
on the behavior-based approach at the data network level
(Nari & Ghorbani, 2013). This model parses files
containing frames and packets captured from the network,
known as Packet Capture (PCAP) files.
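As an illustration of what parsing such a file involves, the sketch
below reads the classic libpcap layout: a 24-byte global header followed
by records, each with a 16-byte header (timestamp seconds, timestamp
microseconds, captured length, original length) and the captured frame.
The function name read_pcap is hypothetical, the sketch does not handle
the pcapng or nanosecond-resolution variants, and it is not the parser
used by the proposed model.

```python
import struct
from typing import Iterator, Tuple

GLOBAL_HEADER_SIZE = 24  # magic, version, timezone, sigfigs, snaplen, linktype


def read_pcap(data: bytes) -> Iterator[Tuple[float, bytes]]:
    """Yield (timestamp, frame bytes) for each record in classic PCAP data."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written in little-endian byte order
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written in big-endian byte order
    else:
        raise ValueError("not a classic microsecond-resolution PCAP file")

    record_header = struct.Struct(endian + "IIII")
    offset = GLOBAL_HEADER_SIZE
    while offset + record_header.size <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = record_header.unpack_from(
            data, offset
        )
        offset += record_header.size
        frame = data[offset:offset + incl_len]
        if len(frame) < incl_len:
            break  # truncated final record
        offset += incl_len
        yield ts_sec + ts_usec / 1_000_000, frame


# Build a tiny in-memory capture with one 4-byte frame for demonstration.
header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
record = struct.pack("<IIII", 10, 500000, 4, 4) + b"\x01\x02\x03\x04"
for timestamp, frame in read_pcap(header + record):
    print(timestamp, frame.hex())
```

Each yielded frame would still need its link, network, and transport
layer headers decoded before any behavioral features can be extracted;
in practice a dedicated packet-parsing library would likely handle both
steps.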
This model employs two methods for classifying
malware-generated traffic samples: (a) using traditional
machine learning algorithms such as K-Nearest
Neighbors (K-NN) and Support Vector Machines (SVM)