Data Mining Methods for Detection of New Malicious Executables

Matthew G. Schultz and Eleazar Eskin
Department of Computer Science
Columbia University
{mgs,eeskin}@cs.columbia.edu

Erez Zadok
Department of Computer Science
State University of New York at Stony Brook
ezk@cs.sunysb.edu

Salvatore J. Stolfo
Department of Computer Science
Columbia University
sal@cs.columbia.edu
Abstract

A serious security threat today is malicious executables, especially new, unseen malicious executables often arriving as email attachments. These new malicious executables are created at the rate of thousands every year and pose a serious security threat. Current anti-virus systems attempt to detect these new malicious programs with heuristics generated by hand. This approach is costly and oftentimes ineffective. In this paper, we present a data-mining framework that detects new, previously unseen malicious executables accurately and automatically. The data-mining framework automatically found patterns in our data set and used these patterns to detect a set of new malicious binaries. Comparing our detection methods with a traditional signature-based method, our method more than doubles the current detection rates for new malicious executables.

1 Introduction

A malicious executable is defined to be a program that performs a malicious function, such as compromising a system's security, damaging a system, or obtaining sensitive information without the user's permission. Using data mining methods, our goal is to automatically design and build a scanner that accurately detects malicious executables before they have been given a chance to run.

Data mining methods detect patterns in large amounts of data, such as byte code, and use these patterns to detect future instances in similar data. Our framework uses classifiers to detect new malicious executables. A classifier is a rule set, or detection model, generated by the data mining algorithm that was trained over a given set of training data.

One of the primary problems faced by the virus community is to devise methods for detecting new malicious programs that have not yet been analyzed [26]. Eight to ten malicious programs are created every day and most cannot be accurately detected until signatures have been generated for them [27]. During this time period, systems protected by signature-based algorithms are vulnerable to attacks.

Malicious executables are also used as attacks for many types of intrusions. In the DARPA 1999 intrusion detection evaluation, many of the attacks on the Windows platform were caused by malicious programs [19]. Recently, a malicious piece of code created a hole in Microsoft's internal network [23]. That attack was initiated by a malicious executable that opened a back door into Microsoft's internal network, resulting in the theft of Microsoft's source code.

Current virus scanner technology has two parts: a signature-based detector and a heuristic classifier that detects new viruses [8]. The classic signature-based detection algorithm relies on signatures (unique telltale strings) of known malicious executables to generate detection models. Signature-based methods create a unique tag for each malicious program so that future examples of it can be correctly classified with a small error rate. These methods do not generalize well to detect new malicious binaries because they are created to give a false positive rate as close to zero as possible. Whenever a detection method generalizes to new instances, the tradeoff is a higher false positive rate. Heuristic classifiers are generated by a group of virus experts to detect new malicious programs. This kind of analysis can be time-consuming and often still fails to detect new malicious executables.

We designed a framework that used data mining algorithms to train multiple classifiers on a set of malicious and benign executables to detect new examples. The binaries were first statically analyzed to extract properties of the binary, and then the classifiers were trained over a subset of the data.

Our goal in the evaluation of this method was to simulate the task of detecting new malicious executables. To do this we separated our data into two sets, a training set and a test set, with standard cross-validation methodology. The training set was used by the data mining algorithms to generate classifiers to classify previously unseen binaries as malicious or benign. A test set is a subset of the dataset containing no examples that were seen during the training of an algorithm. This subset was used to test an algorithm's performance over similar, unseen data and its performance over new malicious executables. Both the test and training data were malicious executables gathered from public sources.

We implemented a traditional signature-based algorithm to compare with the data mining algorithms over new examples. Using standard statistical cross-validation techniques, our data mining-based method had a detection rate of 97.76%, more than double the detection rate of a signature-based scanner over a set of new malicious executables. Comparing our method to industry heuristics cannot be done at this time because the methods for generating these heuristics are not published and there is no equivalent or statistically comparable dataset to which both techniques are applied. However, the framework we provide is fully automatic and could assist experts in generating the heuristics.
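The train/test protocol described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the fold count and the way folds are drawn here are arbitrary choices for demonstration.

```python
import random

def kfold_splits(n_examples, k=5, seed=0):
    """Partition example indices into k disjoint folds and yield
    (train_indices, test_indices) pairs, so that every example is
    tested exactly once by a model that never saw it in training."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test

# Toy demonstration: 10 labeled binaries, 5 folds.
splits = list(kfold_splits(10, k=5))
assert len(splits) == 5
for train, test in splits:
    assert not set(train) & set(test)               # never train on test data
    assert sorted(train + test) == list(range(10))  # folds cover the data set
```

The point of the split is exactly the one made in the text: reported accuracy is measured only on binaries withheld from training, simulating detection of genuinely new executables.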
2 Background

Detecting malicious executables is not a new problem in security. Early methods used signatures to detect malicious programs. These signatures were composed of many different properties: filename, text strings, or byte code. Research also centered on protecting the system from the security holes that these malicious programs created.

Experts were typically employed to analyze suspicious programs by hand. Using their expertise, signatures were found that made a malicious executable example different from other malicious executables or benign programs. One example of this type of analysis was performed by Spafford [24], who analyzed the Internet Worm and provided detailed notes on its spread over the Internet, the unique signatures in the worm's code, the method of the worm's attack, and a comprehensive description of system failure points.

Although accurate, this method of analysis is expensive and slow. If only a small set of malicious executables will ever circulate then this method will work very well, but the Wildlist [22] is always changing and expanding. The Wildlist is a list of malicious programs that are currently estimated to be circulating at any given time.

Current approaches to detecting malicious programs match them to a set of known malicious programs. The anti-virus community relies heavily on known byte-code signatures to detect malicious programs. More recently, these byte sequences were determined by automatically examining known malicious binaries with probabilistic methods.

At IBM, Kephart and Arnold [9] developed a statistical method for automatically extracting malicious executable signatures. Their research was based on speech recognition algorithms and was shown to perform almost as well as a human expert at detecting known malicious executables. Their algorithm was eventually packaged with IBM's anti-virus software. Lo et al. [15] presented a method for filtering malicious code based on "tell-tale signs" for detecting malicious code. These were manually engineered based on observing the characteristics of malicious code. Similarly, filters for detecting properties of malicious executables have been proposed for UNIX systems [10], as well as semi-automatic methods for detecting malicious code [4].

Unfortunately, a new malicious program may not contain any known signatures, so traditional signature-based methods may not detect a new malicious executable. In an attempt to solve this problem, the anti-virus industry generates heuristic classifiers by hand [8]. This process can be even more costly than generating signatures, so finding an automatic method to generate classifiers has been the subject of research in the anti-virus community. To solve this problem, IBM researchers applied Artificial Neural Networks (ANNs) to the problem of detecting boot sector malicious binaries [25]. An ANN is a classifier that models the neural networks explored in human cognition. Because of the limitations of the implementation of their classifier, they were unable to analyze anything other than small boot sector viruses, which comprise about 5% of all malicious binaries.

Using an ANN classifier with all bytes from the boot sector malicious executables as input, IBM researchers were able to identify 80-85% of unknown boot sector malicious executables successfully with a low false positive rate (< 1%). They were unable to find a way to apply ANNs to the other 95% of computer malicious binaries.

In similar work, Arnold and Tesauro [1] applied the same techniques to Win32 binaries, but because of limitations of the ANN classifier they were unable to obtain comparable accuracy over new Win32 binaries.

Our method is different because we analyzed the entire set of malicious executables instead of only boot-sector viruses or only Win32 binaries.

Our technique is similar to data mining techniques that have already been applied to Intrusion Detection Systems by Lee et al. [13, 14]. Their methods were applied to system calls and network data to learn how to detect new intrusions. They reported good detection rates as a result of applying data mining to the problem of IDS. We applied a similar framework to the problem of detecting new malicious executables.

3 Methodology

The goal of this work was to explore a number of standard data mining techniques to compute accurate detectors for new (unseen) binaries. We gathered a large set of programs from public sources and separated the problem into two classes: malicious and benign executables. Every example in our data set is a Windows or MS-DOS format executable, although the framework we present is applicable to other formats. To standardize our data set, we used an updated McAfee [16] virus scanner and labeled our programs as either malicious or benign executables. Since the
virus scanner was updated and the viruses were obtained from public sources, we assume that the virus scanner has a signature for each malicious virus.

We split the dataset into two subsets: the training set and the test set. The data mining algorithms used the training set while generating the rule sets. We used a test set to check the accuracy of the classifiers over unseen examples.

Next, we automatically extracted a binary profile from each example in our dataset, and from the binary profiles we extracted features to use with classifiers. In a data mining framework, features are properties extracted from each example in the data set, such as byte sequences, that a classifier can use to generate detection models. Using different features, we trained a set of data mining classifiers to distinguish between benign and malicious programs. It should be noted that the features extracted were static properties of the binary and did not require executing the binary.

The framework supports different methods for feature extraction and different data mining classifiers. We used system resource information, strings, and byte sequences that were extracted from the malicious executables in the data set as different types of features. We also used three learning algorithms:

- an inductive rule-based learner that generates boolean rules based on feature attributes;
- a probabilistic method that generates probabilities that an example was in a class given a set of features;
- a multi-classifier system that combines the outputs from several classifiers to generate a prediction.

To compare the data mining methods with a traditional signature-based method, we designed an automatic signature generator. Since the virus scanner that we used to label the data set had signatures for every malicious example in our data set, it was necessary to implement a similar signature-based method to compare with the data mining algorithms. There was no way to use an off-the-shelf virus scanner and simulate the detection of new malicious executables because these commercial scanners contained signatures for all the malicious executables in our data set. Like the data mining algorithms, the signature-based algorithm was only allowed to generate signatures over the set of training data. This allowed our data mining framework to be fairly compared to traditional scanners over new data.

To quantitatively express the performance of our method we show tables with the counts for true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). A true positive is a malicious example that is correctly tagged as malicious, and a true negative is a benign example that is correctly classified. A false positive is a benign program that has been mislabeled by an algorithm as a malicious program, while a false negative is a malicious executable that has been misclassified as a benign program.

To evaluate the performance, we compute the false positive rate and the detection rate. The false positive rate is the number of benign examples that are mislabeled as malicious divided by the total number of benign examples. The detection rate is the number of malicious examples that are caught divided by the total number of malicious examples.

3.1 Dataset Description

Our data set consisted of a total of 4,266 programs split into 3,265 malicious binaries and 1,001 clean programs. There were no duplicate programs in the data set, and every example in the set is labeled either malicious or benign by the commercial virus scanner.

The malicious executables were downloaded from various FTP sites and were labeled by a commercial virus scanner with the correct class label (malicious or benign) for our method. 5% of the data set was composed of Trojans and the other 95% consisted of viruses. Most of the clean programs were gathered from a freshly installed Windows 98 machine running MS Office 97, while others are small executables downloaded from the Internet. The entire data set is available from our Web site http://www.cs.columbia.edu/ids/mef/software/.

We also examined a subset of the data that was in Portable Executable (PE) [17] format. The data set consisting of PE format executables was composed of 206 benign programs and 38 malicious executables.

After verification of the data set, the next step of our method was to extract features from the programs.

4 Feature Extraction

In this section we detail all of our choices of features. We statically extracted different features that represented different information contained within each binary. These features were then used by the algorithms to generate detection models.

We first examine only the subset of PE executables, using libBFD. Then we used more general methods to extract features from all types of binaries.
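The counts and rates used throughout the evaluation (TP, TN, FP, FN, detection rate, false positive rate) reduce to simple tallies. A sketch, assuming examples are labeled with the strings "malicious" and "benign":

```python
def evaluate(predicted, actual):
    """Tally confusion counts for the two-class malicious/benign task
    and derive the two rates defined in the methodology."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == "malicious" and a == "malicious" for p, a in pairs)
    tn = sum(p == "benign" and a == "benign" for p, a in pairs)
    fp = sum(p == "malicious" and a == "benign" for p, a in pairs)
    fn = sum(p == "benign" and a == "malicious" for p, a in pairs)
    detection_rate = tp / (tp + fn)       # caught malicious / all malicious
    false_positive_rate = fp / (fp + tn)  # mislabeled benign / all benign
    return tp, tn, fp, fn, detection_rate, false_positive_rate

actual    = ["malicious"] * 4 + ["benign"] * 4
predicted = ["malicious", "malicious", "malicious", "benign",
             "benign", "benign", "benign", "malicious"]
tp, tn, fp, fn, dr, fpr = evaluate(predicted, actual)
assert (tp, tn, fp, fn) == (3, 3, 1, 1)
assert dr == 0.75 and fpr == 0.25
```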
4.1 LibBFD
Our first intuition into the problem was to extract information from the binary that would dictate its behavior. The problem of predicting a program's behavior can be reduced to the halting problem and hence is undecidable [2]. Perfectly predicting a program's behavior is unattainable, but estimating what a program can or cannot do is possible. For instance, if a Windows executable does not call the User Interface Dynamically Linked Library (USER32.DLL), then we could assume that the program does not have the standard Windows user interface. This is of course an oversimplification of the problem, because the author of that example could have written or linked to another user interface library, but it did provide us with some insight into an appropriate feature set.

To extract resource information from Windows executables we used GNU's Bin-Utils [5]. GNU's Bin-Utils suite of tools can analyze PE binaries within Windows. In PE, or Common Object File Format (COFF), program headers are composed of a COFF header, an Optional header, an MS-DOS stub, and a file signature. From the PE header we used libBFD, a library within Bin-Utils, to extract information in object format. Object format for a PE binary gives the file size, the names of DLLs, and the names of function calls within those DLLs and Relocation Tables. From the object format, we extracted a set of features to compose a feature vector for each binary.

To understand how resources affected a binary's behavior we performed our experiments using three types of features:

1. The list of DLLs used by the binary
2. The list of DLL function calls made by the binary
3. The number of different function calls within each DLL

The first approach to binary profiling (shown in Figure 1) used the DLLs loaded by the binary as features. The feature vector was composed of 30 boolean values representing whether or not a binary used a DLL. Typically, not every DLL was used in all of the binaries, but a majority of the binaries called the same resource. For example, almost every binary called GDI32.DLL, which is the Windows NT Graphics Device Interface and is a core component of WinNT.

¬advapi32 ∧ avicap32 ∧ … ∧ winmm ∧ ¬wsock32

Figure 1: First Feature Vector: A conjunction of DLL names

The example vector given in Figure 1 is composed of at least two unused resources: ADVAPI32.DLL, the Advanced Windows API, and WSOCK32.DLL, the Windows Sockets API. It also uses at least two resources: AVICAP32.DLL, the AVI capture API, and WINMM.DLL, the Windows Multimedia API.

The second approach to binary profiling (seen in Figure 2) used DLLs and their function calls as features. This approach was similar to the first, but with added function call information. The feature vector was composed of 2,229 boolean values. Because some of the DLLs had the same function names, it was important to record which DLL the function came from.

Figure 2: Second Feature Vector: A conjunction of DLLs and the functions called inside each DLL

The example vector given in Figure 2 is composed of at least four resources. Two functions were called in ADVAPI32.DLL: AdjustTokenPrivileges() and GetFileSecurityA(), and two functions in WSOCK32.DLL: recv() and send().

The third approach to binary profiling (seen in Figure 3) counted the number of different function calls used within each DLL. The feature vector included 30 integer values. This profile gives a rough measure of how heavily a DLL is used within a specific binary. Intuitively, in the resource models we have been exploring, this is a macro-resource usage model because the number of calls to each resource is counted instead of detailing referenced functions. For example, if a program only called the recv() and send() functions of WSOCK32.DLL, then the count would be 2. It should be noted that we do not count the number of times those functions might have been called.

Figure 3: Third Feature Vector: A conjunction of DLLs and a count of the number of functions called inside each DLL

The example vector given in Figure 3 describes an example that calls two functions in ADVAPI32.DLL, ten functions in AVICAP32.DLL, eight functions in WINMM.DLL, and two functions from WSOCK32.DLL.

All of the information about the binary was obtained from the program header. In addition, the information was obtained without executing the unknown program, but by examining the static properties of the binary using libBFD.

Since we could not analyze the entire dataset with libBFD, we found another method for extracting features that works over the entire dataset. We describe that method next.
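The three binary-profiling feature types can be made concrete with a short sketch. The import map, the DLL vocabulary, and the (DLL, function) vocabulary below are illustrative stand-ins; the paper derived the real 30-DLL and 2,229-entry vocabularies from its data set via libBFD.

```python
# Hypothetical vocabularies standing in for the ones derived from
# the training set (30 DLLs; 2,229 DLL-function pairs).
DLLS = ["advapi32", "avicap32", "winmm", "wsock32"]
DLL_FUNCS = [("advapi32", "adjusttokenprivileges"),
             ("advapi32", "getfilesecuritya"),
             ("wsock32", "recv"),
             ("wsock32", "send")]

def profile(imports):
    """imports maps a DLL name to the set of functions a binary calls
    in it.  Returns the three feature vectors described in the text:
    (1) one boolean per DLL, (2) one boolean per (DLL, function) pair,
    (3) one integer per DLL counting distinct functions used."""
    used_dlls = [dll in imports for dll in DLLS]
    used_funcs = [dll in imports and fn in imports[dll]
                  for dll, fn in DLL_FUNCS]
    func_counts = [len(imports.get(dll, ())) for dll in DLLS]
    return used_dlls, used_funcs, func_counts

# A binary that calls only recv() and send() in WSOCK32.DLL.
dlls, funcs, counts = profile({"wsock32": {"recv", "send"}})
assert dlls == [False, False, False, True]
assert funcs == [False, False, True, True]
assert counts == [0, 0, 0, 2]   # distinct functions, not call sites
```

Note that the third vector counts distinct functions referenced, not call sites, matching the remark that the number of times a function is called is not counted.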
4.2 GNU Strings
During the analysis of our libBFD method we noticed
that headers in PE-format were in plain text. This meant
that we could extract the same information from the PE-
executables by just extracting the plain text headers. We
also noticed that non-PE executables also have strings encoded in them. We theorized that we could use this information to classify the full 4,266 item data set instead of the small libBFD data set.

To extract features from the first data set of 4,266 programs we used the GNU strings program. The strings program extracts consecutive printable characters from any file. Typically there are many printable strings in binary files. Some common strings found in our dataset are illustrated in Table 1.

kernel        microsoft        windows      getmodulehandlea
getversion    getstartupinfoa  win          getmodulefilenamea
messageboxa   closehandle      null         dispatchmessagea
library       getprocaddress   advapi       getlasterror
loadlibrarya  exitprocess      heap         getcommandlinea
reloc         createfilea      writefile    setfilepointer
application   showwindow       time         regclosekey

Table 1: Common strings extracted from binaries using GNU strings

Through testing we found that there were similar strings in malicious executables that distinguished them from clean programs, and similar strings in benign programs that distinguished them from malicious executables. Each string in the binary was used as a feature. In the data mining step, we discuss how a frequency analysis was performed over all the byte sequences found in our data set.

The strings contained in a binary may consist of reused code fragments, author signatures, file names, system resource information, etc. This method of detecting malicious executables is already used by the anti-malicious-executable community to create signatures for malicious executables.

Extracted strings from an executable are not very robust as features because they can be changed easily, so we analyzed another feature, byte sequences.

4.3 Byte Sequences Using Hexdump

Byte sequences are the last set of features that we used over the entire 4,266 member data set. We used hexdump [18], a tool that transforms binary files into hexadecimal files. The byte sequence feature is the most informative because it represents the machine code in an executable instead of resource information like libBFD features. Secondly, analyzing the entire binary gives more information for non-PE format executables than the strings method.

After we generated the hexdumps we had features as displayed in Figure 4, where each line represents a short sequence of machine code instructions.

1f0e 0eba b400 cd09 b821 4c01 21cd 6854
7369 7020 6f72 7267 6d61 7220 7165 6975
6572 2073 694d 7263 736f 666f 2074 6957
646e 776f 2e73 0a0d 0024 0000 0000 0000
454e 3c05 026c 0009 0000 0000 0302 0004
0400 2800 3924 0001 0000 0004 0004 0006
000c 0040 0060 021e 0238 0244 02f5 0000
0001 0004 0000 0802 0032 1304 0000 030a

Figure 4: Example Hexdump

We again assumed that there were similar instructions in malicious executables that differentiated them from benign programs, and that the class of benign programs had similar byte code that differentiated them from the malicious executables. Also like the string features, each byte sequence in a binary is used as a feature.

5 Algorithms

In this section we describe all the algorithms presented in this paper as well as the signature-based method used for comparison. We used three different data mining algorithms to generate classifiers with different features: RIPPER, Naive Bayes, and a Multi-Classifier system. We describe the signature-based method first.

5.1 Signature Methods

We examine signature-based methods to compare our results to traditional anti-virus methods. Signature-based detection methods are the most commonly used algorithms in industry [27]. These signatures are picked to differentiate one malicious executable from another, and from benign programs. These signatures are generated by an expert in the field or by an automatic method. Typically, a signature is picked to illustrate the distinct properties of a specific malicious executable.

We implemented a signature-based scanner that follows a simple algorithm for signature generation. First, we calculated the byte sequences that were only found in the malicious executable class. These byte sequences were then concatenated together to make a unique signature for each malicious executable example. Thus, each malicious executable signature contained only byte sequences found in the malicious executable class. To make the signature unique, the byte sequences found in each example were concatenated together to form one signature. This was done because a byte sequence that is only found in one class during training could possibly be found in the other class during testing [9], and lead to false positives in testing.

The method described above for the commercial scanner was never intended to detect unknown malicious binaries, but the data mining algorithms that follow were built to detect new malicious executables.

5.2 RIPPER

The next algorithm we used, RIPPER [3], is an inductive rule learner. This algorithm generated a detection model composed of resource rules that was built to detect future