Data mining is the process of identifying patterns in large datasets. Data mining techniques are heavily used in scientific research (in order to process large amounts of raw scientific data) as well as in business, mostly to gather statistics and valuable information to enhance customer relations and marketing strategies.
Data mining has also proven a useful tool in cyber security solutions for discovering vulnerabilities and gathering indicators for baselining.
In this article, we take a closer look at the role of data mining in information security and the malware detection process.
What is data mining? In general, it is a process that involves analyzing information, predicting future trends, and making proactive, knowledge-based decisions based on large datasets.
While the term data mining is usually treated as a synonym for Knowledge Discovery in Databases (KDD), it’s actually just one of the steps in this process. The main goal of KDD is to obtain useful and often previously unknown information from large sets of data.
The entire KDD process includes four steps:
- Pre-processing – selecting, cleaning, and integrating data
- Transformation – transforming information and consolidating it into forms appropriate for mining
- Mining – collecting, extracting, analyzing, and statistically processing data
- Pattern evaluation – identifying new and unusual patterns and presenting the knowledge gained from data mining
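The four steps can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the log format (`"user action"` per line), the function names, and the "seen at most once" rule for unusual patterns are all hypothetical and not part of any KDD standard.

```python
from collections import Counter

def preprocess(records):
    """Select and clean: drop empty entries and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for rec in records:
        rec = rec.strip()
        if rec and rec not in seen:
            seen.add(rec)
            cleaned.append(rec)
    return cleaned

def transform(records):
    """Consolidate each 'user action' line into a (user, action) tuple."""
    return [tuple(rec.lower().split(maxsplit=1)) for rec in records]

def mine(pairs):
    """Count how often each action occurs across the cleaned records."""
    return Counter(action for _user, action in pairs)

def evaluate(counts, max_count=1):
    """Report actions rare enough to be treated as unusual patterns."""
    return sorted(action for action, n in counts.items() if n <= max_count)

# Hypothetical raw log with a duplicate line and an empty line.
raw_log = [
    "alice login", "bob login", "alice login", "",
    "carol login", "bob export_all_records",
]
unusual = evaluate(mine(transform(preprocess(raw_log))))
print(unusual)  # ['export_all_records']
```

Each stage only consumes the output of the previous one, so any step can be swapped for a more capable implementation without touching the others.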
Data mining helps you find new interesting patterns, extract hidden (yet useful and valuable) information, and identify unusual records and dependencies from large databases. To obtain valuable knowledge, data mining uses methods from statistics, machine learning, artificial intelligence (AI), and database systems.
In recent years, many IT industry giants such as Comodo, Symantec, and Microsoft have started using data mining techniques for malware detection.
Many methods are used for mining big data, but the following eight are the most common:
- Association rules help find possible relations between variables in databases, discover hidden patterns, and identify variables and the frequencies of their occurrence.
- Classification breaks a large dataset into predefined classes or groups.
- Clustering helps identify data items that have similar characteristics and understand similarities and differences among data.
- The decision tree technique creates classification and regression models in the form of a tree structure.
- The neural network technique is used to model complex relationships between inputs and outputs and to discover new patterns.
- Regression analysis is used for predicting the value of one item based on the known value of other items in a dataset by building a model of the relationship between dependent and independent variables.
- Statistical techniques help find patterns and build predictive models.
- Visualization helps discover new patterns and presents the results in a way that is comprehensible for users.
You can apply one or several data mining methods to create an efficient model that will ensure successful detection of attacks.
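As one concrete case, association rules are usually scored by support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both for a toy set of sessions; the event names and data are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A).

    Assumes the antecedent occurs at least once; otherwise this divides by zero.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical per-session event sets.
sessions = [
    {"failed_login", "password_reset", "new_device"},
    {"failed_login", "new_device"},
    {"failed_login", "password_reset"},
    {"login"},
]

print(support(sessions, {"failed_login", "new_device"}))  # 0.5
print(confidence(sessions, {"failed_login"}, {"new_device"}))
```

A rule such as "failed_login implies new_device" is typically kept only if both scores exceed thresholds chosen for the dataset at hand.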
Next, we’ll take a closer look at how you can use data mining for cyber security solutions.
Data mining is one of four methods used today for detecting malware. The other three are scanning, activity monitoring, and integrity checking.
When building a security app, developers use data mining methods to improve the speed and quality of malware detection as well as to increase the number of detected zero-day attacks.
Malware detection strategies
There are three strategies for detecting malware:
- Anomaly detection
- Misuse detection
- Hybrid detection
Anomaly detection involves modeling the normal behavior of a system or network in order to identify deviations from normal usage patterns. Anomaly-based techniques can detect even previously unknown attacks and can be used for defining signatures for misuse detectors.
The main problem with anomaly detection is that any deviation from the norm, even if it is a legitimate behavior, will be reported as an anomaly, thus producing a high rate of false positives.
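A minimal sketch of this idea, assuming a single numeric metric (logins per hour) and a simple mean-and-standard-deviation baseline. The metric, the sample values, and the threshold of three standard deviations are illustrative choices, not a recommended configuration.

```python
from statistics import mean, stdev

def fit_baseline(normal_values):
    """Model normal behavior as the mean and sample standard deviation."""
    return mean(normal_values), stdev(normal_values)

def is_anomaly(value, baseline, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    return abs(value - mu) > threshold * sigma

# Hypothetical observations collected during normal operation.
logins_per_hour = [4, 5, 6, 5, 4, 6, 5, 5]
baseline = fit_baseline(logins_per_hour)

print(is_anomaly(5, baseline))   # False: typical load
print(is_anomaly(40, baseline))  # True: could be a brute-force attempt
print(is_anomaly(9, baseline))   # True: flagged even if it is just a busy hour
```

The last call shows the false-positive problem in miniature: the detector has no way to tell a legitimate spike from an attack, only that both deviate from the baseline.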
Misuse detection, also known as signature-based detection, identifies only known attacks based on examples of their signatures. This technique has a lower rate of false positives but can’t detect zero-day attacks.
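Signature matching can be sketched as a lookup against known hashes and byte patterns. The signatures below are invented placeholders; real products use far larger and more sophisticated signature databases.

```python
import hashlib

# Hypothetical signature database: one full-file hash and one byte pattern.
KNOWN_BAD_HASHES = {hashlib.sha256(b"EVIL_PAYLOAD_V1").hexdigest()}
KNOWN_BAD_PATTERNS = [b"EVIL_PAYLOAD"]

def matches_signature(data: bytes) -> bool:
    """Return True if the data matches a known hash or contains a known pattern."""
    if hashlib.sha256(data).hexdigest() in KNOWN_BAD_HASHES:
        return True
    return any(pattern in data for pattern in KNOWN_BAD_PATTERNS)

print(matches_signature(b"header EVIL_PAYLOAD trailer"))  # True: known pattern
print(matches_signature(b"header EV1L_PAYL0AD trailer"))  # False: altered variant
```

The second call illustrates the limitation: a sample that differs from every stored signature passes unnoticed, which is why this approach misses zero-day attacks.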
A hybrid approach combines anomaly and misuse detection techniques in order to increase the number of detected intrusions while decreasing the number of false positives. It uses information from both harmful and clean programs to create a classifier – a set of rules or a detection model generated by the data mining algorithm. The anomaly detection component then searches for deviations from the normal profile, while the misuse detection component looks for malware signatures in the code.
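The combination can be illustrated with a deliberately simplified sketch that checks signatures first and falls back to a deviation test. Everything here is an assumption for illustration: the sample fields, the signature string, and the baseline numbers are hard-coded, whereas a real hybrid system would learn them from harmful and clean programs.

```python
def hybrid_verdict(sample, signatures, baseline, threshold=3.0):
    """Return 'known malware', 'suspicious', or 'clean' for one sample.

    sample:     dict with 'strings' (set of str) and 'api_calls_per_sec' (float)
    signatures: set of strings known to appear in malware (misuse detection)
    baseline:   (mean, standard deviation) of api_calls_per_sec for clean programs
    """
    # Misuse detection: any overlap with known signatures is a direct hit.
    if sample["strings"] & signatures:
        return "known malware"
    # Anomaly detection: large deviation from the clean-program profile.
    mean, std = baseline
    if abs(sample["api_calls_per_sec"] - mean) > threshold * std:
        return "suspicious"
    return "clean"

# Hypothetical signature set and clean-program profile.
signatures = {"evil_c2.example"}
baseline = (20.0, 5.0)

known = {"strings": {"evil_c2.example", "kernel32.dll"}, "api_calls_per_sec": 22.0}
novel = {"strings": {"kernel32.dll"}, "api_calls_per_sec": 400.0}
benign = {"strings": {"kernel32.dll"}, "api_calls_per_sec": 18.0}

for sample in (known, novel, benign):
    print(hybrid_verdict(sample, signatures, baseline))
```

Checking signatures first keeps confirmed detections precise, while the anomaly check gives unknown samples a chance to be caught and reviewed.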
When using data mining, malware detection consists of two steps:
- Extracting features
- Classifying and clustering
In the first step, various features such as API calls, n-grams, binary strings, and program behaviors are extracted to capture the characteristics of the file samples. Feature extraction can be performed with static analysis (without running the potentially harmful software) or dynamic analysis (observing the software as it runs). A hybrid approach that combines static and dynamic analysis may also be used.
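For the static case, byte n-grams are among the simplest features to extract. The sketch below counts overlapping n-grams in a byte string and turns them into a fixed-length vector; the function names and the tiny vocabulary are hypothetical.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping n-byte sequence in `data`."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def feature_vector(data: bytes, vocabulary, n: int = 2):
    """Map a sample to n-gram counts in a fixed vocabulary order."""
    counts = byte_ngrams(data, n)
    return [counts[gram] for gram in vocabulary]

# Hypothetical vocabulary chosen in advance from a training corpus.
vocabulary = [b"ab", b"ba", b"zz"]
print(feature_vector(b"abab", vocabulary))  # [2, 1, 0]
```

Because the sample is never executed, this kind of extraction is safe to run at scale, though it can be defeated by packing or obfuscation that changes the bytes without changing behavior.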
In the second step, file samples are sorted into groups based on analysis of their features, using either classification or clustering techniques.
To classify file samples, you need to build a classification model (a classifier) using classification algorithms such as RIPPER, Decision Tree (DT), Artificial Neural Network (ANN), Naive Bayes (NB), or Support Vector Machines (SVM). Clustering is used for grouping malware samples that have similar characteristics.
Using machine learning techniques, each classification algorithm constructs a model that represents both benign and malicious classes. Training a classifier using such file sample collection makes it possible to detect even newly released malware.
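To make the training step concrete, here is a small Bernoulli Naive Bayes classifier written with the standard library only. It is a sketch under assumptions: each sample is represented as a set of API call names, the four training samples are invented, and Laplace smoothing is one common choice among several.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """Build a model from (feature_set, label) pairs."""
    class_counts = Counter(label for _features, label in samples)
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in samples:
        vocabulary |= features
        feature_counts[label].update(features)
    return class_counts, feature_counts, vocabulary

def predict(model, features):
    """Return the label with the highest log-posterior for `features`."""
    class_counts, feature_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # class prior
        for f in vocabulary:
            # Laplace-smoothed probability that feature f appears in this class.
            p = (feature_counts[label][f] + 1) / (n + 2)
            score += math.log(p if f in features else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set: API calls observed per file sample.
training = [
    ({"CreateRemoteThread", "WriteProcessMemory", "RegSetValue"}, "malicious"),
    ({"WriteProcessMemory", "VirtualAllocEx"}, "malicious"),
    ({"CreateFile", "ReadFile"}, "benign"),
    ({"CreateFile", "RegSetValue", "ReadFile"}, "benign"),
]
model = train_naive_bayes(training)

# A combination of calls that never appeared together in training.
unseen = {"VirtualAllocEx", "CreateRemoteThread", "WriteProcessMemory"}
print(predict(model, unseen))  # malicious
```

The unseen sample matches no training example exactly, yet it is still assigned to the malicious class because its individual features are more typical of that class. This is the property that lets a trained classifier flag newly released malware.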
Note that the effectiveness of data mining techniques for malware detection critically depends on the features you extract and the categorization techniques you use.
Aside from detecting malware code, data mining can be effectively used to detect intrusions and analyze audit results to detect anomalous patterns. Malicious intrusions may include intrusions into networks, databases, servers, web clients, and operating systems.
There are two types of intrusion attacks you can detect using data mining methods:
- Host-based attacks, when the intruder focuses on a particular machine or a group of machines
- Network-based attacks, when the intruder attacks the entire network (for instance, causing a buffer overflow)
To detect host-based attacks, you need to analyze features extracted from programs, while to detect network-based attacks, you need to analyze network traffic. And just like with malware detection, you can look for either anomalous behavior or cases of misuse.
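For the network side, a very small traffic analysis might flag sources that contact an unusually large number of distinct ports. This sketch assumes connection records are already reduced to `(source_ip, destination_port)` pairs; the addresses and the limit of three ports are arbitrary illustration values.

```python
from collections import defaultdict

def find_port_scanners(connections, max_ports=3):
    """Return sources that contacted more than `max_ports` distinct ports."""
    ports_by_source = defaultdict(set)
    for source_ip, dest_port in connections:
        ports_by_source[source_ip].add(dest_port)
    return sorted(
        source for source, ports in ports_by_source.items()
        if len(ports) > max_ports
    )

# Hypothetical traffic: one host browsing, one host probing many ports.
traffic = [
    ("10.0.0.5", 443), ("10.0.0.5", 443), ("10.0.0.5", 80),
    ("10.0.0.9", 21), ("10.0.0.9", 22), ("10.0.0.9", 23),
    ("10.0.0.9", 80), ("10.0.0.9", 443),
]
print(find_port_scanners(traffic))  # ['10.0.0.9']
```

A fixed limit like this is a misuse-style rule; replacing it with a threshold learned from normal traffic would turn the same feature into an anomaly detector.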
You can detect various types of fraud using data mining techniques, from financial fraud to telecommunications fraud and computer intrusions. Fraudulent activities can be detected with the help of supervised and unsupervised learning.
With supervised learning, all available records are classified as either fraudulent or non-fraudulent. This classification is then used for training a model to detect possible fraud. The main drawback of this method is its inability to detect new types of attacks. Unsupervised learning methods, by contrast, help identify privacy and security issues in data without requiring records to be labeled in advance.
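As an example of working without labels, the following sketch flags unusual transaction amounts using the median absolute deviation (MAD). No record is marked as fraudulent beforehand. The amounts are invented, and the 0.6745 scaling constant with a cutoff of 3.5 is a widely used convention for this modified z-score rather than a requirement.

```python
from statistics import median

def mad_outliers(amounts, threshold=3.5):
    """Return amounts whose modified z-score exceeds `threshold`."""
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        return []  # no spread to measure deviations against
    return [a for a in amounts if 0.6745 * abs(a - med) / mad > threshold]

# Hypothetical card transactions for one customer.
amounts = [20, 25, 22, 19, 24, 21, 23, 950]
print(mad_outliers(amounts))  # [950]
```

Because the rule depends only on how far a value sits from the bulk of the data, it can surface a kind of fraud that never appeared in any labeled training set, at the cost of also flagging rare but legitimate purchases.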
Using data mining in cyber security lets you:
- Process large datasets faster
- Create a unique and effective model for each particular use case
- Apply certain data mining techniques to detect zero-day attacks
While this list of the benefits is impressive, there are also certain drawbacks you need to know about:
- Data mining is complex, resource-intensive, and expensive
- Building an appropriate classifier may be a challenge
- Potentially malicious files need to be inspected manually
- Classifiers need to be constantly updated to include samples of new malware
- There are certain data mining security issues, including the risk of unauthorized disclosure of sensitive information
Data mining helps you quickly analyze huge datasets and automatically discover hidden patterns, which is crucial when it comes to creating an effective anti-malware solution that’s able to detect previously unknown threats. However, the final result of using data mining methods always depends on the quality of data you use.
When using data mining in cyber security, it’s crucial to use only quality data. However, preparing databases for analysis requires a lot of time, effort, and resources. You need to clear all your records of duplicate, false, and incomplete information before working with them. Missing information, duplicate records, and errors can significantly decrease the effectiveness of complex data mining techniques. Only accurate and complete data can ensure high-quality analysis.
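A minimal cleaning pass might look like the sketch below, which drops incomplete records and duplicates before any mining takes place. The record layout (`sha256`, `label`, `size`) is a hypothetical schema chosen for illustration, and detecting false information would need domain-specific checks that are not shown.

```python
REQUIRED_FIELDS = ("sha256", "label", "size")

def clean_records(records):
    """Keep the first complete record for each sha256 value."""
    seen = set()
    cleaned = []
    for rec in records:
        # Skip records with a missing or empty required field.
        if any(rec.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue
        # Skip duplicates of a sample that was already kept.
        if rec["sha256"] in seen:
            continue
        seen.add(rec["sha256"])
        cleaned.append(rec)
    return cleaned

# Hypothetical sample catalog with one duplicate and one incomplete entry.
catalog = [
    {"sha256": "aa11", "label": "benign", "size": 1024},
    {"sha256": "aa11", "label": "benign", "size": 1024},
    {"sha256": "bb22", "label": "", "size": 2048},
    {"sha256": "cc33", "label": "malicious", "size": 512},
]
print(len(clean_records(catalog)))  # 2
```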
Data mining has great potential as a malware detection tool. It allows you to analyze huge sets of information and extract new knowledge from them.
The main benefit of using data mining techniques for detecting malicious software is the ability to identify both known and zero-day attacks. However, since a previously unknown but legitimate activity may also be marked as potentially malicious, the rate of false positives can be high.
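This trade-off can be quantified by comparing a detector's verdicts with ground truth. The sketch below computes the detection rate and the false positive rate; the labels and predictions are invented, and the function assumes at least one malicious and one benign sample are present.

```python
def detection_rates(actual, predicted):
    """Compute detection rate and false positive rate.

    actual / predicted: sequences of booleans, True meaning 'malicious'.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # malware caught
    fn = sum(1 for a, p in pairs if a and not p)      # malware missed
    fp = sum(1 for a, p in pairs if not a and p)      # benign flagged
    tn = sum(1 for a, p in pairs if not a and not p)  # benign passed
    return {
        "detection_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }

# Hypothetical evaluation set: 3 malicious and 5 benign samples.
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
print(detection_rates(actual, predicted))
```

Tracking both numbers together matters because tuning a detector to catch more unknown threats usually raises the false positive rate as well.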