Comparison of Machine Learning Algorithms to Detect RPL-Based IoT Device Vulnerabilities


Machine Learning Algorithms Used in Attack Detection

During preprocessing, the data set obtained from the simulation with the malicious node was labeled 1 and the data set from the simulation with normal nodes was labeled 0, and the two sets were combined. This combined data set will be used to compare the classification algorithms. The machine learning algorithms to be compared are defined on this page.
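The labeling and merging step described above can be sketched with pandas; the column names below are illustrative placeholders, not the actual feature names extracted from the simulations.

```python
import pandas as pd

# Hypothetical feature frames extracted from the two simulations
malicious = pd.DataFrame({"packet_count": [120, 340], "avg_packet_len": [90.5, 88.1]})
normal = pd.DataFrame({"packet_count": [40, 55], "avg_packet_len": [61.2, 63.8]})

# Label: 1 for the simulation with the malicious node, 0 for normal nodes
malicious["label"] = 1
normal["label"] = 0

# Combine into a single data set for the classification algorithms
dataset = pd.concat([malicious, normal], ignore_index=True)
print(dataset.shape)  # (4, 3)
```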

Logistic Regression

For logistic regression, there must be one or more independent variables in a data set that affect the outcome. Logistic regression gives us the probability that a sample belongs to a class. Thus, logistic regression is suitable for data labeled with 1 or 0.

The mathematical computation underlying logistic regression is based on probabilities. The relationship between the independent variables and the outcome can first be modeled with a linear equation.

To define logistic regression, it is necessary to introduce linear regression first. A data set with a dependent variable y and an independent variable x can be expressed with a linear equation as in equation 3.1:

y_i=\beta_0+\beta_1x_i+\varepsilon_i    (3.1)

Here, y_i is the dependent variable, \beta_0 the intercept term, \beta_1 the coefficient (slope), \varepsilon_i the error term, and x_i the independent variable. If there is more than one independent variable in the data set, the linear function will be as in equation 3.2:

y_i=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\cdots+\beta_nx_n+\varepsilon_i    (3.2)

If we want the result of the linear equation to lie in the range 0 to 1, it must be passed through the sigmoid function. The probability that the dependent variable y equals 1 is then given by equation 3.3:

P\left(y\right)=\frac{1}{1+e^{-\left(\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\cdots+\beta_nx_n+\varepsilon_i\right)}}    (3.3)

The logistic regression algorithm, based on the operation in equation 3.3, is therefore a suitable machine learning algorithm for our data set classified with 0 and 1 (Bonaccorso, 2017).
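As a minimal sketch of equation 3.3, the sigmoid maps any linear combination of features into (0, 1); the coefficients and feature value below are invented for illustration, not fitted values from the thesis experiments.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients: beta_0 (intercept) and beta_1 (slope)
beta0, beta1 = -4.0, 0.05

# Probability that a sample with feature value x belongs to class 1
x = 120  # e.g. packets observed in a 1-second frame
p = sigmoid(beta0 + beta1 * x)
print(round(p, 3))  # 0.881
```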

Decision Trees

Decision trees are a powerful algorithm that builds a tree structure, branching according to how strongly the independent variables influence the classification. First, the most informative feature in the data set is determined, and then one of its branches is selected. This procedure is repeated until the classification target is reached.

If a split is defined as the tuple in equation 3.4:

\sigma=\left\langle i;t_k\right\rangle (3.4)

The first element is the index of the feature used to split the data set at a particular node (only at the beginning is it the entire data set; the number of samples decreases after each step), and the second is the threshold that determines the left and right branches. The choice of the best threshold determines the structure of the tree and is therefore a key factor in its performance. The aim is to reduce the residual impurity in the fewest divisions, so that the decision path between the sample data and the classification result is very short. Taking both branches into account, a measure of total impurity is defined as follows:

I\left(D,\sigma\right)=\frac{N_{left}}{N_D}I\left(D_{left}\right)+\frac{N_{right}}{N_D}I\left(D_{right}\right)    (3.5)

Here, D is the entire data set at the selected node, D_{left} and D_{right} are the resulting subsets (obtained by applying the selection tuple \sigma), and I is the impurity measure.

The Gini impurity index is defined as in equation 3.6:

I_{Gini}\left(j\right)=\sum_{i}{p\left(i\middle| j\right)\left(1-p\left(i\middle| j\right)\right)}    (3.6)

Here, the sum is always extended to all classes. Given a data set, Gini impurity measures the probability of misclassification if a random label is chosen using the branch’s probability distribution. The index reaches a minimum (0.0) when all instances of a node are classified in a single category.

The cross-entropy measure is defined as in equation 3.7:

I_{Cross-Entropy}\left(j\right)=-\sum_{i}{p\left(i\middle| j\right)\log{p\left(i\middle| j\right)}}    (3.7)

This measure is based on information theory. It is maximal when the distribution between classes is uniform (one of the worst cases in decision trees, because it means many decision steps remain before the final classification) and zero when a split contains samples belonging to only one class. This index is very similar to Gini impurity but, more formally, cross-entropy chooses the split that minimizes uncertainty about the classification, while Gini impurity minimizes the probability of misclassification (Bonaccorso, 2017).
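Equations 3.6 and 3.7 can be computed directly from a node's class probabilities; the two example distributions below (a pure node and a uniform two-class node) are illustrative.

```python
import math

def gini(probs):
    # I_Gini = sum over classes of p * (1 - p)  (equation 3.6)
    return sum(p * (1 - p) for p in probs)

def cross_entropy(probs):
    # I = sum over classes of -p * log2(p), with 0*log(0) treated as 0  (equation 3.7)
    return sum(-p * math.log2(p) for p in probs if p > 0)

pure = [1.0, 0.0]     # all samples of the node belong to one class
uniform = [0.5, 0.5]  # worst case: uniform distribution between classes

print(gini(pure), cross_entropy(pure))        # 0.0 0.0 -> minimum impurity
print(gini(uniform), cross_entropy(uniform))  # 0.5 1.0 -> maximum for two classes
```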

Decision trees will be a suitable classification algorithm for the obtained data set.

Random Forest

A random forest is a set of decision trees built on random samples, with a different policy for splitting a node: instead of searching for the best split over all features, each tree considers only a random subset of features and finds the threshold within that subset that best separates the data. As a result, there are many more weakly trained trees, each producing a different estimate. There are two ways to interpret these results; the more common approach is based on majority voting (the class with the most votes is taken as correct). Even if the two approaches differ in theory, the averaged probabilities of a trained random forest cannot be very different from the majority of the predictions (otherwise there would have to be different stable points); therefore, the two methods often lead to comparable results.

The concept of feature importance can also be applied to random forests, as in equation 3.8, by averaging over all trees in the forest (Bonaccorso, 2017):

Importance\left(x_i\right)=\frac{1}{N_{Trees}}\sum_{t}\sum_{k}\frac{N_k}{N}\Delta I_{x_i}    (3.8)

Random Forest is also a suitable machine learning algorithm to train our dataset.
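The majority-voting and probability-averaging interpretations described above can be sketched without any library; the per-tree predictions below are made-up outputs of five weakly trained trees for a single sample.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Combine the hard class votes of the individual trees into one prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]

def average_probability(tree_probs):
    """Alternative interpretation: average the per-tree probabilities for class 1."""
    return sum(tree_probs) / len(tree_probs)

# Hypothetical outputs of five trees for one sample
votes = [1, 0, 1, 1, 0]            # hard class votes
probs = [0.9, 0.4, 0.7, 0.8, 0.3]  # per-tree P(class = 1)

print(majority_vote(votes))              # 1
print(average_probability(probs) > 0.5)  # True -> same decision as the vote
```

As the text notes, the two interpretations usually agree: here the average probability (0.62) crosses the 0.5 threshold exactly when the majority of trees vote for class 1.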

Naive Bayes

Naive Bayes is a family of powerful and easy-to-train classifiers that determine the probability of an outcome given a set of conditions using Bayes' theorem. With Bayes' theorem, the conditional probabilities are inverted so that the query can be expressed as a function of measurable quantities. For two events A and B, a relation as in equation 3.9 can be established using the multiplication rule with the marginal probabilities P(A) and P(B) and the conditional probabilities P(A|B) and P(B|A):

P\left(A\cap B\right)=P\left(A\middle| B\right)P\left(B\right)

P\left(B\cap A\right)=P\left(B\middle| A\right)P\left(A\right)    (3.9)

Since intersection is commutative, the left-hand sides are equal; so Bayes' theorem can be derived as in equation 3.10:

P\left(A\middle| B\right)=\frac{P\left(B\middle| A\right)P\left(A\right)}{P\left(B\right)}    (3.10)

This formula has profound philosophical implications and is an essential element of statistical learning (Bonaccorso, 2017).

The Naive Bayes algorithm is also a suitable machine learning algorithm for our data set.
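Equation 3.10 can be checked numerically; the probabilities below are invented purely for illustration (e.g. A = "node is malicious", B = "packet rate is high"), not measured values from the simulations.

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)  (equation 3.10)."""
    return p_b_given_a * p_a / p_b

# Hypothetical values: 10% of nodes are malicious, 90% of malicious nodes
# show a high packet rate, and 25% of all nodes show a high packet rate.
p_malicious_given_high = bayes(p_b_given_a=0.9, p_a=0.1, p_b=0.25)
print(round(p_malicious_given_high, 2))  # 0.36
```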

KNN - K Nearest Neighbor

The KNN algorithm assumes that similar samples are close to each other. Its positive aspects are that it is easy to implement, resistant to noisy data, and effective on large training sets. Its negative aspects are that it needs the "k" parameter at the start, it is not obvious which distance measure gives the best result, and its computational cost is high (Bilgin & Yılmaz, 2018). In other words, similar things are close to each other. The class of each query point is decided by a simple majority vote among its nearest neighbors, usually measured with the Euclidean distance (equation 3.11):

\left|AB\right|=\sqrt{\left(x_2-x_1\right)^2+\left(y_2-y_1\right)^2}    (3.11)

A query point is assigned to the class that has the most representatives among the point's nearest neighbors. The number of neighbors is an integer k specified by the user. The optimal choice of k is important and depends on the data: in general, a larger k suppresses the effect of noise but makes the classification boundaries less distinct (Bonaccorso, 2017).

The KNN classification algorithm is a suitable classification algorithm for the data set.
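A minimal KNN sketch using the Euclidean distance of equation 3.11; the 2-D training points, their labels, and k = 3 are all illustrative.

```python
import math
from collections import Counter

def euclidean(a, b):
    # |AB| = sqrt((x2 - x1)^2 + (y2 - y1)^2), generalized to any dimension
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical 2-D feature points: (feature_1, feature_2) -> label
train = [((1, 1), 0), ((2, 1), 0), ((1, 2), 0), ((8, 8), 1), ((9, 8), 1)]
print(knn_predict(train, (1.5, 1.5), k=3))  # 0
print(knn_predict(train, (8.5, 8.0), k=3))  # 1
```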

Artificial Neural Networks

An artificial neural network (ANN) is a directed structure that connects an input layer to an output layer. The general vector function of an artificial neural network is as in equation 3.12:

\bar{y}=f\left(\bar{x}\right)    (3.12)

Here,

\bar{x}=\left(x_1,x_2,x_3,\ldots,x_n\right)\ \mathrm{and}\ \bar{y}=\left(y_1,y_2,y_3,\ldots,y_m\right)    (3.13)

The concept of “neural” consists of two important elements: the internal structure of the basic unit of calculation and the connections between them. In general, the schematic representation of artificial neural networks is given in Figure 3.8.

Figure 3.8 Schematic Representation of Neural Networks.

A neuron nucleus is connected to n input channels, each characterized by a synaptic weight w_i. Each input component is multiplied by the corresponding weight, and the products are summed. An optional bias can be added to this sum (it works like any other weight attached to a unitary input). The resulting sum is filtered by the activation function f_a, and the output is generated. To follow a different and more effective path than linear models such as logistic regression, it is necessary to build a multilayer perceptron.

The following figure shows a schematic representation of a multilayer perceptron with an n-dimensional input, p hidden neurons, and a k-dimensional output:

Figure 3.9 Representation of Multidimensional Artificial Neural Networks.
 
Figure 3.9 shows four layers: the input layer, which receives the input vector x_1, ..., x_n; the two hidden layers; and the output layer, which is responsible for producing the output y_1, ..., y_k. As shown in the figure, each neuron is connected to all the neurons of the next layer, and there are three weight matrices, W_ij, H_ij and M_ij, using the convention that the first index refers to the previous layer and the second to the next layer.
 
Therefore, the net input and corresponding output value z_j of each hidden neuron are calculated as in equation 3.14:

z_j=f_a\left(\sum_{i=1}^{n}{w_{ij}x_i}+b_j\right)    (3.14)
 
Likewise, the network output is calculated from the outputs of the last hidden layer, as in equation 3.15:

y_k=f_a\left(\sum_{j=1}^{p}{m_{jk}z_j}+b_k\right)    (3.15)
As can be seen, the network is highly non-linear, and this feature allows the modeling of complex scenarios that are impossible to manage with linear methods. However, determining all the synaptic weights and bias values requires f_a(x) to be differentiable.
 
First, an error (loss) function is defined as the sum of squared errors between the network outputs y_i and the targets t_i (equation 3.16):

L=\frac{1}{2}\sum_{i}\left(y_i-t_i\right)^2    (3.16)
This function depends on all the variables (weights and bias values); a backward calculation is then performed to determine how much each variable should change, that is, the partial derivatives of the loss. For a single layer, this takes the form of equation 3.17, which is extended to all layers:

\frac{\partial L}{\partial w_{ij}}=\alpha_j x_i    (3.17)

Here, the alpha term (proportional to the error delta) propagates back from the output layer to the hidden ones. If there are many hidden layers, this procedure is repeated iteratively down to the first layer. The algorithm adopts the gradient descent method, iteratively updating the weights until convergence. Thus, machine learning is performed on the data at hand (Bonaccorso, 2017).
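The gradient-descent update described above can be sketched for a single sigmoid neuron with a squared-error loss; the tiny data set and learning rate are illustrative, and a real multilayer network repeats the backward step once per layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One-neuron "network": y_hat = sigmoid(w * x + b)
w, b, lr = 0.0, 0.0, 0.5
data = [(0.0, 0), (1.0, 1)]  # (input, target) pairs

for _ in range(2000):  # iterate until (approximate) convergence
    for x, t in data:
        y = sigmoid(w * x + b)
        # chain rule for L = 0.5 * (y - t)^2: delta is the error term
        # that would be propagated backward in a multilayer network
        delta = (y - t) * y * (1 - y)
        w -= lr * delta * x  # dL/dw = delta * x
        b -= lr * delta      # dL/db = delta

# After training, input 0 maps below 0.5 and input 1 maps above 0.5
print(sigmoid(b) < 0.5, sigmoid(w + b) > 0.5)  # True True
```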

Artificial neural networks allow very complex classifications and are a very powerful algorithm. However, training an artificial neural network takes longer than the other algorithms and requires more computation. It is very likely to learn the data set obtained in this thesis and detect the attacks. However, the purpose of this thesis is to determine a simple and fast method.

References

1. Bonaccorso, G. (2017). Machine Learning Algorithms. Birmingham: Packt Publishing Ltd.

2. Bilgin, M., & Yılmaz, A. (2018). Makine Öğrenmesi. İstanbul: Papatya Bilim.


