A Hybrid Feature Selection Approach based on Random Forest and Particle Swarm Optimization for IoT Network Traffic Analysis

░ ABSTRACT - The complexity and volume of network traffic have increased significantly with the emergence of the Internet of Things (IoT). The classification accuracy of network traffic depends on selecting the most pertinent features. In this paper, we present a hybrid feature selection method that combines Random Forest (RF) and Particle Swarm Optimization (PSO). The method is evaluated on the CIC-IDS2017 dataset, which contains a large number of attacks and traffic instances. To improve the classification accuracy, we use the RF algorithm to identify the most important features. The PSO algorithm is then used to refine the selection. In our experiments, the proposed method achieved higher classification accuracy than the other methods considered, reaching ~99.9% accuracy with the hybrid of Random Forest and PSO. The hybrid approach also helps improve the model's performance. The suggested method can be used by security analysts and network administrators to identify and prevent attacks on the IoT.

░ 1. INTRODUCTION
The emergence of the IoT has transformed how people work and live. It allows devices, sensors, software, and objects to interact with each other and exchange data without human intervention, providing new opportunities to perform actions and improve efficiency [1]. Widespread adoption has increased the complexity of managing and analyzing the data these devices generate. Classifying and analyzing the traffic generated by the IoT network is essential for ensuring its security and efficiency. Unfortunately, traditional methods cannot handle the unique challenges of this type of traffic, so new techniques are required [2]-[4].
Due to the proliferation of devices and applications using the IoT, the data collected by these devices has grown in both volume and complexity. This has prompted the development of methods that can analyze and process the data. One of these is machine learning (ML), which can be used to classify the traffic in the network [5]. Network traffic classification identifies the various types of traffic sent and received by IoT devices. This information can then be used to enhance the performance of network-dependent applications.
Using ML, one can analyze and forecast network traffic conditions.
The accuracy of ML techniques depends on the type of data they are dealing with and on the features used. The high dimensionality of IoT network traffic data makes it challenging to select the optimal subset of features [6]. Traditional methods, such as feature elimination and correlation-based selection, may not pick the right features effectively. The complexity of the data collected by the IoT network has therefore motivated the development of hybrid models, which combine different feature selection techniques to improve the classification accuracy.
Because ML can learn from vast amounts of data, it can be used to perform traffic classification on the IoT network. It can help identify relationships and patterns that are not easily detected by humans. For instance, by analyzing features of traffic such as its protocol type and destination address, ML can help identify malicious and anomalous traffic. The success of traffic classification using ML depends on the selection of the most relevant features. This step reduces the overall data dimensionality and removes redundant or irrelevant features. Feature selection techniques fall into three families: embedded, filter, and wrapper methods [7], [8].
This paper introduces a hybrid feature selection method that combines particle swarm optimization and random forest techniques. The method helps improve the accuracy of the classification process by identifying the most suitable subset of features. The RF algorithm measures the importance of each feature by calculating the reduction in impurity achieved by splitting on it. The top-ranking features are then selected for the classification process. The PSO method is then used to search iteratively for the optimal feature subset. This method can help reduce the overall dimensionality of the data and improve the accuracy [9], [10].
The proposed method is evaluated on the Canadian Institute for Cybersecurity Intrusion Detection Evaluation Dataset 2017 (CIC-IDS2017). It has about 2.8 million network flows and 80 features, including destination IP, payload, source IP, and protocol. The proposed method is compared with various other ML techniques, such as Random Forest, Naïve Bayes, AdaBoost, and ensemble methods. The results of the evaluation show that the proposed method is significantly better than the traditional methods at identifying the optimal features for the IoT network traffic analysis dataset. It has a 99.5% accuracy, which surpasses the accuracy of the other techniques. It also demonstrated that it can identify various types of network traffic.
The classification of network traffic for the Internet of Things (IoT) has become more challenging due to the increasing complexity of the data collected. Traditional feature selection methods cannot handle the dynamic and high-dimensional nature of the data. Furthermore, the number of traffic instances and attacks in the network data makes the process more complex. Instead of using traditional methods, we use a hybrid approach that combines particle swarm optimization and random forest techniques. This method can identify the most critical features of the collected data by taking into account its complex relationships and high dimensionality. Integrating the PSO algorithm into the process can improve classification accuracy by selecting features more precisely, and the global optimization capabilities of PSO also help improve the performance of the system. Through the combination of PSO and RF, our method achieves a classification accuracy of almost 99.9%.
The hybrid approach provides several advantages over the traditional methods. Firstly, it takes into account the unique features of the network traffic and handles its volume and complexity. Secondly, by integrating PSO and RF, it can pick suitable features while accounting for both the high dimensionality and the complex relationships of the data. In our experiments, the proposed method outperforms the methods it is compared against. It gives network administrators and security analysts a high level of accuracy in identifying and preventing attacks on networks of IoT devices, and it can help improve the overall security of IoT systems by protecting them from potential threats.
The proposed method can be used for various applications, such as traffic engineering and intrusion detection. It can also be used to identify anomalies and patterns in the traffic data collected by the IoT network, and it can help improve the performance of network applications by identifying malicious traffic. By identifying the most suitable subset of features, it improves the accuracy of the classification process, reduces the overall dimensionality of the data, and removes redundant and noisy features.

░ 2. RELATED WORK
Due to the growing number of Internet of Things (IoT) applications and devices, network analysis has become a vital part of ensuring the security and optimal management of the network. Traditional tools cannot handle the massive amount of traffic flowing through the network. In order to accurately classify and analyze the traffic, new techniques have to be developed.
J. Mocnej et al. [11] explore the characteristics of the network traffic of IoT applications, such as smart homes and healthcare. The authors identified trends and patterns in the data collected from these devices and found that different applications have their own unique network traffic patterns. Their analysis of IoT traffic characteristics can help develop better ML models for analyzing and classifying the traffic. D. H. Hoang et al. [12] proposed a method based on principal component analysis (PCA) to detect network traffic anomalies, combining PCA with threshold-based techniques to analyze the data. Y. Amar et al. [13] analyzed the network traffic data collected from various IoT devices inside a home setting. They discovered that different kinds of devices exhibited varying behavior and patterns, which makes their paper a valuable guide for developing ML models that analyze and classify the traffic of IoT devices.
F. C. Kuo et al. [14] present an analysis of a traffic management scheme that can be implemented using LTE wireless access to enhance the performance of IoT applications, and they test the proposed scheme's effectiveness. P. Kuppusamy et al. [15] introduce a framework that aims to improve the performance of data processing and traffic control in IoT systems. They tested the framework's effectiveness by implementing it in various IoT applications.
M. R. Shahid et al. [16] develop a method that identifies the types of devices in the IoT ecosystem from the characteristics of their network traffic. The authors extracted features such as packet length, packet arrival time, and inter-arrival timing, and used ML techniques such as Random Forests and Decision Trees to classify the different kinds of devices in the study. They found that Random Forests performed better than the other algorithms.
P. Gowtham et al. [17] proposed an approach that uses IoT and GPS technologies to monitor the real-time traffic clearance of emergency vehicles and predict their arrival time at the destination. The authors created a system in which a microcontroller, a GPS module, and a wireless communication device are attached to an emergency vehicle. The system tracks the vehicle's location and sends this data to a remote server, which calculates the expected arrival time of the vehicle at the destination and alerts the relevant authorities if the estimated time is delayed.
B. Charyyev et al. [18] propose an approach for classifying the events that occur in the network traffic of IoT applications. The researchers collected a dataset of traffic from various IoT devices and analyzed it with ML techniques including Naive Bayes, Random Forests, and Decision Trees. They found that the Random Forests algorithm performed better than the other algorithms at identifying six different IoT events. The suggested method can be used to develop effective event detection systems for smart cities and homes. B. Mohammed et al. [19] present an edge computing framework for classifying network traffic in the IoT. It uses ML techniques to identify the most relevant features and improve its accuracy.
A. K. M. Al-Qurabat et al. [20] propose a method for managing the traffic in the data pipeline of smart agriculture using minimum description length (MDL) compression. This reduces the amount of information transmitted over the network and helps minimize communication overhead and energy consumption. A further study presents a novel method for identifying the presence of IoT devices in network traffic through capsule networks. The suggested approach achieves high accuracy even in scenarios with traffic congestion and noise.
Y. Ashibani et al. [21] present a framework for establishing user authentication on IoT networks using app event patterns. It uses ML techniques to identify and authenticate users based on abnormal app events. The study's findings indicate that the model is highly accurate at identifying individuals. H. Azath et al. [22] propose a method that uses capsule networks to identify IoT devices from network traffic. The suggested approach can achieve high accuracy in complex environments, such as when there is traffic congestion or noise. H. Gebrye et al. [23] present a method for extracting and labeling traffic data from an IoT network, which can be used in developing ML techniques to detect attacks. R. R. Chowdhury et al. [24] present a deep learning method for identifying the presence of IoT devices in network traffic. It maintains high precision even in scenarios involving traffic congestion and noise. P. Khandait et al. [25] propose a method that classifies IoT network traffic by identifying the specific keywords used by the devices. This method uses ML techniques to improve its accuracy and reduce computational resources.
The papers reviewed here discuss various techniques for managing the traffic in an IoT network. They use deep learning and ML to analyze and classify the data, and some increase the network's efficiency by implementing these techniques in edge devices. Together they give an overview of recent techniques and approaches for the classification and analysis of IoT traffic, and they highlight that the choice of ML algorithm and the feature selection step are crucial to accurate results. The proposed methods and techniques can be used to develop robust and efficient systems for a wide range of applications, including event detection.

░ 3. FEATURE SELECTION
In the development of ML-based systems, the selection of features is a crucial step. It can aid in reducing the dimensionality of the data and enhancing the precision of the analysis [26], [27]. The methods that can be used for this process fall into embedded, wrapper, and filter categories. This section discusses two approaches: the Gini Index and the hybrid approach.

Gini Index
The Gini index is a standard feature selection criterion that captures the uncertainty, or degree of class mixing, in a set of elements. The measure is known as the Gini impurity, and it shows how often a randomly chosen element from the set would be incorrectly labeled if it were labeled at random according to the class distribution of the set. The index is computed as one minus the sum of the squared class probabilities. In algorithms such as Random Forest, the Gini index can be used to select features. In RF, a set of decision trees is trained on various subsets of the data to produce a final prediction. The method can also measure the importance of a feature by calculating the reduction in impurity that splitting on the feature achieves.
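As a minimal sketch of the computation described above, the following Python snippet evaluates the Gini impurity of a set of labels and the impurity reduction of a candidate split. The function names and the toy labels are illustrative assumptions and do not come from the paper.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: one minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Reduction in Gini impurity when parent is split into left and right."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children

# Toy example (hypothetical labels): a perfectly separating split.
parent = ["attack", "attack", "benign", "benign"]
print(gini_impurity(parent))  # 0.5
print(impurity_decrease(parent, ["attack", "attack"], ["benign", "benign"]))  # 0.5
```

A feature whose splits produce large impurity reductions across the trees of a forest would receive a high importance score under this criterion.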
The selection process in RF can be affected by redundant or noisy features in the data. This paper therefore uses a hybrid strategy that combines the characteristics of multiple methods, based on the Particle Swarm Optimization and RF algorithms. Particle swarm optimization (PSO) is inspired by the social behaviour of schooling fish and flocking birds. It optimizes a cost function by continuously adjusting a set of particles in the search space. Each particle represents a potential solution, and its position is updated according to its own best position and the best positions found by its neighbors. For classification tasks, this method has been shown to help improve the accuracy and efficiency of the process.
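The particle update can be illustrated with a short sketch. The version below is the common global-best variant of PSO with an inertia weight, applied to a simple quadratic cost. The parameter values (`w`, `c1`, `c2`), the search range, and the cost function are assumptions chosen for illustration, not settings reported in the paper.

```python
import random

def pso_minimize(cost, dim, n_particles=20, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize cost over R^dim with a basic global-best particle swarm."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # best position seen by each particle
    pbest_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]  # best position seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost

# Hypothetical cost function: squared distance from the origin.
best, best_cost = pso_minimize(lambda x: sum(v * v for v in x), dim=3)
print(best_cost)
```

In feature selection the cost would instead be derived from classification performance on a candidate feature subset.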

RF-PSO
The proposed PSO-RF hybrid feature selection method combines the two concepts to select the feature subset that best enhances the precision of the analysis. The method uses RF to rank the features according to their importance. The top-ranked features are then selected as the starting point of the selection process. PSO then adjusts the feature subset iteratively according to the classification accuracy, and the process is repeated until the optimal feature subset is found. The hybrid approach combines the advantages of the PSO and RF methods: it can reduce the data's overall dimensionality, eliminate redundant or noisy features, and, by optimizing the feature subset, increase the classification accuracy. Feature selection is a crucial step in ML-driven analysis, and various techniques can be used to find the ideal subset of features. One of the most common is the Gini index, and the RF technique can additionally rank the importance of the features. Hybrid techniques that combine multiple approaches can lead to better results. The proposed method combines the PSO and RF techniques to improve the accuracy of the selection process and enable the ML analysis to perform better, as shown in table 2.
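The two-stage procedure can be sketched as follows. This is not the authors' implementation: it assumes a binary PSO with a sigmoid transfer function, in which each particle is a 0/1 mask over the features. The importance scores and the fitness function are hypothetical placeholders; in practice the scores would come from a trained random forest and the fitness would be a classifier's validation accuracy on the masked features.

```python
import math
import random

def binary_pso_select(fitness, n_features, init_mask=None, n_particles=10,
                      n_iters=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for a 0/1 feature mask that maximizes fitness(mask)."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    if init_mask is not None:
        pos[0] = list(init_mask)          # seed one particle with the RF top-k mask
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-6.0, min(6.0, v))        # keep the sigmoid well behaved
                prob_one = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob_one else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

# Stage 1 (placeholder): hypothetical RF importance scores for six features.
importances = [0.30, 0.05, 0.25, 0.10, 0.02, 0.28]
top_k = 3
ranked = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)[:top_k]
init_mask = [1 if j in ranked else 0 for j in range(len(importances))]

# Stage 2: toy fitness that rewards two "informative" features and penalizes subset size.
def toy_fitness(mask, informative=(0, 2)):
    return sum(mask[j] for j in informative) - 0.1 * sum(mask)

mask, score = binary_pso_select(toy_fitness, len(importances), init_mask=init_mask)
print(mask, score)
```

Seeding one particle with the top-ranked features means the search never returns a subset that scores worse than the RF ranking alone.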

Dataset
The proposed framework for analyzing the traffic patterns of the IoT network using a combination of particle swarm optimization and random forest techniques is evaluated on the CIC-IDS2017 dataset, as shown in figure 1 [28]. The dataset consists of over 2.8 million network flows and includes 80 features. These features describe the network traffic and include source and destination IP addresses, source and destination ports, protocol type, flow duration, packet and byte counts, and others.

Pre-processing
Prior to implementing the proposed framework, the data must undergo pre-processing to ensure its quality. This process involves removing missing or inconsistent information, as well as encoding certain features. Missing data can result in inaccurate predictions or biased results, and inconsistent information can also affect a result's accuracy. Categorical features are encoded using one-hot encoding, which makes them easy for ML tools to process: each category is transformed into a binary vector, as shown in figure 2. The features ranked highest on the importance scale are then selected as input to the PSO algorithm. The algorithm performs a series of iterative updates to find the optimal subset, using the classification accuracy to judge each candidate subset. The model is then trained using the selected subset of features. Classification is carried out using the Support Vector Machine algorithm, a well-known ML method used in various applications.
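The one-hot encoding step described in this section can be sketched in a few lines. The column values below are hypothetical examples of a categorical protocol field and are not taken from the dataset.

```python
def one_hot_encode(values):
    """Map each categorical value to a binary vector with a single 1."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        vectors.append(row)
    return categories, vectors

# Hypothetical categorical column.
protocols = ["tcp", "udp", "icmp", "tcp"]
categories, encoded = one_hot_encode(protocols)
print(categories)  # ['icmp', 'tcp', 'udp']
print(encoded)     # [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```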

Evaluation metrics
The proposed framework's performance is evaluated using accuracy, recall, F1 score, and precision. Accuracy measures the overall correctness of the classification results. Precision measures the fraction of instances predicted as a class that truly belong to it, and recall measures the fraction of instances of a class that are correctly identified. The F1 score is the harmonic mean of precision and recall and balances the two. The proposed framework is compared with other methods used for feature selection, such as the recursive elimination method and the correlation-based method.
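For reference, the four metrics can be computed from the counts of true and false positives and negatives. The sketch below handles the binary case, and the label vectors are made-up illustrations rather than results from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels: 1 = attack, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

For a multi-class problem such as attack-type classification, these per-class values are typically averaged across classes.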
The proposed framework is compared against the other methods using the performance metrics described above. The results indicate that the hybrid approach, which combines the PSO and RF methods, performed better than the traditional methods in terms of accuracy, recall, F1 score, and precision. The proposed framework achieved an accuracy of almost one hundred percent, which is significantly better than the accuracy of the other methods. Its high precision and recall also show that it can identify various types of network traffic.

AdaBoost
AdaBoost combines the outputs of $T$ weak classifiers through a weighted vote:
$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) \qquad \text{(eq. 1)}$$
where $\alpha_t$ is the weight assigned to the $t$-th weak classifier $h_t$, and sign is the sign function that returns +1 or -1 depending on the sign of its argument. The $\alpha_t$ are computed based on the accuracy of the weak classifier $h_t$ on the training set.
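The weighted vote in eq. 1 can be illustrated with a small sketch. The source only states that the weights depend on each weak classifier's accuracy. The formula used below, half the log-odds of the classifier's error rate, is the standard AdaBoost choice and is an assumption here, as are the error rates.

```python
import math

def weak_classifier_weight(error):
    """Standard AdaBoost weight (assumed): 0.5 * ln((1 - error) / error)."""
    return 0.5 * math.log((1.0 - error) / error)

def adaboost_predict(alphas, weak_outputs):
    """Sign of the weighted sum of weak-classifier votes (each +1 or -1)."""
    total = sum(a * h for a, h in zip(alphas, weak_outputs))
    return 1 if total >= 0 else -1

# Hypothetical training error rates of three weak classifiers.
errors = [0.1, 0.3, 0.4]
alphas = [weak_classifier_weight(e) for e in errors]

# The most accurate classifier votes +1 and outweighs the two weaker -1 votes.
print(adaboost_predict(alphas, [+1, -1, -1]))  # 1
```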

Random Forest
Another popular algorithm for ensemble learning is Random Forest, which combines several decision trees to produce a strong classifier. Each tree is trained on a random subset of the features and instances, which improves generalization and reduces the correlation between trees. The algorithm then combines the trees' predictions by voting to produce the final prediction.
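The three ingredients mentioned above (random instances, random features, and voting) can be sketched without a tree learner. The helpers below are illustrative assumptions: they show how a bootstrap sample and a feature subset might be drawn for each tree and how per-tree predictions are combined, but they do not train decision trees.

```python
import random
from collections import Counter

def bootstrap_sample(n_rows, rng):
    """Row indices drawn with replacement, one sample per tree."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, k, rng):
    """k distinct feature indices that a tree is allowed to use."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(tree_predictions):
    """Final prediction: the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = bootstrap_sample(10, rng)
features = random_feature_subset(80, 8, rng)  # 80 features as in the dataset; k=8 is arbitrary
print(majority_vote(["attack", "benign", "attack"]))  # attack
```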

Naive Bayes
The Naive Bayes algorithm is a probabilistic classifier that estimates the probability of each class given the features of an instance. It assumes that the features are conditionally independent of one another given the class.
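Under the independence assumption, the decision rule takes the following standard form. The symbols are introduced here for illustration: $c$ denotes a class, $x_1, \dots, x_n$ the feature values of an instance, and $\hat{c}$ the predicted class.

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The prior $P(c)$ and the per-feature likelihoods $P(x_i \mid c)$ are estimated from the training data.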

░ 4. RESULTS AND OUTPUTS
The evaluation parameters of the various algorithms with the Gini Index and RF+PSO feature selection techniques are shown in figures 3, 4, 5, and 6. The feature selection method using RF and PSO appears to be effective in selecting relevant features for classification.

░ 5. CONCLUSION AND FUTURE SCOPE
In this paper, a hybrid feature selection approach based on Random Forest (RF) and Particle Swarm Optimization (PSO) was proposed for IoT network traffic analysis. The aim was to improve the classification accuracy of IoT network traffic data by selecting relevant features using the proposed approach. The study compared the performance of four classification algorithms: AdaBoost Classifier, Random Forest (RF), Naive Bayes (NB), and Hybrid Ensemble. The results showed that the proposed approach was effective in selecting relevant features, and the Hybrid Ensemble method achieved the highest accuracy of 99.98%. The study demonstrates the potential of using AI and ML techniques for IoT network traffic analysis. The proposed approach can be useful in detecting anomalies and malicious traffic in IoT networks, thereby enhancing the security of IoT systems. Future research can explore the potential of other feature selection methods and classification algorithms for IoT network traffic analysis. Additionally, the study can be extended to explore the use of deep learning techniques for IoT network traffic analysis. Finally, the proposed approach can be tested on other IoT datasets to validate its effectiveness and generalizability.

Figure 2: Encoded data

Following preprocessing, the data are separated into training and testing sets. The former is used to train a machine learning model, while the latter is used to assess its performance. The training set comprises 70% of the data, while the testing set accounts for 30%. The framework ranks the features in the data set using the importance scores generated by the RF algorithm and then uses the PSO method to select the most accurate subset of them. The importance score for each feature is computed from the reduction in impurity achieved by splitting the data on that feature.
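A 70/30 split of this kind can be sketched as below. Shuffling before the split and the fixed seed are assumptions made for the illustration. The source does not say whether the split was randomized or stratified.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and cut it into training and testing parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Ten placeholder records standing in for network flows.
train, test = train_test_split(list(range(10)))
print(len(train), len(test))  # 7 3
```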

Table 1: Major related work in table view.