cic-ids-2018-intrusion-detection-classification
                                
                                 cic-ids-2018-intrusion-detection-classification copied to clipboard
                                
                                    cic-ids-2018-intrusion-detection-classification copied to clipboard
                            
                            
                            
                        Baseline experiments on training a Decision Tree Classifier and a Random Forest Classifier using Grid Search with Cross Validation on the CIC IDS 2018 dataset for training Machine Learning network int...
cic-ids-2018-intrusion-detection-classification
Operating System: Ubuntu 18.04 (you may face issues importing the packages from the requirements.yml file if your OS differs).
Baseline experiments on training a Decision Tree Classifier and a Random Forest Classifier using Grid Search with Cross Validation on the CIC IDS 2018 dataset (official website) [1] for training Machine Learning network intrusion detection classifier models.
I downloaded the Processed Traffic Data for ML Algorithms CIC IDS 2018 dataset via aws s3 sync --no-sign-request --region <your-region> "s3://cse-cic-ids2018/Processed Traffic Data for ML Algorithms/" <your-dest-dir> . After exploring the data I decided to only use the "Friday-02-03-2018_TrafficForML_CICFlowMeter.csv" and "Friday-16-02-2018_TrafficForML_CICFlowMeter.csv" as the other files had extremely imbalanced labels which would require Anomaly Detection methods. The chosen files contained the following labels:
- Benign
- Bot
- DoS attacks-SlowHTTPTest
- DoS attacks-Hulk
The "Bot", "DoS attacks-SlowHTTPTest", and "DoS attacks-Hulk" labels were combined into one "Malicious" label. After removing some columns, missing values, and duplicate records the processed dataset "processed_friday_dataset.csv" ended up with 1,074,342 "Benign" records and 290,089 "Malicious" records. The dataset was later split into a 70/30 train/test split which a decision tree classifier and random forest classifier were trained using Grid Search with 5-fold Cross Validation.
Model results on test set
1- Decision Tree Classifier
- Accuracy: 99.982%
- Macro Average Precision: 99.969%
- Macro Average Recall: 99.978%
- Macro Average F1-Score: 99.974%

2- Random Forest Classifier
- Accuracy: 99.904%
- Macro Average Precision: 99.787%
- Macro Average Recall: 99.926%
- Macro Average F1-Score: 99.856%

Dataset
- CIC-IDS-2018 Processed Traffic Data for ML Algorithms on my Google Drive: Drive
- Processed Friday dataset "processed_friday_dataset.csv" that was used in the baseline experiments: Drive
References
[1] Sharafaldin, Iman & Habibi Lashkari, Arash & Ghorbani, Ali. (2018). Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. 108-116. 10.5220/0006639801080116.
Hardware Specifications
- i9-9900KF Intel CPU overclocked to 5 GHz.
- 32 Gigabytes DDR4 RAM at 3200 MHz.