
Write better documentation on how to use the mlflow module, and add video examples on how to retrain it on your traffic

Open AlyaGomaa opened this issue 1 year ago • 3 comments

Created by Alya Gomaa via monday.com integration. 🎉

AlyaGomaa avatar Jul 02 '24 17:07 AlyaGomaa

Hey @eldraco, I read the detection_modules.md file. It has just one line about the ML flows module. I can add a paragraph about it, similar to the other detections. I need more information on how the ML flows module works. We can add a video later once we get traffic that triggers the module.

Looking at the code, this is what I understood (not everything will go in the documentation):

  1. The ML flow module is a binary classifier.
  2. It is trained on sets of packets; if none are found, two hardcoded packets are used for training.
  3. The trained model is saved in a bin file for later prediction.
  4. Training packets have features like "icmp, icmp-v6, udp, tcp, source ip, destination ip, total packets sent, total bytes".
  5. Packets without port numbers are not part of the features.
  6. The module uses an SGD classifier for training and testing, via the partial_fit method.
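
To check my understanding of points 3 and 6, here is a minimal sketch of training with partial_fit and saving the model to a bin file for later prediction. This is not the Slips code: the feature columns, the label encoding (0 = benign, 1 = malicious), the file name `model.bin`, and the use of `pickle` are all my assumptions.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical numeric features per flow: [duration_s, total_packets, total_kbytes]
X = np.array([[0.5, 10.0, 1.2], [30.0, 400.0, 90.0]])
y = np.array([0, 1])  # assumed encoding: 0 = benign, 1 = malicious

clf = SGDClassifier(random_state=0)
# The first partial_fit call must be told every class that can ever appear.
clf.partial_fit(X, y, classes=np.array([0, 1]))

# Save the trained model to a bin file, then load it back for prediction.
model_path = os.path.join(tempfile.mkdtemp(), "model.bin")  # hypothetical path
with open(model_path, "wb") as f:
    pickle.dump(clf, f)
with open(model_path, "rb") as f:
    restored = pickle.load(f)

# The restored model should predict exactly like the one kept in memory.
same_predictions = np.array_equal(clf.predict(X), restored.predict(X))
```

If this matches what the module does, the documentation could describe training and testing as two separate runs that share only the saved file.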

patel-lay avatar Jul 17 '24 23:07 patel-lay

Yes, please add a new section to that file, but not only one paragraph, because it will not be enough to explain everything. I suggest that you write about it while you use it and get comfortable with it.

  1. It uses a scaler object to scale all the values to the same range, like a normalization. It is fitted during training and stored in a bin file, so it can be used later in testing. This is important because the scale of the training data should be the same as the scale of the testing data. However, new testing data may arrive outside the scale, so we are careful.
  2. The model used now is a SGDClassifier() from scikit-learn, which is a linear classifier. This is not a good model, but we use it to test the idea and because scikit-learn can do transfer-learning using this model.
  3. ICMP and other protocols are not features. Read the code: these are protocols to ignore. The rest are features.
  4. Features must be numeric for now. So no categorical features are used.
  5. We use transfer learning to incorporate the packets from users into the model deployed within Slips when you clone the repo. This means that you can 'extend' the original model with your data without losing the capabilities of the original model.
  6. Src IP and dst IP are never used, so the model is not biased towards a particular IP.
  7. The two hardcoded flows must be used, because when you 'transfer-learning' your traffic at home you need some 'benign' and some 'malicious' flows to have a classification model. If you try to train a classification model with only one class, it will fail. So this gives a very small flow as a 'starting' point.
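
The points above could be sketched roughly as follows. This is only an illustration under assumptions, not the module's actual code: the feature columns, the label encoding, the seed values, the `update` helper, and the choice of `StandardScaler` are all hypothetical. scikit-learn itself calls this use of partial_fit incremental learning.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

CLASSES = np.array([0, 1])  # assumed encoding: 0 = benign, 1 = malicious

# Two seed flows, one per class, standing in for the hardcoded ones.
# Hypothetical numeric columns: [duration_s, total_packets, total_kbytes]
seed_X = np.array([[1.0, 5.0, 0.4], [60.0, 900.0, 75.0]])
seed_y = np.array([0, 1])

scaler = StandardScaler()  # which scaler Slips uses is an assumption here
clf = SGDClassifier(random_state=0)

def update(X, y):
    """Extend the scaler statistics and the model with a new batch of flows."""
    scaler.partial_fit(X)
    clf.partial_fit(scaler.transform(X), y, classes=CLASSES)

# Start from the seed flows so both classes are represented.
update(seed_X, seed_y)

# Later, a user's own traffic may contain only benign flows.
user_X = np.array([[2.0, 12.0, 1.1], [0.3, 4.0, 0.2], [5.0, 40.0, 3.5]])
user_y = np.array([0, 0, 0])
update(user_X, user_y)

# Testing reuses the same fitted scaler before predicting.
predictions = clf.predict(scaler.transform(seed_X))
```

Declaring both classes up front is what lets the second, benign-only batch be added without an error. In this sketch the scaler statistics shift with each batch, so flows seen earlier were scaled slightly differently from later ones.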

eldraco avatar Jul 18 '24 12:07 eldraco

Also, be careful: it was reported that training is currently failing, so I'm fixing it.

eldraco avatar Jul 18 '24 12:07 eldraco