tonic
NF-UNSW-Dataset to Spiking
Hi,
I want to add the NSL-KDD dataset (https://www.kaggle.com/datasets/hassan06/nslkdd) to tonic and make it spiking. I found HDF5-formatted, preprocessed files of the dataset at https://github.com/kyuyeonpooh/nsl-kdd-autoencoder/tree/master/hdf5 . Any advice beyond what's in https://tonic.readthedocs.io/en/latest/about/contribute.html ?
Greetings & Thanks in advance, onlysubgroup
Hey @onlysubgroup, that's really interesting! So far I've only worked with neuromorphic sensor datasets, so I'm curious where this leads. The fact that HDF5 files are available is already great. Let's hope GitHub won't block too many downloads. I guess the crucial question is how to make it spiking. Have you thought about this yet? Having had a quick look at NSL-KDD, it seems there are some 40 features per data point. That seems like a lot of dimensions per spike! In comparison, spiking audio datasets normally have (time, channel) and vision datasets have (x, y, time, polarity). Are you planning to make use of timestamps in your algorithm?
Hey @biphasic ,
thanks for the reply. Indeed, it seems like an interesting task. Encoding the set into spikes will definitely be the hardest part. The problem is that the NSL-KDD dataset doesn't have any flow characteristics. Maybe I'll use the NetFlow datasets (https://arxiv.org/pdf/2011.09144) instead.
Cheers, onlysubgroup
You will need some kind of sequential / temporal nature I'd say. Are you planning to use a specific model (like a spiking neural network)?
Hi,
current state-of-the-art datasets like the "NetFlow V2 Datasets" (https://staff.itee.uq.edu.au/marius/NIDS_datasets/#RA5) use NetFlows to identify intrusions. But each flow is still considered on its own. I will try to rate-code the proposed features and see if it makes sense to use SNNs in this context.
Cheers, onlysubgroup
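For readers unfamiliar with rate coding, here is a minimal sketch of one common variant (Bernoulli/Poisson-style), not necessarily what the linked notebooks do. Each feature is min-max scaled to a per-step spike probability, and a spike is drawn independently at every time step. The function name `rate_code`, the parameters `n_steps` and `max_prob`, and the example values are assumptions made for illustration; the scaling bounds are meant to be computed over the training set, not per flow.

```python
import numpy as np

def rate_code(flow, feat_min, feat_max, n_steps=100, max_prob=0.5, seed=0):
    """Turn one flow (n_features,) into a spike train (n_steps, n_features).

    Hypothetical helper: feat_min / feat_max are per-feature bounds that
    are assumed to have been computed over the training set beforehand.
    """
    rng = np.random.default_rng(seed)
    flow = np.asarray(flow, dtype=float)
    feat_min = np.asarray(feat_min, dtype=float)
    feat_max = np.asarray(feat_max, dtype=float)
    # Guard against constant features (max == min) to avoid division by zero.
    span = np.maximum(feat_max - feat_min, 1e-12)
    # Per-step spike probability in [0, max_prob] for every feature.
    prob = np.clip((flow - feat_min) / span, 0.0, 1.0) * max_prob
    # One independent Bernoulli draw per (time step, feature).
    return (rng.random((n_steps, flow.size)) < prob).astype(np.uint8)

# Made-up example: three features at 0 %, 50 % and 100 % of their range.
spikes = rate_code([0.0, 50.0, 100.0], [0.0, 0.0, 0.0], [100.0, 100.0, 100.0], n_steps=1000)
```

The resulting array has the (time, channel) layout mentioned earlier in the thread, with one channel per flow feature. Whether the artificial time axis carries any useful information for an SNN is exactly the open question here.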
First tests with tonic and norse: https://colab.research.google.com/drive/140OvbVrlEN4Cy_uK18EBk15kb80g1060?usp=sharing
I don't have NF-UNSW_train.h5, where did you get it from? Also, is this a single sample? What are the features encoded?
Thanks for your reply. I should have also shared the preprocessing, my bad. It's still kind of messy, though: https://colab.research.google.com/drive/1KHxPQ21aRPAFfXT7ZgXxP9U_AsbCpDgi?usp=sharing
Basically, I did a 2/3 to 1/3 split on the dataset from https://cloudstor.aarnet.edu.au/plus/s/N0JTc8JFNtZtUE4/download?path=%2F&files=NF-UNSW-NB15.csv and named the first 2/3 NF-UNSW_train.h5 and the remaining 1/3 test.h5. (I also dropped the IP address features because I haven't thought about how to encode them.)
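A rough sketch of that preprocessing step, using only the standard library so it runs without the real CSV. It assumes a sequential (unshuffled) split, since the post speaks of the "first 2/3", and it assumes the dropped columns are `IPV4_SRC_ADDR` and `IPV4_DST_ADDR`. The helper name `split_flows` and the sample rows are made up; writing the two parts to HDF5 is left out.

```python
import csv
import io

# Assumed names of the IP address columns that get dropped.
DROPPED = ("IPV4_SRC_ADDR", "IPV4_DST_ADDR")

def split_flows(csv_file, dropped=DROPPED):
    """Read flows, drop the IP columns, return (first 2/3, remaining 1/3)."""
    rows = [
        {key: value for key, value in row.items() if key not in dropped}
        for row in csv.DictReader(csv_file)
    ]
    # Integer arithmetic avoids float rounding at the split point.
    cut = (2 * len(rows)) // 3
    return rows[:cut], rows[cut:]

# Made-up stand-in for NF-UNSW-NB15.csv with only a few columns.
sample = io.StringIO(
    "IPV4_SRC_ADDR,L4_SRC_PORT,IPV4_DST_ADDR,L4_DST_PORT,IN_BYTES,Label\n"
    "10.0.0.1,443,10.0.0.9,80,100,0\n"
    "10.0.0.1,443,10.0.0.9,22,250,1\n"
    "10.0.0.2,5000,10.0.0.9,80,75,0\n"
    "10.0.0.3,5001,10.0.0.9,80,60,0\n"
    "10.0.0.4,5002,10.0.0.9,22,900,1\n"
    "10.0.0.5,5003,10.0.0.9,80,40,0\n"
)
train, test = split_flows(sample)
```

If the CSV is ordered by time or by attack type, a sequential split can leave some attack classes out of one of the two parts, so checking the label distribution of both parts is probably worthwhile.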
I will automate the loading and preprocessing part later. First I will focus on making the model run, though.
Looking at the different features, I think one option would be to combine some of them.
How about you encode IPV4_SRC_ADDR and L4_SRC_PORT as source neuron location? Basically get a list of all combinations of those columns and that's the amount of neurons you encode it with. Same for IPV4_DST_ADDR and L4_DST_PORT for the target neurons.
Then you make them fire at certain times, maybe OUT_BYTES, OUT_PKTS and FLOW_DURATION_MILLISECONDS could be combined to a rate code of spikes? Plus connections in the other direction for IN_BYTES, IN_PKTS
Not sure what to do about PROTOCOL, L7_PROTO, TCP_FLAGS.
I assume you want to learn to predict the label/attack.
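To make the suggestion above concrete, here is a minimal sketch of the address/port idea: every distinct (address, port) combination gets its own neuron index, and a firing rate is derived from the packet count and flow duration. The helper names `build_neuron_index` and `firing_rate_hz`, the packets-per-second formula, and the sample flows are all assumptions; this is one possible reading of the proposal, not a tested encoding.

```python
def build_neuron_index(flows, addr_key, port_key):
    """Map each distinct (address, port) pair to a neuron index."""
    pairs = sorted({(flow[addr_key], flow[port_key]) for flow in flows})
    return {pair: index for index, pair in enumerate(pairs)}

def firing_rate_hz(packets, duration_ms):
    """Assumed rate: packets per second; zero-length flows get rate 0."""
    if duration_ms <= 0:
        return 0.0
    return packets / (duration_ms / 1000.0)

# Made-up flows using the NetFlow column names from the thread.
flows = [
    {"IPV4_SRC_ADDR": "10.0.0.1", "L4_SRC_PORT": 443,
     "IPV4_DST_ADDR": "10.0.0.9", "L4_DST_PORT": 80,
     "OUT_PKTS": 20, "FLOW_DURATION_MILLISECONDS": 500},
    {"IPV4_SRC_ADDR": "10.0.0.1", "L4_SRC_PORT": 443,
     "IPV4_DST_ADDR": "10.0.0.9", "L4_DST_PORT": 22,
     "OUT_PKTS": 5, "FLOW_DURATION_MILLISECONDS": 0},
    {"IPV4_SRC_ADDR": "10.0.0.2", "L4_SRC_PORT": 5000,
     "IPV4_DST_ADDR": "10.0.0.9", "L4_DST_PORT": 80,
     "OUT_PKTS": 3, "FLOW_DURATION_MILLISECONDS": 1000},
]

src_index = build_neuron_index(flows, "IPV4_SRC_ADDR", "L4_SRC_PORT")
dst_index = build_neuron_index(flows, "IPV4_DST_ADDR", "L4_DST_PORT")

# Source neuron, target neuron and outgoing rate for every flow.
src_neurons = [src_index[(f["IPV4_SRC_ADDR"], f["L4_SRC_PORT"])] for f in flows]
dst_neurons = [dst_index[(f["IPV4_DST_ADDR"], f["L4_DST_PORT"])] for f in flows]
rates = [firing_rate_hz(f["OUT_PKTS"], f["FLOW_DURATION_MILLISECONDS"]) for f in flows]
```

Note that the number of neurons grows with the number of distinct (address, port) pairs, which can get very large on a real capture.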
Thanks for the reply and the ideas. Yes, I want to predict the label/attack. I will try it out and keep you up to date.
https://colab.research.google.com/drive/1DdIGoK8Hc0dokRnh_iaLGPAeXIODzbKY?usp=sharing A minimal example of NF-UNSW to spiking seems to work. I haven't checked the results for plausibility, though.
Hey,
I looked up the NetFlow paper again (https://arxiv.org/pdf/2011.09144.pdf). The authors advise against encoding the source/destination IP addresses and ports because it can lead to overfitting. I will just rate-code the other features and see if I can get results similar to those in the paper.
Cheers, onlysubgroup
Closing this for now. Please get in touch on the Discord channel if you have further questions!