NF-UNSW-Dataset to Spiking

Open onlysubgroup opened this issue 3 years ago • 11 comments

Hi,

I want to add the NSL-KDD dataset (https://www.kaggle.com/datasets/hassan06/nslkdd) to tonic and make it spiking. I found preprocessed, HDF5-formatted files of the dataset at https://github.com/kyuyeonpooh/nsl-kdd-autoencoder/tree/master/hdf5 . Any advice beyond what's in https://tonic.readthedocs.io/en/latest/about/contribute.html ?

Greetings & Thanks in advance, onlysubgroup

onlysubgroup avatar Jun 13 '22 15:06 onlysubgroup

Hey @onlysubgroup, that's really interesting! So far I've only worked with neuromorphic sensor datasets, so I'm curious where this leads. The fact that HDF5 files are available is already great. Let's hope GitHub won't block too many downloads. I guess the crucial question is how to make it spiking. Have you thought about this yet? Having had a quick look at NSL-KDD, it seems there are some 40 features per data point. That seems like a lot of dimensions per spike! In comparison, spiking audio datasets normally have (time, channel) and vision datasets have (x, y, time, polarity). Are you planning to make use of timestamps in your algorithm?

biphasic avatar Jun 14 '22 08:06 biphasic

Hey @biphasic ,

thanks for the reply. It does seem like an interesting task, and encoding the dataset into spikes will definitely be the hardest part. The problem is that the NSL-KDD dataset doesn't have any flow characteristics. Maybe I'll use the NetFlow datasets (https://arxiv.org/pdf/2011.09144) instead.

Cheers, onlysubgroup

onlysubgroup avatar Jun 15 '22 13:06 onlysubgroup

You will need some kind of sequential / temporal nature I'd say. Are you planning to use a specific model (like a spiking neural network)?

biphasic avatar Jun 15 '22 14:06 biphasic

Hi,

Current state-of-the-art datasets like the "NetFlow V2 Datasets" (https://staff.itee.uq.edu.au/marius/NIDS_datasets/#RA5) use NetFlows to identify intrusions, but each flow is still considered on its own. I will try to rate-code the proposed features and see if it makes sense to use SNNs in this context.
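
To make the rate-coding idea concrete, here is a minimal sketch in plain Python. It is only one possible scheme, not what the notebooks linked in this thread do: each numeric feature is min-max scaled to [0, 1] and then used as the per-time-step firing probability of one input neuron. The helper names, the toy feature values, and the number of time steps are all assumptions made for illustration.

```python
import random


def min_max_normalize(rows):
    """Scale every feature column to [0, 1] across all flows."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def rate_encode(features, n_steps=100, seed=0):
    """Return a 0/1 spike raster with n_steps rows and one column per feature.

    Each feature value in [0, 1] is treated as the probability that the
    corresponding input neuron spikes in a given time step.
    """
    rng = random.Random(seed)
    return [
        [1 if rng.random() < p else 0 for p in features]
        for _ in range(n_steps)
    ]


# Toy flows with three made-up numeric features each (hypothetical values).
flows = [[10.0, 200.0, 1.0], [50.0, 400.0, 3.0], [30.0, 300.0, 2.0]]
normalized = min_max_normalize(flows)

# Encode the second flow; its features are the column maxima, so p = 1.0.
raster = rate_encode(normalized[1], n_steps=200)
spike_counts = [sum(col) for col in zip(*raster)]
```

With this scheme a larger feature value simply produces more spikes over the window. Heavy-tailed features such as byte counts would probably need a log transform or clipping first, which the sketch leaves out.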

Cheers, onlysubgroup

onlysubgroup avatar Jun 20 '22 11:06 onlysubgroup

First tests with tonic and norse: https://colab.research.google.com/drive/140OvbVrlEN4Cy_uK18EBk15kb80g1060?usp=sharing

onlysubgroup avatar Jul 07 '22 17:07 onlysubgroup

I don't have NF-UNSW_train.h5, where did you get it from? Also, is this a single sample? What are the features encoded?

biphasic avatar Jul 08 '22 07:07 biphasic

Thanks for your reply. I should have also shared the preprocessing, my bad. It's still kind of messy, though: https://colab.research.google.com/drive/1KHxPQ21aRPAFfXT7ZgXxP9U_AsbCpDgi?usp=sharing

Basically, I did a 2/3 to 1/3 split on the dataset from: https://cloudstor.aarnet.edu.au/plus/s/N0JTc8JFNtZtUE4/download?path=%2F&files=NF-UNSW-NB15.csv and named the first 2/3 NF-UNSW_train.h5 and the remaining 1/3 test.h5. (I also dropped the IP address features because I haven't thought about how to encode them.)

I will automate the loading and preprocessing part later. First I will focus on making the model run, though.
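
For reference, the split described above can be sketched roughly as follows in plain Python. This is a simplified stand-in for the linked notebook, not a copy of it: it assumes the rows are already loaded as dictionaries, that the IP columns are named IPV4_SRC_ADDR and IPV4_DST_ADDR, and that the split is sequential without shuffling. The helper names and the toy rows are made up.

```python
def drop_columns(rows, columns):
    """Return copies of the rows without the given column names."""
    dropped = set(columns)
    return [{k: v for k, v in row.items() if k not in dropped} for row in rows]


def two_thirds_split(rows):
    """Split rows in order: first 2/3 for training, remaining 1/3 for testing."""
    cut = (len(rows) * 2) // 3  # integer arithmetic avoids float rounding
    return rows[:cut], rows[cut:]


# Six toy flows (hypothetical values) standing in for the CSV rows.
rows = [
    {"IPV4_SRC_ADDR": f"10.0.0.{i}", "IPV4_DST_ADDR": "10.0.1.1",
     "IN_BYTES": 100 * i, "Label": i % 2}
    for i in range(6)
]

rows = drop_columns(rows, ["IPV4_SRC_ADDR", "IPV4_DST_ADDR"])
train_rows, test_rows = two_thirds_split(rows)
```

A sequential split keeps the original row order, so if the CSV is sorted by attack type or time, the two parts may have quite different label distributions.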

onlysubgroup avatar Jul 08 '22 10:07 onlysubgroup

Looking at the different features, I think one option would be to combine some of them.

How about you encode IPV4_SRC_ADDR and L4_SRC_PORT as the source neuron location? Basically, get a list of all combinations of those two columns, and that's the number of neurons you encode it with. Same for IPV4_DST_ADDR and L4_DST_PORT for the target neurons.

Then you make them fire at certain times. Maybe OUT_BYTES, OUT_PKTS and FLOW_DURATION_MILLISECONDS could be combined into a rate code of spikes? Plus connections in the other direction for IN_BYTES and IN_PKTS.

Not sure what to do about PROTOCOL, L7_PROTO, TCP_FLAGS.

I assume you want to learn to predict the label/attack.
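
A rough sketch of what I mean, in plain Python. The helper names, the bytes_per_spike constant, and the toy flows are made up for illustration; only the column names come from the dataset. Every distinct (address, port) pair gets its own neuron index, and the firing rate is derived from OUT_BYTES and FLOW_DURATION_MILLISECONDS.

```python
def build_neuron_index(rows, addr_key, port_key):
    """Assign one neuron index to every distinct (address, port) pair."""
    index = {}
    for row in rows:
        pair = (row[addr_key], row[port_key])
        if pair not in index:
            index[pair] = len(index)
    return index


def firing_rate_hz(row, bytes_per_spike=100):
    """Hypothetical rate: one spike per `bytes_per_spike` outgoing bytes,
    spread evenly over the flow duration."""
    # Treat zero-length flows as 1 ms to avoid division by zero.
    duration_s = max(row["FLOW_DURATION_MILLISECONDS"], 1) / 1000.0
    return row["OUT_BYTES"] / bytes_per_spike / duration_s


# Toy flows with made-up values.
flows = [
    {"IPV4_SRC_ADDR": "10.0.0.1", "L4_SRC_PORT": 443,
     "OUT_BYTES": 1000, "FLOW_DURATION_MILLISECONDS": 500},
    {"IPV4_SRC_ADDR": "10.0.0.2", "L4_SRC_PORT": 80,
     "OUT_BYTES": 200, "FLOW_DURATION_MILLISECONDS": 1000},
    {"IPV4_SRC_ADDR": "10.0.0.1", "L4_SRC_PORT": 443,
     "OUT_BYTES": 4000, "FLOW_DURATION_MILLISECONDS": 2000},
]

source_index = build_neuron_index(flows, "IPV4_SRC_ADDR", "L4_SRC_PORT")
source_neurons = [
    source_index[(f["IPV4_SRC_ADDR"], f["L4_SRC_PORT"])] for f in flows
]
rates = [firing_rate_hz(f) for f in flows]
```

The same index builder would work for IPV4_DST_ADDR and L4_DST_PORT. Note that the number of neurons grows with the number of distinct pairs, which could get large on the full dataset.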

biphasic avatar Jul 11 '22 08:07 biphasic

Thanks for the reply and the ideas. Yes, I want to predict the label/attack. I will try it out and keep you up to date.

onlysubgroup avatar Jul 12 '22 09:07 onlysubgroup

https://colab.research.google.com/drive/1DdIGoK8Hc0dokRnh_iaLGPAeXIODzbKY?usp=sharing A minimal example of NF-UNSW to spiking seems to work. I haven't checked the results for plausibility, though.

onlysubgroup avatar Jul 12 '22 13:07 onlysubgroup

Hey,

I looked at the NetFlow paper again (https://arxiv.org/pdf/2011.09144.pdf). The authors advise against encoding the src/dst IP and port because it can lead to overfitting. I will just rate-code the other features and see if I can get results similar to those reported in the paper.

Cheers, onlysubgroup

onlysubgroup avatar Aug 02 '22 13:08 onlysubgroup

Closing this for now. Please get in touch on the Discord channel if you have further questions!

biphasic avatar Dec 27 '22 09:12 biphasic