
Could you provide the input data format?

Open velpc opened this issue 3 years ago • 15 comments

Could you give a detailed explanation of the hdf5 instance format, and of the pyg, dgl, nx, or dict graph object formats?

velpc avatar Apr 08 '21 02:04 velpc

Hi, we have detailed how the large graph datasets are stored in a unified hdf5 graph data format we use here. The API format of pyg, dgl, nx, or dict graph objects are created when the graphs are being loaded (check source code here).

jzhou316 avatar Apr 08 '21 20:04 jzhou316

Thank you so much! They are very timely and helpful. Could you also provide information on how the xx_split_idx.pkl file is generated from the dataset, and on its storage format?

velpc avatar Apr 11 '21 12:04 velpc

The xx_split_idx.pkl stores indexes of how to split the original large graph dataset in the HDF5 format into train/validation/test sets. It is a dictionary with keys "train", "val", and "test", where each value is a list of graph id numbers in the corresponding subset. We use this to split the original single graph dataset into separate storage for train/validation/test in preprocessing as in here. For our dataset, these splits are randomly generated based on the total number of graphs in the dataset with ratio 8:1:1 for train/validation/test, and fixed thereafter for community use.
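The split file described above can be sketched as follows. This is a hypothetical reconstruction based only on the description (the file name, graph count, and seed are illustrative, not the repo's actual values): a dict with "train"/"val"/"test" keys holding graph id lists, generated randomly at ratio 8:1:1 and then fixed.

```python
import pickle
import random

# Hypothetical sketch of building a split index dict in the same shape
# as xx_split_idx.pkl; num_graphs and the seed are example values only.
num_graphs = 100
ids = list(range(num_graphs))
random.seed(0)          # fix the random split so it stays reproducible
random.shuffle(ids)

n_train = int(num_graphs * 0.8)   # 8:1:1 train/val/test ratio
n_val = int(num_graphs * 0.1)
split_idx = {
    "train": ids[:n_train],
    "val": ids[n_train:n_train + n_val],
    "test": ids[n_train + n_val:],
}

with open("example_split_idx.pkl", "wb") as f:
    pickle.dump(split_idx, f)
```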

jzhou316 avatar Apr 12 '21 14:04 jzhou316

Thanks for the clean datasets! One issue I have regarding the data specification:

graph_data_storage.md specifies x as node signals/features, but I can't find these in any of the hdf5 files. Furthermore, README.md suggests these are featureless graphs. Can you clarify?

jackd avatar Apr 12 '21 22:04 jackd

Yes x stores the node features. As our graphs are featureless, we do not have them in the raw data. However, the GNN algorithms need some values to operate with, in order to propagate information through the topology. We simply add a dummy all-one vector in x (this is done in our data processing when the dataset is constructed from the raw data), meaning that all the nodes are treated homogeneously and we focus on learning purely on topology for our botnet dataset.

Also note that the data format we created for large graph datasets could be easily extended with other special graph attributes based on your problems. Hope this is helpful!
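The dummy-feature idea above amounts to something like the following sketch (the array name and shape are assumptions for illustration, not the repo's exact code): every node gets the same all-one feature, so the GNN has values to propagate while learning rests purely on topology.

```python
import numpy as np

# Hypothetical sketch: the graphs are featureless, so attach a dummy
# all-one feature vector x of shape (num_nodes, 1). All nodes are
# treated identically; only the topology carries information.
num_nodes = 5
x = np.ones((num_nodes, 1), dtype=np.float32)
```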

jzhou316 avatar Apr 12 '21 23:04 jzhou316

Thanks @jzhou316, that makes perfect sense - but it might be a nice addition to graph_data_storage.md :).

jackd avatar Apr 12 '21 23:04 jackd

Cool, I'll add some details.

jzhou316 avatar Apr 13 '21 00:04 jzhou316

The detailed instructions are very helpful. How should we set num_evils and num_evils_avg if our problem is multi-class classification rather than binary classification (evil/non-evil)?

velpc avatar Apr 13 '21 01:04 velpc

@velpc These are dataset statistics stored in the HDF5 file (and may not be used by the model). For different specific problems such as multiclassification, you can write your own data following our format with your other dataset attributes. For example, you could have attributes such as "num_class_0" "num_class_1" "num_class_2" etc. to describe the dataset. We have some example code of writing these attributes here. Hope this answers your question!
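The per-class attributes suggested above could be written roughly like this. This is a hedged sketch, not the repo's actual writer code: the file name, graph contents, and class counts are placeholders, and only the attribute-naming idea ("num_class_0", "num_class_1", ...) comes from the comment.

```python
import h5py
import numpy as np

# Hypothetical sketch: store per-class node counts as HDF5 attributes
# for a multi-class dataset, mirroring how num_evils is an attribute
# in the original binary-class format.
with h5py.File("example_multiclass.hdf5", "w") as f:
    f.attrs["num_graphs"] = 1                    # dataset-level statistic
    g = f.create_group("0")                      # graph with id '0'
    g.create_dataset("edge_index", data=np.array([[0, 1], [1, 0]]))
    g.attrs["num_nodes"] = 2
    for c, count in enumerate([10, 5, 3]):       # example counts per class
        g.attrs[f"num_class_{c}"] = count
```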

jzhou316 avatar Apr 14 '21 00:04 jzhou316

Hi @jzhou316, is there a much smaller dataset that can be used for quick testing of the algorithm? I wanted to try it out with a smaller subset without having to download the ones specified in the dataset_botnet.py file. Thanks

iohelder avatar May 05 '21 08:05 iohelder

@helmoai Sorry that we currently don't have an official mini dataset for quick testing. Could you download the data and take out a subset (e.g. a few graphs) to run the mini-test? Otherwise I could generate a smaller subset from one of the datasets for you.
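Taking out a subset as suggested above could look like the following sketch. The file names are placeholders, and a toy "full" dataset is built first only so the snippet is self-contained; with the real data you would open the downloaded HDF5 file directly.

```python
import h5py
import numpy as np

# Build a toy "full" dataset with 10 tiny graphs so this sketch runs
# on its own (with the real data, skip this step and open the file).
with h5py.File("full_dataset.hdf5", "w") as f:
    f.attrs["num_graphs"] = 10
    for i in range(10):
        g = f.create_group(str(i))
        g.create_dataset("edge_index", data=np.array([[0, 1], [1, 0]]))
        g.attrs["num_nodes"] = 2

# Copy the first few graph groups into a much smaller file for testing.
num_mini = 3
with h5py.File("full_dataset.hdf5", "r") as src, \
     h5py.File("mini_dataset.hdf5", "w") as dst:
    for i in range(num_mini):
        src.copy(str(i), dst)                # copy graph groups '0'..'2'
    dst.attrs["num_graphs"] = num_mini       # keep the dataset count consistent
```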

jzhou316 avatar May 05 '21 15:05 jzhou316

> Hi, we have detailed how the large graph datasets are stored in a unified hdf5 graph data format we use here. The API format of pyg, dgl, nx, or dict graph objects are created when the graphs are being loaded (check source code here).

Found an issue in the code you gave for reading the hdf5 files here. I think you missed the h5py.File call when opening the file. It should be:

import h5py
with h5py.File('filename', "r") as f:
    e = f['0']['edge_index'][()]             # take out the edge indexes from the first graph with id '0'
    num_nodes = f['0'].attrs['num_nodes']    # access the statistics stored in attributes of the first graph with id '0'
    num_graphs = f.attrs['num_graphs']       # access the statistics stored in attributes of the dataset file

iohelder avatar Jul 09 '21 07:07 iohelder

@helmoai yes you are right. Thanks for pointing it out! Updated it.

jzhou316 avatar Jul 09 '21 13:07 jzhou316

In scatter_ of common.py, the call scatter_(src, index, 0, out, dim_size, fill_value) passes 6 arguments, but the function signature only accepts 2-5 parameters.

whxuexi avatar May 24 '22 08:05 whxuexi

> Yes x stores the node features. As our graphs are featureless, we do not have them in the raw data. However, the GNN algorithms need some values to operate with, in order to propagate information through the topology. We simply add a dummy all-one vector in x (this is done in our data processing when the dataset is constructed from the raw data), meaning that all the nodes are treated homogeneously and we focus on learning purely on topology for our botnet dataset.
>
> Also note that the data format we created for large graph datasets could be easily extended with other special graph attributes based on your problems. Hope this is helpful!

I've been implementing this on a different network dataset and noticed a few gotchas related to this. If you use the botgen/ code to generate your data, it already adds the dummy vector, so setting add_nfeat_ones=True to add it again at training time causes an error. Additionally, the botgen code does not add is_directed or self_directed to the data, so you will need to add those manually.
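Patching in the missing attributes could look like the sketch below. This is an assumption-heavy illustration: the file name is a placeholder, a toy file is created first only to make the snippet runnable, and whether is_directed / self_directed belong on the file or on each graph group depends on your data, so adjust accordingly.

```python
import h5py
import numpy as np

# Toy stand-in for a file produced by the generation code, which lacks
# the is_directed / self_directed attributes (placeholder file name).
with h5py.File("generated_dataset.hdf5", "w") as f:
    g = f.create_group("0")
    g.create_dataset("edge_index", data=np.array([[0, 1], [1, 0]]))

# Patch the missing attributes onto every graph group after the fact.
with h5py.File("generated_dataset.hdf5", "a") as f:
    for gid in f.keys():
        f[gid].attrs["is_directed"] = False
        f[gid].attrs["self_directed"] = False
```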

tillson avatar Dec 11 '23 03:12 tillson