DikeDataset

What should I do from labeling step 3?

Open recsater opened this issue 2 years ago • 3 comments

To compute the membership on each malware family, a transformer was developed (see the observation above) to "vote" for each available family. For example, if an antivirus engine tag was Trj, then one vote for the trojan family was offered. All tags were consumed in this way and the votes for all families were normalized.

(see the observation above)

https://github.com/iosifache/dike/blob/main/codebase/scripts/continuous_vt_scan.py

I followed this link, but I don't understand what to do from labeling step 3 onward.

What should I do from labeling step 3?

From

image

To

image

recsater avatar Jun 04 '22 07:06 recsater

Hi, @recsater. The script only deals with dumping the raw data into a CSV file from Google Cloud Storage. After completing the scanning step, you need to create your own labeling strategy or adapt dike's. You can check dike's implementation in the update_malware_labels function of the dataset module. There, the votes and tags are processed to obtain the malice and the membership in each family.
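As a rough illustration of the voting idea quoted from the README (one vote per recognized tag, then normalization), here is a minimal sketch. It is not dike's actual code: the tag-to-family mapping, the family names, and the function name family_membership are all hypothetical, chosen only to show the shape of the computation.

```python
from collections import Counter

# Hypothetical mapping from antivirus tags to family names.
# These entries are illustrative only and are NOT taken from dike.
TAG_TO_FAMILY = {
    "trj": "trojan",
    "trojan": "trojan",
    "ransom": "ransomware",
    "worm": "worm",
}


def family_membership(tags, families=("trojan", "ransomware", "worm")):
    """Turn a list of antivirus tags into normalized per-family scores."""
    votes = Counter()
    for tag in tags:
        family = TAG_TO_FAMILY.get(tag.lower())
        if family in families:
            votes[family] += 1  # one vote per recognized tag
    total = sum(votes.values())
    if total == 0:
        # No recognized tags: every family gets a zero score.
        return {family: 0.0 for family in families}
    # Normalize so that the scores of all families sum to 1.
    return {family: votes[family] / total for family in families}
```

For example, family_membership(["Trj", "Trj", "Worm", "Generic"]) counts two votes for trojan and one for worm, ignores the unrecognized tag, and divides each count by the three recognized votes.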

iosifache avatar Jun 04 '22 10:06 iosifache

First of all, thank you for your reply.

As a follow-up question, I would like to get exactly the same constants that were used to create the DikeDataset labels, because I'm working on a project to classify malicious code using the labels (malware.csv, benign.csv) from DikeDataset.

To do that, could you tell me the following values?

In the DataFolderScanner class: self._malware_families, self._malicious_benign_votes_ratio, self._min_ignored_percent

They are defined as follows: image

I am sorry for my bad English. Thank you.

recsater avatar Jun 05 '22 18:06 recsater

dike uses a YAML configuration file that contains all the configurable aspects of its functioning. You can find the values you mentioned in the dataset section of the configuration.yaml file.

And I'm glad to hear that these repositories are useful! Please let me know if you have any other questions, I'm happy to help.

iosifache avatar Jun 06 '22 04:06 iosifache