
Dataset format expected

Open shashankg7 opened this issue 4 years ago • 2 comments

Hi,

What is the dataset format expected for multi-label classification?

shashankg7 avatar May 22 '20 10:05 shashankg7

Hi @shashankg7, the dataset format is the fastText data format with a few extensions:

__label__<label 1 name> __label__<label 2 name> __label__<label 3 name...> <word 1> <word2> <word3...>
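As an illustration of the line layout above, here is a minimal Python sketch that builds one line from a list of labels and a list of words. The function name `format_example` and the sample labels and words are hypothetical and not part of extremeText. It also assumes that label names and words contain no whitespace, since whitespace separates the tokens.

```python
def format_example(labels, words):
    """Build one training line: labels first, each prefixed with __label__, then the words."""
    # Assumption: labels and words contain no whitespace, because tokens are space-separated.
    label_part = " ".join(f"__label__{label}" for label in labels)
    word_part = " ".join(words)
    return f"{label_part} {word_part}"


# Hypothetical sample document with two labels.
line = format_example(["sports", "football"], ["great", "match", "tonight"])
print(line)  # __label__sports __label__football great match tonight
```

Writing one such line per document to a text file should give a file in the layout described above.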

It is possible to add a weight to each word by passing the -wordsWeights option and using the following format:

__label__<label 1 name> __label__<label 2 name> __label__<label 3 name...> <word 1>:<word 1 weight> <word2>:<word 2 weight> <word3...>:<word 3 weight...>
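A minimal Python sketch of the weighted layout follows. The function name `format_weighted_example` and the sample values are hypothetical. It assumes the words themselves contain neither whitespace nor a colon, since the colon separates a word from its weight.

```python
def format_weighted_example(labels, word_weights):
    """Build one line with labels followed by word:weight pairs."""
    # Assumption: words contain no whitespace and no ':' because ':' separates word and weight.
    label_part = " ".join(f"__label__{label}" for label in labels)
    word_part = " ".join(f"{word}:{weight}" for word, weight in word_weights)
    return f"{label_part} {word_part}"


# Hypothetical sample: weights could come from TF-IDF or any other scoring scheme.
line = format_weighted_example(
    ["sports"], [("great", 0.5), ("match", 1.5), ("tonight", 0.25)]
)
print(line)  # __label__sports great:0.5 match:1.5 tonight:0.25
```

According to the reply above, a file in this layout is meant to be used together with the -wordsWeights option.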

See the xml_experiments directory for some examples. run_EURLex-4K.sh uses the smallest of all the datasets.

mwydmuch avatar May 23 '20 11:05 mwydmuch

Thanks a lot @mwydmuch for your reply.

I am able to run the code with the format you have described. Thanks!

I have one question. I am trying out your model on a custom multi-label short-text classification task (average text length of ~4 words). The number of labels is on the order of 3.5K.

I am using the 'plt' loss function with the number of dimensions in [200, 300, 500]. I have tried different numbers of epochs and varying char n-gram sizes.

But I am not able to get good results compared to fastText.

Any suggestions as to where I might be going wrong, or what else I could try?

Thanks

shashankg7 avatar May 25 '20 20:05 shashankg7