XML-CNN: What would be the format of the input dataset?

Hi there,

I am interested in trying XML-CNN on my own dataset. I have a collection of documents and their labels. Could you please help me understand how I can feed them to your tool? Sample files would also be helpful. I tried to go through the RCV file you mentioned in the README, but its format is not really clear to me. Thanks.

Feb 05 '19 21:02 negacy

Hi Negacy

The dataset is a Python pickle file. Loading it gives a list of 4 items. The 1st and 2nd are the training and test sets; each is a list of instances, and each instance is a dictionary with the keys ('text', 'num_words', 'split', 'catgy', 'Id'). The third item is a vocabulary dictionary whose keys are all the words in the dataset and whose values are the corresponding serial numbers for those words. The last item is the set of labels in the dataset, but it is not used. For more help, have a look at https://github.com/siddsax/XML-CNN/blob/master/utils/data_helpers.py and trace back from the load_data function.
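To make the layout concrete, here is a minimal sketch that writes a toy pickle with the four items described above and loads it back. The field values (a raw string for 'text', integer label ids for 'catgy', the file name) are assumptions for illustration only; check load_data in data_helpers.py for what the code really expects.

```python
import os
import pickle
import tempfile

# Toy stand-in for the real dataset file. Only the key names come from the
# description above; every value type here is an assumption.
train = [{"text": "this is first document", "num_words": 4,
          "split": "train", "catgy": [0, 1], "Id": "doc-1"}]
test = [{"text": "this is second document", "num_words": 4,
         "split": "test", "catgy": [1], "Id": "doc-2"}]
vocab = {"this": 0, "is": 1, "first": 2, "second": 3, "document": 4}
labels = {"label-1": 0, "label-2": 1}  # fourth item, reportedly unused

# Write the four items as one pickled list (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.p")
with open(path, "wb") as fout:
    pickle.dump([train, test, vocab, labels], fout)

# Load it back and unpack the four items in order.
with open(path, "rb") as fin:
    loaded_train, loaded_test, loaded_vocab, loaded_labels = pickle.load(fin)

print(sorted(loaded_train[0].keys()))
print(len(loaded_train), len(loaded_test), len(loaded_vocab))
```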

Feb 06 '19 00:02 siddsax

Thanks, @siddsax:

So, what I have is a list of documents and their labels; for example, ['this is first document', ['label-1', 'label-2', 'label-99']], ['this is second document', 'label-7'], etc. I think I need to modify lines 107 and 108 of the script. But what would be the values of m and n? Is m the maximum number of tokens in a document, and n the maximum number of labels assigned to a document in the training/testing sets?
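For reference, a rough sketch of how data shaped like this could be mapped onto the dictionary keys described earlier in the thread. The helper name build_instances, the whitespace tokenization, the integer label ids, and the 'Id' format are all assumptions for illustration, not the tool's confirmed requirements.

```python
def build_instances(pairs, split, vocab, label_ids):
    """Convert [text, labels] pairs into dicts with the five described keys.

    Hypothetical helper: vocab and label_ids are updated in place so the
    same mappings can be shared between the training and test splits.
    """
    instances = []
    for i, (text, labels) in enumerate(pairs):
        # Accept a single label given as a bare string.
        if isinstance(labels, str):
            labels = [labels]
        tokens = text.split()  # assumed tokenization: plain whitespace
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
        # Assumed encoding: each label becomes an integer id.
        catgy = [label_ids.setdefault(lab, len(label_ids)) for lab in labels]
        instances.append({
            "text": text,
            "num_words": len(tokens),
            "split": split,
            "catgy": catgy,
            "Id": f"{split}-{i}",
        })
    return instances


train_pairs = [
    ["this is first document", ["label-1", "label-2", "label-99"]],
    ["this is second document", "label-7"],
]
vocab, label_ids = {}, {}
train = build_instances(train_pairs, "train", vocab, label_ids)
test = build_instances([], "test", vocab, label_ids)

# Same four-item order as described: train, test, vocabulary, labels.
dataset = [train, test, vocab, label_ids]
print(train[1])
```

The resulting `dataset` list could then be written with `pickle.dump`, assuming the value types match what load_data expects.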

Feb 12 '19 22:02 negacy