XML-CNN
What would be the format of the input dataset?
Hi there,
I am interested in trying XML-CNN on my own dataset. I have a collection of documents and their labels. Could you please help me understand how I can feed them to your tool? A sample file would also be helpful. I tried to go through the RCV file mentioned in the README, but its format is not really clear. Thanks.
Hi Negacy
The dataset is a Python pickle file. Once loaded, it contains four items. The first and second are the training and test sets; each is a list whose instances are dictionaries with the keys ('text', 'num_words', 'split', 'catgy', 'Id').
The third item is a vocabulary dictionary whose keys are all the words in the dataset and whose values are the corresponding serial numbers for those words. The last item is the set of labels in the dataset, but it is not used. For more help, have a look at https://github.com/siddsax/XML-CNN/blob/master/utils/data_helpers.py and trace back from the load_data function.
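For illustration, here is a minimal sketch of building and round-tripping a pickle with the layout described above. Only the five key names and the order of the four items come from this thread. The helper `build_dataset`, the whitespace tokenization, the `Id` format, and the choice to store integer label indices under 'catgy' are assumptions, so check them against load_data before relying on them.

```python
import pickle


def build_dataset(train_docs, test_docs):
    """Build [train, test, vocab, labels] from (text, labels) pairs.

    Hypothetical helper: value types are assumptions, not the repo's spec.
    """
    vocab = {}   # word -> serial number
    labels = {}  # label name -> index

    def to_instances(docs, split):
        instances = []
        for i, (text, doc_labels) in enumerate(docs):
            words = text.split()  # assumed whitespace tokenization
            for word in words:
                vocab.setdefault(word, len(vocab))
            for label in doc_labels:
                labels.setdefault(label, len(labels))
            instances.append({
                "text": text,
                "num_words": len(words),
                "split": split,
                "catgy": [labels[label] for label in doc_labels],
                "Id": f"{split}-{i}",  # assumed Id format
            })
        return instances

    train = to_instances(train_docs, "train")
    test = to_instances(test_docs, "test")
    return [train, test, vocab, sorted(labels, key=labels.get)]


dataset = build_dataset(
    [("this is first document", ["label-1", "label-2"])],
    [("this is second document", ["label-7"])],
)

# Round-trip through pickle, as a file written with pickle.dump would.
train, test, vocab, labels = pickle.loads(pickle.dumps(dataset))
print(train[0])
```

To write a real file, replace the in-memory round trip with `pickle.dump(dataset, open(path, "wb"))`, where `path` is whatever location your copy of the script reads from.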
Thanks, @siddsax:
So, what I have is a list of documents and their labels; for example, ['this is first document', ['label-1', 'label-2', 'label-99']], ['this is second document', 'label-7'], ... etc. I think I need to modify lines 107 and 108 of the script. But what would the values of m and n be? Is m the maximum number of tokens in a document, and n the maximum number of labels assigned to a document in the training/test sets?
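For reference, the two quantities asked about here can be computed directly from a list shaped like the example. This is only a sketch: whether they correspond to m and n in the script is not confirmed in this thread, and the `normalize` helper and whitespace tokenization are assumptions.

```python
docs = [
    ["this is first document", ["label-1", "label-2", "label-99"]],
    ["this is second document", "label-7"],
]


def normalize(doc):
    """Return (text, labels) with labels always a list (hypothetical helper)."""
    text, labels = doc
    if isinstance(labels, str):
        labels = [labels]  # a single label given as a bare string
    return text, list(labels)


pairs = [normalize(doc) for doc in docs]

# Candidate values only; their relation to m and n is unconfirmed.
max_tokens = max(len(text.split()) for text, _ in pairs)
max_labels = max(len(labels) for _, labels in pairs)
print(max_tokens, max_labels)
```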