acl2020-transition-discontinuous-ner
dev.id, train.id, test.id
Hi, I am using your code for discontinuous and overlapping NER on another dataset, and I cannot figure out what these three files are: dev.id, train.id, and test.id. I know they are used to split the dataset into train, dev, and test sets, but I would like to know how you produced these .id files.
Hi,
Here is the description from our paper:
As CADEC does not have an official train-test split, we follow Metke-Jimenez and Karimi (2016) and randomly assign 70% of the posts as the training set, 15% as the development set, and the remaining posts as the test set.
I basically create a list of all post ids, randomly shuffle it, and then take the first 70% as training, the next 15% as dev, and the remaining posts as test.
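For anyone building the same split on another dataset, the procedure described above could be sketched roughly as follows. This is not the original script: the function names, the fixed seed, and the assumption that each .id file holds one post id per line are all illustrative guesses.

```python
import random


def split_post_ids(post_ids, train_frac=0.70, dev_frac=0.15, seed=0):
    """Shuffle post ids and cut them into train/dev/test lists.

    The seed and the fraction arguments are assumptions for this sketch;
    the thread only states a random 70/15/15 split.
    """
    ids = sorted(post_ids)              # fixed starting order, so the seed is reproducible
    random.Random(seed).shuffle(ids)    # shuffle in place with a private RNG
    n_train = int(len(ids) * train_frac)
    n_dev = int(len(ids) * dev_frac)
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]        # whatever remains goes to test
    return train, dev, test


def write_ids(path, ids):
    """Write one id per line (assumed .id file layout, not confirmed)."""
    with open(path, "w", encoding="utf-8") as f:
        for post_id in ids:
            f.write(post_id + "\n")
```

Calling `split_post_ids` on the list of post ids and passing each returned list to `write_ids` with the names train.id, dev.id, and test.id would give three disjoint id files. A different seed gives a different split, so results will not match the paper's exact partition.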
Does this answer help?
Yes, I downloaded the CADEC dataset and now it makes sense. But I have one more issue. When I run split_train_test.py to create train.txt, dev.txt, and test.txt on the CADEC dataset, I get the error below. The mentions are always an empty string, so the assert fails. I successfully built the rest of the files: ann, tokens, tokens.ann, and text-inline.
I am not sure what I am doing wrong. Can you share the files so I can see what they should look like? In the screenshot below, line 38 of the code doesn't make sense to me, and I guess it is causing the problem: there is nothing like "Document: " in the text-inline file, so calling the "next" function loads the mentions line into tokens, and mentions then stays empty because the reader has reached the end of the record.
These are the files that I have generated so far. Archive.zip

Hi, just to make sure: are you using https://github.com/daixiangau/acl2020-transition-discontinuous-ner/blob/master/data/cadec/build_data_for_transition_discontinuous_ner.sh?
If I remember correctly, when you run convert_text_inline.py you can set whether to add the "Document:" line. The default should be no_doc_info = False, which means the line is added.
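To illustrate why a missing "Document:" line would leave the mentions empty, here is a small sketch of a reader that always skips a header line. The record layout (optional header, a token line, a mention line, a blank line) and the function name `read_records` are hypothetical and are not taken from split_train_test.py.

```python
def read_records(lines, expect_doc_header=True):
    """Yield (tokens, mentions) pairs from a hypothetical inline layout.

    Assumed layout per record: an optional "Document: <id>" line,
    a token line, a mention line, then a blank separator line.
    """
    it = iter(lines)
    for line in it:
        line = line.strip()
        if not line:
            continue                      # skip blank separator lines
        if expect_doc_header:
            # The current line is treated as the header, so the tokens
            # are read from the line that follows it.
            tokens = next(it, "").strip()
        else:
            tokens = line
        mentions = next(it, "").strip()
        yield tokens, mentions


# A record written without the header line
no_header = ["I have pain", "2,2 ADR", ""]

# Reader expects a header: the token line is consumed as the header,
# the mention line lands in tokens, and mentions comes out empty.
shifted = list(read_records(no_header, expect_doc_header=True))

# Reader told there is no header: fields line up again.
aligned = list(read_records(no_header, expect_doc_header=False))
```

Here `shifted` is `[("2,2 ADR", "")]` and `aligned` is `[("I have pain", "2,2 ADR")]`. If the real files follow a similar layout, regenerating text-inline with the "Document:" line included should keep the reader and the file in step.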