[dataset] CoNLL2003 wrapper
Write a wrapper for the CoNLL2003 dataset. Annotate the license, experiment type and language. Give provenance. Update https://github.com/AKSW/gerbil/wiki/Licences-for-datasets
Is there an open version of the CoNLL2003 dataset? I have a CoNLL2009 NIF converter that I could adapt if the data is a plain-text CoNLL file.
@giusepperizzo @der-bruemmer I do not see a licence at http://www.cnts.ua.ac.be/conll2003/ner/, so I would assume we are good to use it.
According to its README, http://www.cnts.ua.ac.be/conll2003/ner/000README, you need the Reuters CD, which is licensed by NIST. However, there is a filled-out form for the Reuters CD at AKSW.
Conclusion: We can use this dataset but we are not allowed to put it in the public download folder.
@der-bruemmer @RicardoUsbeck I confirm that CoNLL2003 comes with a NIST license. Anyone who signs the agreement can get the corpus and use it, but whoever receives the corpus cannot share it. Hence, we aren't entitled to put it in a public repository, but we can provide pointers to where it can be downloaded and provide tools for parsing it.
@der-bruemmer the CoNLL2003 data looks like this (excerpt):
JAPAN NNP I-NP I-LOC
GET VB I-VP O
LUCKY NNP I-NP O
WIN NNP I-NP O
Does it fit your parser?
I have to adapt the parsing of the columns. I wrote the converter for CoNLL 2009, which was about dependency parsing. I don't have the Reuters data myself, so I can only rewrite the parser using the CoNLL specification. One problem is that if the data is only a single file, it will become a single context / document in NIF, whereas the Reuters data consists of a large number of annotated documents. I currently don't know how to detect document borders in CoNLL files.
@RicardoUsbeck @giusepperizzo where can I acquire the Reuters CD / data?
Just mentioning: This is an issue for milestone 2 after WWW deadline.
Axel (@ngonga) has the CD :)
@der-bruemmer one file for each portion of the corpus, i.e. 1 file for eng.train, 1 file for eng.testa (dev set), 1 file for eng.testb. For the stats, you may check [1] in Table 1. So in each file you find several documents. Borders are \n\n.
[1] http://www.eurecom.fr/~rizzo/publications/Rizzo_Erp-LREC2014.pdf
The Reuters corpus can be obtained free of charge from NIST: http://trec.nist.gov/data/reuters/reuters.html. Another useful resource is, of course, the LDC catalog.
Take care, there are many CoNLL formats! And 2009 is different from 2003. @giusepperizzo gave you the parsing rules for 2003
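For illustration, here is a minimal sketch of how the 2003 rules described above could be read: four whitespace-separated columns per token (word, POS tag, chunk tag, NER tag) and blank lines as block borders. The names `Token` and `parse_conll2003` are hypothetical and not part of GERBIL or any existing converter. The sketch also assumes that lines starting with `-DOCSTART-` mark the beginning of a new document; that marker is not mentioned in this thread, so please check it against the actual files before relying on it.

```python
from typing import List, NamedTuple

# Assumption: document starts are marked by lines beginning with this token.
DOC_MARKER = "-DOCSTART-"


class Token(NamedTuple):
    """One token line: word, POS tag, chunk tag, NER tag (hypothetical type)."""
    word: str
    pos: str
    chunk: str
    ner: str


def parse_conll2003(text: str) -> List[List[List[Token]]]:
    """Parse CoNLL-2003 style text into documents -> blocks -> tokens."""
    documents: List[List[List[Token]]] = [[]]
    block: List[Token] = []

    # The trailing empty string acts as a sentinel that flushes the last block.
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        is_marker = line.startswith(DOC_MARKER)

        if is_marker or not line:
            # A blank line or a document marker closes the current block.
            if block:
                documents[-1].append(block)
                block = []
            # A marker additionally opens a new document.
            if is_marker and documents[-1]:
                documents.append([])
            continue

        columns = line.split()
        if len(columns) != 4:
            raise ValueError(f"expected 4 columns, got {len(columns)}: {line!r}")
        block.append(Token(*columns))

    # Drop documents that ended up empty (e.g. a marker at the very top).
    return [doc for doc in documents if doc]


if __name__ == "__main__":
    # The token lines come from the excerpt in this thread; the marker lines
    # are an assumption added for the example.
    sample = (
        "-DOCSTART- -X- O O\n\n"
        "JAPAN NNP I-NP I-LOC\nGET VB I-VP O\nLUCKY NNP I-NP O\nWIN NNP I-NP O\n\n"
        "-DOCSTART- -X- O O\n\n"
        "JAPAN NNP I-NP I-LOC\n"
    )
    for index, doc in enumerate(parse_conll2003(sample)):
        print(index, [[tok.word for tok in block] for block in doc])
```

Each returned document could then be mapped to its own NIF context instead of one context per file. If the files turn out to have no document markers, the function simply returns a single document containing all blocks.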
Dear @RicardoUsbeck , I couldn't find the dataset. I contacted Prof. Axel regarding the dataset CD but he said that it is not in his possession.
Kindly let me know how to proceed on this.
We are closing this dataset issue until someone provides us with the original Reuters CD.
The CoNLL 2003 dataset is present in numerous GitHub repositories, e.g. in https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003.
You can also download and rebuild it from https://www.clips.uantwerpen.be/conll2003/ner/.
Who guarantees that the data in a random GitHub repository is original and unmodified? The provided repository itself states: "WARNING: there are many tags that differ from the original Conll03 corpus"
To rebuild it from https://www.clips.uantwerpen.be/conll2003/ner, you need the Reuters CD, as Ricardo already mentioned. So you cannot download and rebuild it without the CD.
I do have an unaltered version of the dataset, built from the Reuters CD. Let me know if you need a transfer, provided you can show that you have signed the license agreement. I think having CoNLL 2003 in Gerbil is a must.
We got the dataset and are going to rebuild it. Thanks @rtroncy for helping.