[dataset] CoNLL2003 wrapper

Open RicardoUsbeck opened this issue 11 years ago • 13 comments

Write a wrapper for the CoNLL2003 dataset. Annotate the license, experiment type and language. Give provenance. Update https://github.com/AKSW/gerbil/wiki/Licences-for-datasets

RicardoUsbeck avatar Nov 04 '14 15:11 RicardoUsbeck

Is there a CoNLL2003 dataset that is open? I have a CoNLL2009 NIF Converter I could adapt if it was a plain text conll file.

der-bruemmer avatar Nov 05 '14 08:11 der-bruemmer

@giusepperizzo @der-bruemmer I do not see a licence at http://www.cnts.ua.ac.be/conll2003/ner/, so I would assume we are good to use it.

However, its README (http://www.cnts.ua.ac.be/conll2003/ner/000README) says you need the Reuters CD, which is licensed by NIST. For the Reuters CD there is a filled-out form at AKSW.

Conclusion: We can use this dataset but we are not allowed to put it in the public download folder.

RicardoUsbeck avatar Nov 05 '14 08:11 RicardoUsbeck

@der-bruemmer @RicardoUsbeck I confirm that CoNLL2003 comes with a NIST license. Everybody who signs the agreement can get the corpus and use it. Obviously, whoever receives the corpus cannot share it. Hence, we aren't entitled to put it in a public repository, but we can provide pointers to where it can be downloaded, and provide tools for parsing it.

@der-bruemmer the CoNLL2003 looks like (excerpt):

JAPAN NNP I-NP I-LOC
GET VB I-VP O
LUCKY NNP I-NP O
WIN NNP I-NP O

Does it fit your parser?
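As a rough illustration of the excerpt above, each token line can be read as four whitespace-separated columns: token, part-of-speech tag, chunk tag, and named-entity tag. The sketch below is only an assumption-based example in Python; the names `Token` and `parse_token_line` are hypothetical and are not part of GERBIL or of any existing converter.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One CoNLL2003 token line (hypothetical helper type)."""
    form: str   # surface form, e.g. "JAPAN"
    pos: str    # part-of-speech tag, e.g. "NNP"
    chunk: str  # syntactic chunk tag, e.g. "I-NP"
    ner: str    # named-entity tag, e.g. "I-LOC" or "O"


def parse_token_line(line: str) -> Token:
    """Split one token line into its four columns."""
    parts = line.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 columns, got {len(parts)}: {line!r}")
    return Token(*parts)


# The four token lines from the excerpt in this thread.
excerpt = """JAPAN NNP I-NP I-LOC
GET VB I-VP O
LUCKY NNP I-NP O
WIN NNP I-NP O"""

tokens = [parse_token_line(line) for line in excerpt.splitlines()]
entities = [t.form for t in tokens if t.ner != "O"]
```

Here `entities` collects the tokens whose last column is not `O`, which is the information a NIF converter would need to turn into annotations.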

giusepperizzo avatar Nov 05 '14 08:11 giusepperizzo

I have to adapt the parsing of columns. I wrote it for CoNLL 2009, which was dependency parsing. I don't have the Reuters data myself, so I can only rewrite the parser using the CoNLL specification. One problem is that if the data is only a single file, it will be a single context / document in NIF, whereas the Reuters data will be a large number of annotated documents. I currently don't know how to detect document borders in CoNLL files.

@RicardoUsbeck @giusepperizzo where can I acquire the Reuters CD / data?

der-bruemmer avatar Nov 05 '14 09:11 der-bruemmer

Just mentioning: This is an issue for milestone 2 after WWW deadline.

Axel (@ngonga) has the CD :)

RicardoUsbeck avatar Nov 05 '14 09:11 RicardoUsbeck

@der-bruemmer one file for each portion of the corpus, i.e. 1 file for eng.train, 1 file for eng.testa (dev set), 1 file for eng.testb. For the stats, you may check [1] in Table 1. So in each file you find several documents. Borders are \n\n.

[1] http://www.eurecom.fr/~rizzo/publications/Rizzo_Erp-LREC2014.pdf
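A minimal sketch of how one such file could be split so that each document becomes its own NIF context, written in Python purely for illustration. It rests on two assumptions that should be checked against the actual data: blank lines are treated as sentence borders, and a line starting with `-DOCSTART-` is treated as the start of a new document. The function name `split_documents` and the sample text are hypothetical; the sample is synthetic and only reuses tokens from the excerpt earlier in this thread.

```python
from typing import List, Tuple

Token = Tuple[str, str, str, str]   # (form, pos, chunk, ner)
Sentence = List[Token]
Document = List[Sentence]


def split_documents(text: str, doc_marker: str = "-DOCSTART-") -> List[Document]:
    """Group token lines into sentences and sentences into documents."""
    documents: List[Document] = []
    current = None  # sentence being filled, or None between sentences
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(doc_marker):
            documents.append([])      # assumed document border
            current = None
        elif not line:
            current = None            # blank line ends the current sentence
        else:
            if not documents:
                documents.append([])  # file without a leading marker
            if current is None:
                current = []
                documents[-1].append(current)
            form, pos, chunk, ner = line.split()
            current.append((form, pos, chunk, ner))
    # Drop documents that ended up with no sentences.
    return [doc for doc in documents if doc]


# Synthetic sample, arranged only to exercise the function.
sample = """-DOCSTART- -X- O O

JAPAN NNP I-NP I-LOC
GET VB I-VP O

-DOCSTART- -X- O O

LUCKY NNP I-NP O
WIN NNP I-NP O
"""

docs = split_documents(sample)
```

If the files turn out to have no marker lines, everything is returned as a single document, which matches the single-context concern raised above.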

giusepperizzo avatar Nov 05 '14 10:11 giusepperizzo

The Reuters corpus can be obtained free of charge from NIST: http://trec.nist.gov/data/reuters/reuters.html. Another useful resource is of course the LDC catalog.

Take care, there are many CoNLL formats, and 2009 is different from 2003. @giusepperizzo gave you the parsing rules for 2003.

rtroncy avatar Nov 05 '14 10:11 rtroncy

Dear @RicardoUsbeck , I couldn't find the dataset. I contacted Prof. Axel regarding the dataset CD but he said that it is not in his possession.

Kindly let me know how to proceed on this.

nikit-srivastava avatar Mar 30 '18 10:03 nikit-srivastava

We will close this dataset issue until someone provides us with the original Reuters CD.

RicardoUsbeck avatar Mar 30 '18 10:03 RicardoUsbeck

The CoNLL 2003 dataset is present in numerous GitHub repositories, e.g. in https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003.

You can also download and rebuild it from https://www.clips.uantwerpen.be/conll2003/ner/.

rtroncy avatar Mar 30 '18 11:03 rtroncy

Who guarantees that the data in a random GitHub repository is original and not modified? The provided repository itself states: "WARNING: there are many tags that differ from the original Conll03 corpus"

To rebuild from https://www.clips.uantwerpen.be/conll2003/ner, you need the Reuters CD, as Ricardo already mentioned. So you cannot download and rebuild it without the CD.

cO68Iy avatar Apr 02 '18 15:04 cO68Iy

I do have an unaltered version of the dataset, built from the Reuters CD. Let me know if you need a transfer, provided you can show that you have signed the license agreement. I think having CoNLL 2003 in Gerbil is a must.

rtroncy avatar Apr 02 '18 18:04 rtroncy

We got the dataset and are going to rebuild it. Thanks @rtroncy for helping.

RicardoUsbeck avatar Apr 11 '18 06:04 RicardoUsbeck