
FR corpus

CatarinaPC opened this issue on Sep 30 '19 · 8 comments

I have noticed a particularity regarding the French corpus.

Only the prefix I- is used to tag named entities. Is this because, in fact, there is no need for the B- prefix (when there are two different but adjacent named entities)?

I have noticed this in the IOB file in this repository, but also on https://lab.kb.nl/dataset/europeana-newspapers-ner, where instead of a single IOB file there are multiple .tag files (and .not files) with the same issue.

If possible, I would also like to know what the difference is between the IOB file in this repository and the .tag files on https://lab.kb.nl/dataset/europeana-newspapers-ner.

CatarinaPC · Sep 30 '19

Apologies for the late reply!

Only the prefix I- is used to tag named entities. Is this because, in fact, there is no need for the B- prefix (when there are two different but adjacent named entities)?

In fact, the corpus for French NER was produced by a different partner than the corpora for Dutch/German, and the French partner used the IO classification scheme for exactly this reason. However, we have been harmonizing all corpora to use the BIO scheme; I simply need to update and push the changes to the v0.2 branch, hopefully soon.

I have noticed this in the IOB file in this repository, but also on https://lab.kb.nl/dataset/europeana-newspapers-ner, where instead of a single IOB file there are multiple .tag files (and .not files) with the same issue. If possible, I would also like to know what the difference is between the IOB file in this repository and the .tag files on https://lab.kb.nl/dataset/europeana-newspapers-ner.

The data on https://lab.kb.nl/dataset/europeana-newspapers-ner is the raw data originally produced by the French institute and is hosted there for transparent provenance. For this repository, we have already largely harmonized the corpora for all languages to use the same format.

cneud · Nov 15 '19

@CatarinaPC FYI, I just pushed the changes converting enp_FR from the IO to the BIO scheme to the v0.2 branch.

cneud · Nov 18 '19

#1

the French partner used the IO classification scheme only for exactly this reason

What do you mean by "exactly this reason"? The partner used the IO classification because B- prefixes were not necessary? Can you explain this further?

#2 - I saw the new changes and I think you are using BIO-2, where the Begin prefix (B-) is used for the first word of each entity. Is that correct? (In BIO-1, the Begin tag is used for the first word only if it directly follows an entity of the same type.)

#3 - How did you make the conversion? Was it automatic, or did you have to do it manually? I'm asking because of adjacent entities of the same type: with the IO scheme, subsequent entities of the same type cannot be distinguished.

#4 - What are the different columns in the file? (there are two columns with tags)

#5 - I see that there is a wiki where you explain "data quality issues and instructions to clean up the data". Is everything explained there valid for the French dataset?

#6 - What changes are implemented in the v0.2 branch that are not in the master branch (besides the BIO conversion)?

I'm sorry for so many questions. I hope you can answer some of them. Thank you, Catarina

CatarinaPC · Nov 20 '19

@CatarinaPC No worries, thank you for the questions.

Ad 1) What I meant is that the French partner used IO because they deemed it sufficient, but in my understanding this can cause problems when two entities of the same type (e.g. PER) appear in direct sequence. It was therefore always something I wanted to rectify, also for consistency with the other languages.

Ad 2) TBH, I was not aware of the distinction between BIO-1/BIO-2. Do you have any references for that? I found it quite difficult to find a proper publication/documentation for BIO. But yes, based on what you write above, we definitely use the BIO-2 scheme, i.e. we always tag the first token of an entity span with B-*.
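To make the difference concrete, here is a constructed example (not taken from the corpus) with two adjacent PER entities: in IO the two names merge into a single span, BIO-1 marks only the second name with B- because it directly follows an entity of the same type, and BIO-2 marks both entity-initial tokens with B-.

```
Token     IO      BIO-1    BIO-2
Jean      I-PER   I-PER    B-PER
Dupont    I-PER   I-PER    I-PER
Marie     I-PER   B-PER    B-PER
Curie     I-PER   I-PER    I-PER
```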

Ad 3) I made the conversion with my favorite editor in one hand and various regular expressions in the other. Basically, I walked through the whole file in a semi-automated way, paying particular attention to sequences of entities, which can be ambiguous. While I tried to take care of the tricky cases, I cannot promise that no errors were introduced. Going through the file, I also found several annotations that seemed wrong to me but which I did not correct, as I see this as a separate quality control step that should happen in the future.
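For what it's worth, the purely mechanical part of such a conversion can be sketched roughly like this (a simplified Python illustration, not the exact regex procedure I used; note that it cannot recover a boundary between two adjacent same-type entities, which is exactly what needed manual review):

```python
def io_to_bio2(tags):
    """Convert a sequence of IO tags (e.g. 'O', 'I-PER') to BIO-2.

    A token opens a new entity (and gets 'B-') whenever the previous token
    was 'O' or carried a different entity type. Adjacent entities of the
    *same* type collapse into one span here - those cases need manual review.
    """
    converted = []
    prev_type = None
    for tag in tags:
        if tag == "O":
            converted.append("O")
            prev_type = None
        else:
            ent_type = tag.split("-", 1)[1]
            prefix = "I" if ent_type == prev_type else "B"
            converted.append(f"{prefix}-{ent_type}")
            prev_type = ent_type
    return converted


# Example: "à Paris" followed directly by a person name
print(io_to_bio2(["O", "I-LOC", "I-PER", "I-PER"]))
# ['O', 'B-LOC', 'B-PER', 'I-PER']
```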

Ad 4) The TSV files in the v0.2 branch adopt the data format used by the GermEval2014 Named Entity Recognition Shared Task, which also supports embedded entities; these are encoded in the second NE column. See also the "simple" format described here.
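A rough sketch of how such a file can be read (assuming tab-separated rows with the token in the third-from-last column and the last two columns holding the outer and embedded NE tags; please check this against the actual v0.2 files):

```python
import csv

def read_sentences(path):
    """Yield sentences as lists of (token, outer_tag, inner_tag) tuples.

    Assumes tab-separated rows, '#' comment lines, blank lines as sentence
    boundaries, and the token in the third-from-last column (so an optional
    leading token index, as in the original GermEval files, is tolerated).
    """
    sentence = []
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if not row or not row[0].strip():
                # blank line = sentence boundary
                if sentence:
                    yield sentence
                    sentence = []
            elif row[0].startswith("#") or len(row) < 3:
                # comment/source line or malformed row - skip
                continue
            else:
                token, outer, inner = row[-3], row[-2], row[-1]
                sentence.append((token, outer, inner))
    if sentence:
        yield sentence
```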

Ad 5) Regarding the wiki, I haven't had the time to update that in a while, but basically the information in Corpus-cleanup is still current. I am only working with the German dataset at the moment, and the IO to BIO conversion was the only real change in the French data done since v0.1.

Ad 6) No changes were made in the v0.2 branch for the French data other than the conversion to the GermEval TSV format and the change from IO to BIO(-2). In the longer run, it would be interesting to correct tokenization/sentence boundaries, fix the OCR errors, and quality-control the tagging. Help with any of the above is always welcome ;-)

I hope this answered your questions, feel free to follow up.

cneud · Nov 21 '19

Thank you so much for your answers!

Ad 2) TBH, I was not aware of the distinction between BIO-1/BIO-2. Do you have any references for that? I found it quite difficult to find a proper publication/documentation for BIO. But yes, based on what you write above, we definitely use the BIO-2 scheme, i.e. we always tag the first token of an entity span with B-*.

I've also found it difficult to find a proper reference for BIO encoding, and for tagging schemes in general. I was looking at this website: https://donovanong.github.io/ner/tagging-scheme-for-ner.html. They mention a paper when referring to BIO encoding, Ramshaw and Marcus (1995). Do you think this is it?

Ad 4) The TSV files in the v0.2 branch adopt the data format used by the GermEval2014 Named Entity Recognition Shared Task, which also supports embedded entities; these are encoded in the second NE column. See also the "simple" format described here.

For this particular data, there are no embedded entities, right?

I'm using different libraries to train different models, and my goal is to evaluate these models and discover which one is the best. I am also combining several French datasets I found to train the models. For this reason, I had to choose a standard format, and I chose the CoNLL-2003 format, simply because it was the one I had seen used most, which then led me to use BIO-1 (https://www.clips.uantwerpen.be/conll2003/ner/).
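For reference, the CoNLL-2003 data files use four columns per token (word, POS tag, syntactic chunk tag, NE tag); a small constructed illustration:

```
Ekeus    NNP  I-NP  I-PER
heads    VBZ  I-VP  O
for      IN   I-PP  O
Baghdad  NNP  I-NP  I-LOC
.        .    O     O
```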

This leads to a question, and sorry if this is already too off topic: do you know of any NER shared task for French? I would like to evaluate the different models I trained, and also compare them to some pre-trained models, using a test set completely different from what the models were trained on. Do you know of anything, or do you have any tips?

Going through the file, I also found several annotations that seemed wrong to me but which I did not correct, as I see this as a separate quality control step that should happen in the future.

I too found some errors. I came to the conclusion that there are a lot of periods/dots annotated as entities. So I ran the regular expression ^.\tI through the file (in the master branch); this finds all periods/dots identified as entities. Some examples:

  • On line 8080, a dot is identified as I-LOC followed by a PER entity:

```
Al       O
.        I-LOC
J.       I-PER
M.       I-PER
Dauguet  I-PER
,        O
```

  • On line 10636, a dot is also identified as an I-LOC:

```
à     O
Di    I-LOC
.     I-LOC
nard  I-LOC
,     O
```

It should be just Dinard identified as I-LOC.

Looking at the file, I've found that this is common and caused by OCR errors. To fix them, I track down the original page and newspaper in which the error occurs, look at the images of the newspaper, and correct the mistake. But this is very cumbersome. Just thought I'd leave this here.
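In case it is useful, here is a small script along the same lines (my own sketch, assuming the two-column token/tag layout of the master-branch file):

```python
import re
import sys

# Print the line numbers of rows whose token is a single '.' but which carry
# an entity tag - the same idea as the ^.\tI regular expression above,
# with the dot escaped and both B-/I- prefixes covered.
pattern = re.compile(r"^\.\t[BI]-")

with open(sys.argv[1], encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        if pattern.match(line):
            print(lineno, line.rstrip())
```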

CatarinaPC · Nov 22 '19

I've also found it difficult to find a proper reference for BIO encoding, and for tagging schemes in general. I was looking at this website: https://donovanong.github.io/ner/tagging-scheme-for-ner.html. They mention a paper when referring to BIO encoding, Ramshaw and Marcus (1995). Do you think this is it?

Yes, Ramshaw & Marcus (1995) seems to be the most frequently used reference, so I also used it in the README.md. Thank you for the other link, I hadn't seen that one before!

For this particular data, there are no embedded entities, right?

At the moment, none of the datasets here contain annotations of embedded entities yet, but they certainly contain embedded entities! The original version simply did not foresee this, so it was not part of the annotation, whereas now we want to add it iteratively. We will soon start another annotation campaign for the German data where we will annotate embedded entities.
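For concreteness, a constructed example (not from the corpus) of how an embedded entity would be encoded, with the outer entity span in the first NE column and the nested entity in the second:

```
Université   B-ORG   O
de           I-ORG   O
Paris        I-ORG   B-LOC
```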

I'm using different libraries to train different models, and my goal is to evaluate these models and discover which one is the best.

I'm curious! What libraries are you using? We currently have very good results training BERT, cf. https://github.com/qurator-spk/sbb_ner and the publication here.

Do you know of any NER shared task for French? I would like to evaluate the different models I trained, and also compare them to some pre-trained models, using a test set completely different from what the models were trained on. Do you know of anything, or do you have any tips?

The next Shared Task that I am aware of in this area would probably be https://impresso.github.io/CLEF-HIPE-2020/.

Looking at the file, I've found that this is common and caused by OCR errors. To fix them, I track down the original page and newspaper in which the error occurs, look at the images of the newspaper, and correct the mistake. But this is very cumbersome. Just thought I'd leave this here.

Indeed the correction, especially of segmentation errors, is very cumbersome! What we have in mind is a complete reprocessing of the source data, as OCR quality has improved dramatically since the data was originally produced (thanks Deep Learning!). The challenge that then remains is to map the existing annotations to a new OCR output and, finally, go through the annotations once more manually to correct errors/add embedded entities.
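One conceivable approach for that mapping step (purely a sketch of my own, not something we have implemented) is character-level alignment of the old and new OCR text, transferring each annotated span to its aligned offsets:

```python
from difflib import SequenceMatcher

def map_spans(old_text, new_text, spans):
    """Map (start, end, label) character spans from old_text onto new_text
    via sequence alignment. Spans that do not fall entirely inside one
    matching block are dropped and should be reviewed manually."""
    blocks = SequenceMatcher(None, old_text, new_text, autojunk=False).get_matching_blocks()
    mapped = []
    for start, end, label in spans:
        for a, b, size in blocks:
            if a <= start and end <= a + size:
                shift = b - a
                mapped.append((start + shift, end + shift, label))
                break
    return mapped


old = "à Di.nard , chez M. Dauguet"   # hypothetical original OCR with a stray '.'
new = "à Dinard, chez M. Dauguet"     # hypothetical re-OCRed text
print(map_spans(old, new, [(20, 27, "PER")]))   # 'Dauguet' -> [(18, 25, 'PER')]
```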

cneud · Nov 22 '19

I'm curious! What libraries are you using? We currently have very good results training BERT, cf. https://github.com/qurator-spk/sbb_ner and the publication here.

I'm using Flair, SpaCy, AllenNLP and NeuroNER. Flair and SpaCy both have pre-trained models for French.

Besides that, I'm also going to experiment with training on different data. That is why I asked you about French NER shared tasks: it would be good to have a test set not seen by either the models I am training or the pre-trained models the libraries make available.

I remembered two other related questions that I hope you can answer.

  1. What do you give as input to the model? Sentences?
  2. What do you know about the tokenization performed in the French dataset?

I ask this because I will go through this dataset and try to obtain sentences to give to the models for training. I thought about finding the sentences by locating periods "." that end sentences, i.e., rows in the TSV file that have only "." in the token text column.

I already know this will be a problem because of OCR errors that insert random "." characters into the text. Take the excerpt I gave as an example in my previous comment:

```
à     O
Di    I-LOC
.     I-LOC
nard  I-LOC
,     O
```

That "." would be considered the end of a sentence but it would be wrong.

CatarinaPC · Nov 29 '19

That is why I asked you about French NER shared tasks

Train/dev sets (including French) for the CLEF-HIPE-2020 shared task are now being released.

What do you give as input to the model? Sentences?

For the supervised training, we used the SoMaJo tokenizer for sentence splitting, which appears to work really well for German, but we have not manually reviewed the whole dataset yet. We also have not tested any tokenizers for other languages.
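A minimal usage sketch (based on the SoMaJo documentation rather than our exact pipeline; note that SoMaJo only provides German and English models, so it does not directly apply to the French data):

```python
from somajo import SoMaJo

# Load the German web/CMC model with sentence splitting enabled
tokenizer = SoMaJo("de_CMC", split_sentences=True)

paragraphs = ["Das ist ein Satz. Und hier folgt noch einer."]
for sentence in tokenizer.tokenize_text(paragraphs):
    print([token.text for token in sentence])
```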

What do you know about the tokenization performed in the French dataset?

I checked again and found that you can still get the raw data from the French researchers here, which includes some documentation on the workflow used to produce it. Unfortunately, there is no sentence splitting in these files either.

cneud · Mar 12 '20