Coreference issue?
Hi, I looked at your coreference dataset for some training (coref/ontogum/conll) and found what may be inconsistencies in the coreference annotation. I only looked at the biography files, but I often noticed that the protagonist of a biography is referred to with two different entity IDs (e.g. Holt is referred to as both 1 and 4). Is this annotation expected? (I am a newbie in the field of coreference resolution.)
Best, Marco
Hi Marco, I'm adding @yilunzhu who created the OntoGUM version of the annotations, which is described in this paper in detail.
Looking at the document you're referring to (GUM_bio_holt), the original annotations are actually correct and there is only one group for Holt. If you are using the conllu format, where there are IDs like 1 and 4, you can see this here:
https://github.com/amir-zeldes/gum/blob/master/coref/ontogum/conllu/GUM_bio_holt.conllu
The IDs 1 and 4 do appear in the OntoGUM version, and I think I agree that they are wrong, but it requires some background to explain them: OntoGUM is an attempt to convert the GUM data to follow the OntoNotes coreference resolution scheme, which is quite different from GUM's native annotations. The differences are covered in detail in this paper and more briefly in this one.
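To check which IDs a name ends up with, a minimal sketch like the one below may help. It assumes the coreference column uses the CoNLL-2012 bracket notation (`(1` opens a mention, `1)` closes it, `(1)` is a one-token mention, `|` separates several markers, `-` means none). The function name `read_mentions` and the toy rows are made up for illustration and are not taken from the corpus files.

```python
import re
from collections import defaultdict

def read_mentions(rows):
    """Collect mention strings per entity ID from (token, coref) pairs.

    Assumes CoNLL-2012-style bracket notation in the coref field.
    """
    tokens = [tok for tok, _ in rows]
    open_starts = defaultdict(list)  # entity ID -> stack of start indices
    mentions = defaultdict(list)     # entity ID -> list of mention strings
    for i, (_, coref) in enumerate(rows):
        if coref in ("-", "_", ""):
            continue
        for part in coref.split("|"):
            m = re.fullmatch(r"(\()?(\d+)(\))?", part)
            if m is None:
                continue  # skip anything that is not a bracket marker
            opens, ent, closes = m.group(1), m.group(2), m.group(3)
            if opens:
                open_starts[ent].append(i)
            if closes and open_starts[ent]:
                start = open_starts[ent].pop()
                mentions[ent].append(" ".join(tokens[start:i + 1]))
    return dict(mentions)

# Invented toy rows (not real corpus data) mimicking a split chain.
rows = [
    ("Holt", "(1)"), ("played", "-"), ("as", "-"),
    ("a", "(4"), ("second", "-"), ("baseman", "4)"),
    (".", "-"), ("He", "(1)"),
]
for entity_id, spans in read_mentions(rows).items():
    print(entity_id, spans)
```

Listing the mentions per ID this way makes it easy to spot a person whose mentions are spread over more than one ID.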
The specific difference that causes the chain for "Holt" to be split in two is that OntoNotes forbids indefinite mentions from having antecedents. According to the OntoNotes guidelines, the following text has three separate chains with different IDs x/y/z for 'parents':
[Parents]x should be involved with their children's education
at home, not in school. [They]x should see to it that [their]x
kids don't play truant; [they]x should make certain that the
children spend enough time doing homework; [they]x should
scrutinize the report card. [Parents]y are too likely to blame
schools for the educational limitations of [their]y children. If
[parents]z are dissatisfied with a school, [they]z should have
the option of switching to another.
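The effect on a chain can be sketched in a few lines. This is a simplification, not the OntoGUM conversion code: it assumes each mention already carries an `is_indefinite` flag (here the bare plural "parents" is flagged by hand), and the function name `split_at_indefinites` is hypothetical.

```python
def split_at_indefinites(chain):
    """Split one chain into sub-chains, starting a new sub-chain at every
    mention flagged as indefinite, since it may not have an antecedent."""
    groups = []
    for text, is_indefinite in chain:
        if is_indefinite or not groups:
            groups.append([])
        groups[-1].append(text)
    return groups

# The 'parents' chain from the example above; flags are hand-assigned.
parents = [
    ("Parents", True), ("They", False), ("their", False),
    ("they", False), ("they", False),
    ("Parents", True), ("their", False),
    ("parents", True), ("they", False),
]
for label, group in zip("xyz", split_at_indefinites(parents)):
    print(label, group)
```

Each new indefinite mention opens a fresh group, which reproduces the x/y/z split shown in the example.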
This behavior is by design in OntoNotes, and there is a long literature debating it. Unfortunately, it sometimes has bizarre consequences, such as for headings. For example:
Announcement of [new law]x May be [its]x Undoing
[A new law]y was announced today and [it]y ...
Because "a new law" is indefinite, and the body of an article often opens with an indefinite first mention even when the heading has already mentioned the entity, we get strangely split groups.
Holt's split is caused by indefinite mentions like "second baseman", which is part of a stats table. I'm not sure if this case is truly following the OntoNotes guidelines, because there is another guideline which says that proper nouns should be chained together if referring to the same person, regardless of everything else. But since this chain is mixed, it's an odd situation.
Anyway, this has become a rather long answer, but I hope it helps explain what is going on. @yilunzhu - what do you think, should the conversion just implement an override for this guideline whenever the chain has a proper name at any point?
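Roughly, the override I have in mind could look like the sketch below. It is only an illustration of the idea, not the actual conversion code: it assumes the sub-chains all come from one original GUM chain and that each mention carries an `is_proper` flag, and the name `apply_proper_name_override` is hypothetical.

```python
def apply_proper_name_override(subchains):
    """Rejoin sub-chains split from ONE original chain if any mention in
    them is a proper name; otherwise leave the split as it is."""
    has_proper = any(
        is_proper for chain in subchains for _, is_proper in chain
    )
    if has_proper:
        # Concatenate in order so the rejoined chain keeps document order.
        return [[mention for chain in subchains for mention in chain]]
    return subchains

# Invented toy input: a chain split by an indefinite mention.
holt = [
    [("Holt", True), ("he", False)],
    [("second baseman", False), ("Holt", True)],
]
law = [
    [("new law", False), ("its", False)],
    [("A new law", False), ("it", False)],
]
print(len(apply_proper_name_override(holt)))  # sub-chains after override
print(len(apply_proper_name_override(law)))
```

With this rule a mixed chain like Holt's comes back as a single group, while a chain with no proper name, like the heading example, stays split.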
PS - regardless of this quirk, problems like these are fairly rare, so most coreferring mentions in the OntoGUM version are fine and uncontroversial, you just caught an interesting one.
Thanks @amir-zeldes for the detailed explanation! I think it's more intuitive to connect the chain when a proper name is in the middle. We will let you know when this implementation is ready.
Sounds good, I can recompile it for the upcoming UD release 2.11 (data freeze Nov. 1, release Nov. 15)
The code has been updated here.
Thanks (and sorry for the late reply).