
Coreference issue?

Open marcostranisci opened this issue 1 year ago • 3 comments

Hi, I looked at your coreference dataset for some training (coref/ontogum/conll) and found what may be inconsistencies in the coreference annotation. I only looked at the biography files, but I often noticed that the protagonist of a biography is referred to with two different entity IDs (e.g., Holt is referred to as both 1 and 4). Is this annotation expected? (I am a newbie in the field of coreference resolution.)

Best Marco

marcostranisci avatar Oct 11 '22 14:10 marcostranisci

Hi Marco, I'm adding @yilunzhu who created the OntoGUM version of the annotations, which is described in this paper in detail.

Looking at the document you're referring to (GUM_bio_holt), the original annotations are actually correct and there is only one group for Holt. If you are using the conllu format, where there are IDs like 1 and 4, you can see this here:

https://github.com/amir-zeldes/gum/blob/master/coref/ontogum/conllu/GUM_bio_holt.conllu

The IDs 1 and 4 do appear in the OntoGUM version, and I think I agree that they are wrong, but it requires some background to explain them: OntoGUM is an attempt to convert the GUM data to follow the OntoNotes coreference resolution scheme, which is quite different from GUM's native annotations. The differences are covered in detail in this paper and more briefly in this one.
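To check which IDs a given name ends up under, one could group the mention spans by ID. This is only a minimal sketch: it assumes tab-separated token lines with the token in the second column and OntoNotes-style bracket notation (`(1`, `1)`, `(1)`, joined by `|`) in the last column. The function name `mentions_by_id` and the sample lines are made up for illustration and are not taken from GUM_bio_holt.

```python
import re
from collections import defaultdict

def mentions_by_id(lines):
    """Group mention strings by coreference ID (hypothetical helper).

    Assumption: token text is in column 2 and the coreference brackets
    are in the last tab-separated column.
    """
    tokens = []
    open_starts = defaultdict(list)  # ID -> stack of start token indices
    spans = defaultdict(list)        # ID -> list of mention strings
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and document markers
        cols = line.split("\t")
        idx = len(tokens)
        tokens.append(cols[1])
        for part in cols[-1].split("|"):
            m = re.fullmatch(r"(\()?(\d+)(\))?", part)
            if not m:
                continue  # "_" or "-" means no annotation on this token
            opens, ent, closes = m.groups()
            if opens:
                open_starts[ent].append(idx)
            if closes and open_starts[ent]:
                start = open_starts[ent].pop()
                spans[ent].append(" ".join(tokens[start:idx + 1]))
    return dict(spans)

# Invented sample lines, for illustration only
sample = [
    "# begin document example",
    "0\tHolt\t(1)",
    "1\tplayed\t_",
    "2\tas\t_",
    "3\ta\t(4",
    "4\tsecond\t_",
    "5\tbaseman\t4)",
]
result = mentions_by_id(sample)
```

With the invented sample, `result` maps ID `1` to the mention "Holt" and ID `4` to "a second baseman", which makes this kind of split easy to spot.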

The specific difference which causes the chain for "Holt" to be split into two is that OntoNotes forbids indefinite mentions from having antecedents. According to the OntoNotes guidelines, the following text has three separate chains with different IDs x/y/z for 'parents':

[Parents]x should be involved with their children's education
at home, not in school. [They]x should see to it that [their]x
kids don't play truant; [they]x should make certain that the
children spend enough time doing homework; [they]x should
scrutinize the report card. [Parents]y are too likely to blame
schools for the educational limitations of [their]y children. If
[parents]z are dissatisfied with a school, [they]z should have
the option of switching to another.

This behavior is by design in OntoNotes, and there is a long literature debating it. Unfortunately, it sometimes has bizarre consequences, such as for headings. For example:

Announcement of [new law]x May be [its]x Undoing

[A new law]y was announced today and [it]y ...

Because "a law" is indefinite, and headings often allow a first mention to be indefinite in the beginning of the article, we get strangely split groups.
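The splitting rule described above can be sketched roughly as follows. This is an illustration, not the OntoGUM conversion code: it assumes each mention already carries an `is_indefinite` flag (deciding indefiniteness is the hard part and is not shown), and `split_at_indefinites` is a hypothetical name.

```python
def split_at_indefinites(chain):
    """Split one coreference chain wherever an indefinite mention occurs.

    chain: list of (text, is_indefinite) pairs in document order.
    An indefinite mention may not have an antecedent, so it always
    starts a new group; every other mention joins the current group.
    """
    groups = []
    for text, is_indefinite in chain:
        if is_indefinite or not groups:
            groups.append([])
        groups[-1].append(text)
    return groups

# The 'parents' example from above, with indefiniteness marked by hand
parents = [
    ("Parents", True), ("They", False), ("their", False),
    ("they", False), ("they", False),
    ("Parents", True), ("their", False),
    ("parents", True), ("they", False),
]
groups = split_at_indefinites(parents)
```

Here `groups` contains three lists, corresponding to the x, y and z chains in the example.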

Holt's split is caused by indefinite mentions like "second baseman", which is part of a stats table. I'm not sure if this case is truly following ON guidelines, because there is another guideline which says that proper nouns should be chained together if referring to the same person, regardless of everything else. But since this chain is mixed, it's an odd situation.

Anyway this has become a rather long answer but I hope it helps to understand what is going on. @yilunzhu - what do you think, should the conversion just implement an override for this guideline whenever the chain has a proper name at any point?
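One way to picture the override being proposed, purely as a sketch: leave a chain unsplit whenever any of its mentions is a proper name, and otherwise split at indefinites as before. The `Mention` type, the `convert_chain` function, the `keep_if_proper` switch and the sample mentions are all hypothetical; the real conversion code may work quite differently.

```python
from typing import NamedTuple

class Mention(NamedTuple):
    text: str
    indefinite: bool = False  # assumed to be annotated already
    proper: bool = False      # assumed to be annotated already

def convert_chain(chain, keep_if_proper=True):
    """Return the groups a chain is converted into.

    With keep_if_proper=True, a chain containing a proper name at any
    point is kept whole; otherwise each indefinite starts a new group.
    """
    if keep_if_proper and any(m.proper for m in chain):
        return [[m.text for m in chain]]
    groups = []
    for m in chain:
        if m.indefinite or not groups:
            groups.append([])
        groups[-1].append(m.text)
    return groups

# Invented mentions loosely modelled on the Holt case
holt = [
    Mention("Holt", proper=True),
    Mention("he"),
    Mention("second baseman", indefinite=True),
    Mention("Holt", proper=True),
]
with_override = convert_chain(holt)
without_override = convert_chain(holt, keep_if_proper=False)
```

With the override, `with_override` is a single group of four mentions; without it, the chain breaks into two groups at "second baseman".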

PS - regardless of this quirk, problems like these are fairly rare, so most coreferring mentions in the OntoGUM version are fine and uncontroversial; you just caught an interesting one.

amir-zeldes avatar Oct 12 '22 22:10 amir-zeldes

Thanks @amir-zeldes for the detailed explanation! I think it's more intuitive to connect the chain when a proper name is in the middle. We will let you know when this implementation is ready.

yilunzhu avatar Oct 13 '22 02:10 yilunzhu

Sounds good, I can recompile it for the upcoming UD release 2.11 (data freeze Nov. 1, release Nov. 15)

amir-zeldes avatar Oct 13 '22 14:10 amir-zeldes

The code has been updated here.

yilunzhu avatar Nov 01 '22 01:11 yilunzhu

thanks (and sorry for the late reply)

marcostranisci avatar Nov 20 '22 20:11 marcostranisci