Inconsistency in variant name handling

Open mbollmann opened this issue 3 years ago • 9 comments

TL;DR: If name variants are defined in the XML, whether a variant is considered part of the "canonical name" depends on the order in which the XML files are read. I feel this is a bug.


Problem description

We can define name variants in the XML; e.g., author Hongying Zan appears in different XML files like so:

<author><first>Hongying</first><last>Zan</last></author>
<author><first>Hongying</first><last>Zan</last><variant script="hani"><first>红英</first><last>昝</last></variant></author>

When the XML files are read and author names are processed, we call AnthologyIndex.resolve_name() to map each name to an ID. That calls AnthologyIndex.get_ids() to get the existing IDs for the name, which in turn calls AnthologyIndex.set_canonical_name() if, and only if, the name does not have an ID yet.

This means that the canonical name stored in the AnthologyIndex depends on whether the name is first encountered with or without a variant.

Consider:

from anthology.people import PersonName
from anthology.index import AnthologyIndex

# The same author, once without and once with a variant in another script
name = PersonName("Hongying", "Zan")
name_with_variant = PersonName("Hongying", "Zan", variant=PersonName("红英", "昝", script="hani"))

# Case 1: the name without a variant is seen first
idx = AnthologyIndex()
print(idx.get_ids(name))
print(idx.get_ids(name_with_variant))
print(idx.get_canonical_name("hongying-zan"))

print("----")

# Case 2: the name with a variant is seen first
idx = AnthologyIndex()
print(idx.get_ids(name_with_variant))
print(idx.get_ids(name))
print(idx.get_canonical_name("hongying-zan"))

Output:

['hongying-zan']
['hongying-zan']
Hongying Zan
----
['hongying-zan']
['hongying-zan']
Hongying Zan (昝红英)

Expected behaviour would be getting the same canonical name in both cases.
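
The order dependence can be reproduced without the Anthology code. The following is a toy model only: ToyName and ToyIndex are hypothetical stand-ins, not the real PersonName and AnthologyIndex. It assumes that the ID is derived from the first and last name alone, and that the canonical name is set only when an ID is seen for the first time.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyName:
    """Hypothetical stand-in for a name with an optional variant."""
    first: str
    last: str
    variant: Optional[str] = None  # the name rendered in another script

    def slug(self) -> str:
        # Assumption: the ID ignores the variant, so both forms share one ID.
        return f"{self.first}-{self.last}".lower()

    def __str__(self) -> str:
        base = f"{self.first} {self.last}"
        return f"{base} ({self.variant})" if self.variant else base


class ToyIndex:
    """Hypothetical index: the first name seen for an ID becomes canonical."""

    def __init__(self) -> None:
        self.canonical: dict[str, ToyName] = {}

    def get_ids(self, name: ToyName) -> list[str]:
        id_ = name.slug()
        # The canonical name is set only if the ID did not exist yet.
        if id_ not in self.canonical:
            self.canonical[id_] = name
        return [id_]


def canonical_after(names: list[ToyName]) -> str:
    """Feed names to a fresh index in order and return the canonical form."""
    idx = ToyIndex()
    for n in names:
        idx.get_ids(n)
    return str(idx.canonical["hongying-zan"])


plain = ToyName("Hongying", "Zan")
with_variant = ToyName("Hongying", "Zan", variant="昝红英")

print(canonical_after([plain, with_variant]))  # Hongying Zan
print(canonical_after([with_variant, plain]))  # Hongying Zan (昝红英)
```

Any scheme in which the first name seen for an ID wins will show this behaviour as soon as two distinguishable forms of a name share one ID.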

(NB: This issue probably causes the difference in builds that I'm seeing in #1473.)

Solution

I'm not actually sure what the correct solution here should be. What should happen when a name has a variant defined within the XML, as opposed to our name_variants.yaml?

  • Should it be ignored for purposes of setting the canonical name?
  • Should it be treated as a different name from the one without a variant?
  • Should the variant propagate to all other occurrences of the name (but then what if there are multiple, conflicting variants in the XML)?

Pinging @davidweichiang in particular since he introduced some of this feature.

mbollmann, Aug 14 '21 11:08

Not intimately familiar with this feature but I would have expected name_variants.yaml to contain all variants. Could a new variant specified in the XML trigger an error until it is added to the YAML?
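
A check of that kind might look roughly like the sketch below. It is hypothetical throughout: the registry is a plain in-memory dict, not the actual name_variants.yaml schema, and check_variant and UnknownVariantError are made-up names.

```python
class UnknownVariantError(ValueError):
    """Raised when an XML variant is not registered for the name (hypothetical)."""


def check_variant(known_variants, canonical, variant):
    """Fail if `variant` is not registered for `canonical`.

    known_variants: dict mapping a canonical name string to a set of
    registered variant strings (an assumed structure for illustration).
    """
    if variant is None:
        return  # names without a variant always pass
    if variant not in known_variants.get(canonical, set()):
        raise UnknownVariantError(
            f"variant {variant!r} of {canonical!r} is not registered"
        )


known = {"Hongying Zan": set()}

try:
    check_variant(known, "Hongying Zan", "昝红英")
    raised = False
except UnknownVariantError:
    raised = True  # the variant is not registered yet

# After registering the variant, the same check passes silently.
known["Hongying Zan"].add("昝红英")
check_variant(known, "Hongying Zan", "昝红英")
```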

nschneid, Aug 14 '21 15:08

I'm still trying to understand the issue, but what I think is that regardless of order, the author's id should be hongying-zan, their canonical name should be Hongying Zan, and 昝红英 should be a name variant.

If it is convenient, I'd suggest modifying the data structures so that there is a PersonName class that is a single name and a PersonNameListing class that has a primary name and one or more variants. Then a paper should have a list of PersonNameListings, not a list of PersonNames. But a person's canonical name should be a PersonName, not a PersonNameListing.
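
A minimal sketch of that split, assuming simple frozen dataclasses. The field names and the canonical_name helper are illustrative only and do not reflect the existing anthology.people code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PersonName:
    """A single name in a single script."""
    first: str
    last: str
    script: Optional[str] = None


@dataclass(frozen=True)
class PersonNameListing:
    """A name as listed on a paper: a primary name plus any variants."""
    primary: PersonName
    variants: Tuple[PersonName, ...] = ()


def canonical_name(listing: PersonNameListing) -> PersonName:
    # A canonical name is always a PersonName, never a whole listing.
    return listing.primary


plain = PersonNameListing(PersonName("Hongying", "Zan"))
with_variant = PersonNameListing(
    PersonName("Hongying", "Zan"),
    variants=(PersonName("红英", "昝", script="hani"),),
)

# Both listings give the same canonical name, whichever is read first.
print(canonical_name(plain) == canonical_name(with_variant))  # True
```

Because the variant lives on the listing and not on the name, the two listings no longer compete to become the canonical name.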

davidweichiang, Aug 14 '21 17:08

Not intimately familiar with this feature but I would have expected name_variants.yaml to contain all variants.

It's an intentional distinction between global variants and paper-level variants. For context, see https://github.com/acl-org/acl-anthology/pull/1027#issuecomment-715309941 (and I now realize that @mjpost actually authored most of this feature and I should ping him too).

mbollmann, Aug 14 '21 19:08

I find it takes some time to swap this back in every time. It helps to have a good vocabulary. I’m not sold on the names, but David’s PersonName and PersonNameListing capture the phenomenon well. I think this maps to types/tokens. A PersonName is an actual person, who usually writes their name in a consistent fashion, but may change that over time, or put it down in different forms or with different scripts. A PersonNameListing is an instance of a name being written on the paper.

I think the <variant> problem here is that we have made a type-level annotation at the token level. Since not all Hongying Zans will use the same variant in another script, what we really know is that this specific person used that variant.

Many of our problems are rooted in trying to map PersonNameListings, which are observed data, to PersonNames, which are latent data. Part of the problem is that this process is half-implicit: some IDs come from name_variants, and some are inferred from paper tokens.

I wonder if the right way to go about this is to explicitly represent every person in the Anthology, and to then require every <author> tag to have an ID.

mjpost, Aug 14 '21 20:08

I find it takes some time to swap this back in every time. It helps to have a good vocabulary. I’m not sold on the names, but David’s PersonName and PersonNameListing capture the phenomenon well. I think this maps to types/tokens. A PersonName is an actual person, who usually writes their name in a consistent fashion, but may change that over time, or put it down in different forms or with different scripts. A PersonNameListing is an instance of a name being written on the paper.

FWIW I find these terms totally opaque, and I interpret your description of them as exactly the opposite of how David used them.

I wonder if the right way to go about this is to explicitly represent every person in the Anthology, and to then require every <author> tag to have an ID.

I've wondered about that too in the course of #1473. Just as with bibkeys, resolving and matching people's names every time we instantiate the Anthology has the potential to result in inconsistent behaviour, makes it harder to reason about the result from just looking at the XML, and also increases build time. If we commit author IDs to the XML, we can see exactly what authors are considered to be the same person, and we can still fix errors with that assignment by updating the XML.
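
As a rough sketch of what reading committed IDs could look like: the id attribute on <author> and the author_ids helper are assumed here for illustration, and only the standard library XML parser is used.

```python
import xml.etree.ElementTree as ET

# Two occurrences of the same author, with and without a variant.
# Both carry the same explicit ID, so no name matching is needed.
SAMPLE = """<paper>
  <author id="hongying-zan"><first>Hongying</first><last>Zan</last></author>
  <author id="hongying-zan"><first>Hongying</first><last>Zan</last><variant script="hani"><first>红英</first><last>昝</last></variant></author>
</paper>"""


def author_ids(xml_text):
    """Return the explicit ID of every <author>, failing if one is missing."""
    root = ET.fromstring(xml_text)
    ids = []
    for author in root.iter("author"):
        author_id = author.get("id")
        if author_id is None:
            raise ValueError("<author> without an explicit id")
        ids.append(author_id)
    return ids


print(author_ids(SAMPLE))  # ['hongying-zan', 'hongying-zan']
```

With IDs read straight from the XML, the result no longer depends on the order in which files are processed.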

mbollmann, Aug 14 '21 23:08

A short fix might be to interpret a <variant> tag on a paper as denoting a new person with a new ID. So <first>matt</first><last>post</last> would be a different person from <first>matt</first><last>post</last><variant>matthew</variant>.

FWIW I find these terms totally opaque, and I interpret your description of them as exactly the opposite of how David used them.

Now I am further confused. What if we used for terminology

  • A Person, which is a person in the real world
  • A Name, which is a text string found on a paper, which optionally includes a first name and a variant

We could then move to explicit representations of people in a people database (say under data/yaml). A person has at least one name and potentially a cloud of names. The main ambiguity problem is then mapping names on papers to these people.
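
Something along these lines, as a sketch only: the Person class, the build_lookup helper and the IDs below are hypothetical, and the records are kept in memory instead of being loaded from a file under data/yaml.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    """A person in the real world, with a cloud of names (hypothetical)."""
    id: str
    names: set = field(default_factory=set)


def build_lookup(people):
    """Map each name string to the IDs of all people who use it."""
    lookup = {}
    for person in people:
        for name in person.names:
            lookup.setdefault(name, []).append(person.id)
    return lookup


people = [
    Person("hongying-zan", {"Hongying Zan", "昝红英"}),
    Person("matt-post", {"Matt Post", "Matthew Post"}),
]
lookup = build_lookup(people)

print(lookup["昝红英"])        # ['hongying-zan']
print(lookup["Matthew Post"])  # ['matt-post']
```

A name that maps to more than one ID in such a lookup is exactly the ambiguous case that needs an explicit decision.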

mjpost, Aug 14 '21 23:08

A short fix might be to interpret a <variant> tag on a paper as denoting a new person with a new ID. So <first>matt</first><last>post</last> would be a different person from <first>matt</first><last>post</last><variant>matthew</variant>.

I fear that in practice, this would usually be the wrong thing to do, though.

Now I am further confused. What if we used for terminology

  • A Person, which is a person in the real world
  • A Name, which is a text string found on a paper, which optionally includes a first name and a variant

That makes a lot of sense to me. It'd be a lot more explicit than the (somewhat confusing) magic that happens within functions like get_ids() and resolve_name() right now.

mbollmann, Aug 16 '21 19:08

If <variant> is another name-part alongside <first> and <last>, I think I agree with @mjpost that by default these should be considered different names:

<author><first>Hongying</first><last>Zan</last></author>
<author><first>Hongying</first><last>Zan</last><variant script="hani"><first>红英</first><last>昝</last></variant></author>

It's exactly parallel with <author><last>Srinivas</last></author> and <author><first>B.</first> <last>Srinivas</last></author>, which would be different people unless explicitly merged.

On the other hand, if <variant> is supposed to insert a variant into the name variants database, then we should consider <first>Hongying</first> <last>Zan</last> to be one name and <last>昝</last><first>红英</first> to be another name. The combination <first>Hongying</first> <last>Zan</last> <variant><last>昝</last><first>红英</first></variant> should not be considered a name; it should be a new type of thing, which I called PersonNameListing and consists of a PersonName and zero or more variant PersonNames. If we make this distinction, then <first>Hongying</first> <last>Zan</last> <variant><last>昝</last><first>红英</first></variant> cannot become the canonical name, because it's not a name.
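
The two readings can be compared with a small sketch. The tuple layout and both function names are made up for illustration; each author occurrence is given as (first, last, variant), with None when no variant is present.

```python
def identities_if_variant_is_name_part(authors):
    # Reading 1: the variant is part of the name, so it distinguishes names.
    return {(first, last, variant) for first, last, variant in authors}


def identities_if_variant_is_separate_name(authors):
    # Reading 2: the variant is a name of its own; only the primary name
    # identifies the listing.
    return {(first, last) for first, last, _variant in authors}


authors = [
    ("Hongying", "Zan", None),
    ("Hongying", "Zan", "昝红英"),
]

print(len(identities_if_variant_is_name_part(authors)))      # 2
print(len(identities_if_variant_is_separate_name(authors)))  # 1
```

Under the first reading the two occurrences stay separate unless explicitly merged; under the second they collapse to one name, and the listing with the variant can never become canonical.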

davidweichiang, Aug 17 '21 19:08

The <variant> XML tag, as I understand it, is supposed to be used exclusively for a different rendering of the name in another script. So far, it has only been used for Chinese-language papers in CCL 2020, so treating a name with such a variant as different from the same name without it would fragment the author pages of basically all authors who published at both CCL 2020 and another *ACL venue.

mbollmann, Aug 17 '21 19:08