openlibrary Better Author Resolution: Don't create duplicate author records for M. <surname> with exact date match

Open tfmorris opened this issue 2 years ago • 5 comments

When an incoming author name consists only of a title and a surname, and its surname, birth date, and death date exactly match an existing author, as with M. Anicet-Bourgeois (1806 - 1871) and Auguste Anicet-Bourgeois (1806 - 1871), that should be considered sufficient to not create a duplicate author record.

This duplicate was created just a couple of weeks ago, so it's not just a historical issue.
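A minimal sketch of the proposed rule, for illustration only. `AuthorRecord`, `surname`, and `is_probable_duplicate` are hypothetical names, not Open Library code, and taking the last whitespace-separated token as the surname is a simplifying assumption that fails for inverted forms and multi-word surnames.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuthorRecord:
    # Hypothetical stand-in for an author record; not the real schema.
    name: str
    birth_date: Optional[str] = None
    death_date: Optional[str] = None


def surname(name: str) -> str:
    """Return the last whitespace-separated token, case-folded (naive)."""
    parts = name.split()
    return parts[-1].casefold() if parts else ""


def is_probable_duplicate(incoming: AuthorRecord, existing: AuthorRecord) -> bool:
    """True when surname, birth date and death date all match exactly."""
    # Require both dates on the incoming record: a surname match alone is too weak.
    if not (incoming.birth_date and incoming.death_date):
        return False
    return (
        surname(incoming.name) == surname(existing.name)
        and incoming.birth_date == existing.birth_date
        and incoming.death_date == existing.death_date
    )
```

With the example from this issue, `is_probable_duplicate(AuthorRecord("M. Anicet-Bourgeois", "1806", "1871"), AuthorRecord("Auguste Anicet-Bourgeois", "1806", "1871"))` returns `True`, so the importer would reuse the existing record rather than create a new one.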

Notes by Mek

[ ] Author Resolution: Ignore case sensitivity for author
[ ] Author Resolution: Ignore titles e.g. “Dr.” or "M."
[ ] Author Resolution: Ignore punctuation “.”
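The three items above could be combined into one normalization step applied to both names before comparison. This is a rough sketch, not the existing importer code: `normalize_author_name` and the `HONORIFICS` set are hypothetical, and the set is deliberately incomplete.

```python
# Hypothetical, incomplete list of titles to ignore (compared after case-folding).
HONORIFICS = {"dr", "m", "mr", "mrs", "ms", "mme", "mlle", "prof"}


def normalize_author_name(name: str) -> str:
    """Case-fold, drop '.' punctuation, and drop one leading title."""
    # Replace '.' with a space so "J.R.R." splits into separate initials.
    tokens = name.replace(".", " ").casefold().split()
    # Only strip a title in leading position, and never strip the only token.
    if len(tokens) > 1 and tokens[0] in HONORIFICS:
        tokens = tokens[1:]
    return " ".join(tokens)
```

One caveat: a leading "M." is ambiguous, since it can be a real initial rather than "Monsieur", so a normalized match like this is probably only safe when combined with another signal such as matching birth and death dates.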

Stakeholders

@hornc

Dec 31 '22 23:12 tfmorris

Actually, the problem is worse than described for this specific example because the MARC record includes the given name, but it got dropped during the import. The edition which caused this author record to get created is linked to a MARC record where we can see that the author's given name is included in a 100$q subfield.

There may be other cases where this isn't true, so I still think the problem as originally stated should also be addressed.

Dec 31 '22 23:12 tfmorris

This should be a subtask of an issue specific to improving Author Resolution, which is already a goal for the year :+1:

Jan 03 '23 20:01 mekarpeles

The specific issue here is that the M. is a title and appears in a 100$c subfield, which I don't think is used to form an alternate name to search against.

My first thought was that the author should be matched if M. Anicet-Bourgeois was an alternate name for the existing author, but I'm not sure the titles are being used for this matching.

It's possible the existing author only having the reversed form Anicet-Bourgeois M. as an alternate name is confusing the match.

Is Anicet-Bourgeois M. a real alternate name for this author, or should it be M. Anicet-Bourgeois, and the combined 100$a and 100$c should have produced and matched this exact string?

Jan 17 '23 00:01 hornc

M. is the French equivalent of Mr. I had missed the fact that it was coded in the 100$c. Possible name forms from

100 1 $aAnicet-Bourgeois,$cM.$q(Auguste),$d1806-1871.

include:

M. Anicet-Bourgeois
Auguste Anicet-Bourgeois
Monsieur Anicet-Bourgeois
M. Auguste Anicet-Bourgeois

although I think only the first two are common and the first is mostly limited to pre-20th century publications.

Using the information from the 100$q is pretty easy for a human, but there are many different cases for an algorithm to deal with.
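For the one 100 field quoted above, the four name forms can be derived mechanically. The sketch below is illustrative only: `candidate_names` and `TITLE_EXPANSIONS` are hypothetical names, the subfields are assumed to be already parsed into a dict, and `$a` is assumed to hold just a surname, which is exactly the kind of simplification that breaks on the many other cases.

```python
from typing import Dict, List

# Hypothetical mapping of abbreviated titles to their spelled-out form.
TITLE_EXPANSIONS = {"M.": "Monsieur"}


def candidate_names(subfields: Dict[str, str]) -> List[str]:
    """Build natural-order name forms from MARC 100 $a, $c and $q."""
    # Strip the trailing ISBD punctuation (commas, spaces) from each subfield.
    surname = subfields.get("a", "").strip(" ,")
    title = subfields.get("c", "").strip(" ,")
    # $q is the fuller form of name and is usually wrapped in parentheses.
    fuller = subfields.get("q", "").strip(" ,").strip("()")
    if not surname:
        return []
    forms = []
    if title:
        forms.append(f"{title} {surname}")
    if fuller:
        forms.append(f"{fuller} {surname}")
    if title in TITLE_EXPANSIONS:
        forms.append(f"{TITLE_EXPANSIONS[title]} {surname}")
    if title and fuller:
        forms.append(f"{title} {fuller} {surname}")
    return forms


field_100 = {"a": "Anicet-Bourgeois,", "c": "M.", "q": "(Auguste),", "d": "1806-1871."}
names = candidate_names(field_100)
```

Here `names` holds the same four forms listed above, in the same order. The `$d` dates are ignored by this function and would be compared separately.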

Jan 17 '23 17:01 tfmorris

Please note that the one record links Wikidata, ISNI and VIAF identifiers, and that all of those identify alternate name forms which include the “M.” It would seem the answer is to just get on with importing those alternate name forms to the AKAs, as librarians have long wanted.

Oct 18 '23 17:10 LeadSongDog
