openlibrary
Better Author Resolution: Don't create duplicate author records for M. <surname> with exact date match
For author records that consist of a title and surname, an exact match on surname, birth date, and death date should be considered sufficient to avoid creating a duplicate author record. For example, M. Anicet-Bourgeois (1806 - 1871) and Auguste Anicet-Bourgeois (1806 - 1871) should resolve to the same author.
This duplicate was created just a couple of weeks ago, so it's not just a historical issue.
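The proposed rule can be sketched roughly as follows. This is only an illustration, not Open Library's actual import code: `AuthorRecord`, `surname_of`, and `is_probable_duplicate` are hypothetical names, and taking the last whitespace-separated token as the surname is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuthorRecord:
    # Hypothetical minimal author shape, for illustration only.
    name: str
    birth_date: Optional[str] = None
    death_date: Optional[str] = None


def surname_of(name: str) -> str:
    """Naive surname guess: the last whitespace-separated token, case-folded."""
    tokens = name.split()
    return tokens[-1].casefold() if tokens else ""


def is_probable_duplicate(a: AuthorRecord, b: AuthorRecord) -> bool:
    """True when surname, birth date, and death date all match exactly."""
    # Require both dates on both records; a surname alone is too weak a signal.
    if not (a.birth_date and a.death_date and b.birth_date and b.death_date):
        return False
    surname = surname_of(a.name)
    return (
        surname != ""
        and surname == surname_of(b.name)
        and a.birth_date == b.birth_date
        and a.death_date == b.death_date
    )


existing = AuthorRecord("Auguste Anicet-Bourgeois", "1806", "1871")
incoming = AuthorRecord("M. Anicet-Bourgeois", "1806", "1871")
print(is_probable_duplicate(existing, incoming))
```

Requiring both dates keeps the rule conservative: two authors who merely share a surname are not merged.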
Notes by Mek
- [ ] Author Resolution: Ignore case sensitivity for author
- [ ] Author Resolution: Ignore titles e.g. “Dr.” or "M."
- [ ] Author Resolution: Ignore punctuation “.”
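A minimal sketch of what those three normalizations might look like when applied before comparing names. The `TITLES` set and the `normalize_author_name` function are assumptions made for illustration; they are not existing Open Library code, and a real title list would need curation.

```python
import re

# Hypothetical set of honorifics to ignore (compared after case-folding and
# punctuation removal). Not an exhaustive or authoritative list.
TITLES = {"dr", "m", "mr", "mrs", "ms", "mme", "mlle", "prof", "sir"}


def normalize_author_name(name: str) -> str:
    """Case-fold, drop '.' and ',' and strip leading honorifics."""
    # Replace periods and commas with spaces so "M.Anicet" still splits.
    cleaned = re.sub(r"[.,]", " ", name.casefold())
    tokens = cleaned.split()
    # Only strip titles from the front, and never remove the final token.
    while len(tokens) > 1 and tokens[0] in TITLES:
        tokens.pop(0)
    return " ".join(tokens)


print(normalize_author_name("M. Anicet-Bourgeois"))
print(normalize_author_name("Dr. John SMITH"))
```

One caveat with this approach: a leading "M." can also be a genuine initial, so stripping it is safer when combined with another signal such as matching dates.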
Stakeholders
@hornc
Actually, the problem is worse than described for this specific example, because the MARC record includes the given name, but it got dropped during the import. The edition which caused this author record to be created is linked to a MARC record where we can see that the author's given name is included in a `100$q` subfield.
There may be other cases where this isn't true, so I still think the problem as originally stated should also be addressed.
This should be a subtask of an issue specific to improving Author Resolution, which is already a goal for the year :+1:
The specific issue here is that the `M.` is a title and appears in a `100$c` subfield, which I don't think is used to form an alternate name to search against.
My first thought was that the author should be matched if `M. Anicet-Bourgeois` was an alternate name for the existing author, but I'm not sure the titles are being used for this matching.
It's possible the existing author only having the reversed form `Anicet-Bourgeois M.` as an alternate name is confusing the match.
Is `Anicet-Bourgeois M.` a real alternate name for this author, or should it be `M. Anicet-Bourgeois`, and should the combined `100$a` and `100$c` have produced and matched this exact string?
M. is the French equivalent of Mr. I had missed the fact that it was coded in the 100$c. Possible name forms from `100 1 $aAnicet-Bourgeois,$cM.$q(Auguste),$d1806-1871.` include:
- M. Anicet-Bourgeois
- Auguste Anicet-Bourgeois
- Monsieur Anicet-Bourgeois
- M. Auguste Anicet-Bourgeois
although I think only the first two are common and the first is mostly limited to pre-20th century publications.
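As a rough illustration of how those forms could be derived, here is a sketch that takes already-parsed 100 subfields as a plain dict. The `name_forms` function and the dict shape are hypothetical, and it assumes `$a` holds only the surname, as in this record. It does not expand "M." to "Monsieur", and it ignores the many other ways these subfields are used in practice.

```python
def _clean(value: str) -> str:
    """Trim whitespace and a trailing comma (MARC subfield punctuation)."""
    return value.strip().rstrip(",").strip()


def name_forms(subfields: dict) -> list:
    """Build candidate display names from 100 $a, $c (title), $q (fuller form)."""
    surname = _clean(subfields.get("a", ""))
    title = _clean(subfields.get("c", ""))
    # $q is conventionally wrapped in parentheses, e.g. "(Auguste),"
    fuller = _clean(subfields.get("q", "")).strip("()").strip()
    if not surname:
        return []
    forms = []
    if title:
        forms.append(f"{title} {surname}")
    if fuller:
        forms.append(f"{fuller} {surname}")
    if title and fuller:
        forms.append(f"{title} {fuller} {surname}")
    # With neither title nor fuller form, the surname is the only candidate.
    return forms or [surname]


field_100 = {"a": "Anicet-Bourgeois,", "c": "M.", "q": "(Auguste),", "d": "1806-1871."}
for form in name_forms(field_100):
    print(form)
```

Each generated form could then be tried against existing author names and alternate names, rather than relying on a single constructed string.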
Using the information from the 100$q is pretty easy for a human, but there are many different cases for an algorithm to deal with.
Please note that the one record links Wikidata, ISNI and VIAF identifiers, and that all of those identify alternate name forms which include the “M.” It would seem the answer is to just get on with importing those alternate name forms to the AKAs, as librarians have long wanted.