Improve author name resolution
Partially closes #7349
Related: internetarchive/infogami#217
Feature.
Technical
This PR does three things:
- Make name search case-insensitive (though really this is an Infogami change);
- Drop a variety of honorifics completely (so they are not imported, nor used for comparison to existing authors); and
- Change the author name resolution strategy.
This PR would change the author name resolution order as follows:
- match on name, with priority going to name + birth date + death date;
- match on alternate_names, with priority going to alternate_names + birth date + death date; and
- match on surname, with a matching birth date + death date required.
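To make the resolution order above concrete, here is a minimal sketch over an in-memory list of author dicts. It is not the code in this PR: `resolve_author`, the dict layout, taking the last whitespace-separated token as the surname, and returning the first candidate when several tie are all assumptions made for illustration.

```python
from typing import Optional


def resolve_author(candidate: dict, existing: list) -> Optional[dict]:
    """Illustrative only: pick an existing author using the three-step order."""

    def same_dates(a: dict, b: dict) -> bool:
        return (a.get("birth_date") == b.get("birth_date")
                and a.get("death_date") == b.get("death_date"))

    name = candidate["name"].casefold()  # case-insensitive comparison

    # 1. Match on name; prefer a match that also agrees on both dates.
    by_name = [a for a in existing if a["name"].casefold() == name]
    if by_name:
        dated = [a for a in by_name if same_dates(a, candidate)]
        return (dated or by_name)[0]

    # 2. Match on alternate_names; same preference for matching dates.
    by_alt = [
        a for a in existing
        if name in (n.casefold() for n in a.get("alternate_names", []))
    ]
    if by_alt:
        dated = [a for a in by_alt if same_dates(a, candidate)]
        return (dated or by_alt)[0]

    # 3. Match on surname, but only when both dates are present and agree.
    parts = name.split()
    if parts and candidate.get("birth_date") and candidate.get("death_date"):
        surname = parts[-1]  # assumption: surname is the last token
        for a in existing:
            existing_parts = a["name"].casefold().split()
            if (existing_parts and existing_parts[-1] == surname
                    and same_dates(a, candidate)):
                return a

    return None
```

The point of the sketch is the fall-through: each step is only tried if the previous one found nothing, and the surname step never fires without both dates.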
It does this in part by relying on an Infobase change that switches the operator
associated with ~ from LIKE to ILIKE. See internetarchive/infogami#217.
It also updates mock_site() to more accurately approximate the LIKE
(and now ILIKE) query done by PostgreSQL.
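For reviewers unfamiliar with how a mock can approximate LIKE/ILIKE, here is a rough sketch of the general technique: translate the SQL pattern into a regular expression. This is not the mock_site() code from this PR; `like_to_regex` and `ilike` are hypothetical names, and the sketch ignores LIKE's escape-character handling.

```python
import re


def like_to_regex(pattern: str, case_insensitive: bool = True) -> "re.Pattern":
    """Translate a SQL LIKE pattern into a compiled regex (no ESCAPE support)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any run of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    flags = re.DOTALL
    if case_insensitive:
        flags |= re.IGNORECASE   # ILIKE behaviour
    return re.compile("".join(parts), flags)


def ilike(value: str, pattern: str) -> bool:
    """Hypothetical helper: True if the whole value matches the pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None
```

Using fullmatch matters here, because LIKE has to match the entire value rather than a substring.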
I should note this does not drop honorifics at the end of an author name, which was part of the problem in #7349. My rationale was that this case would get picked up by alternate_names (once an alternate name has been added), and that only operating at the start of a name would be less likely to make an unintended removal. However, it would be easy to remove honorifics at the end if that is also desired.
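As an illustration of what "only operating at the start of a name" means, here is a minimal sketch. The honorific list and the `strip_leading_honorific` name are assumptions for the example; the set actually dropped by this PR may differ.

```python
import re

# Hypothetical list for illustration; not the set used in the PR.
HONORIFICS = ("mr", "mrs", "ms", "dr", "prof", "sir")

# Anchored at the start: honorific, optional period, then whitespace.
_LEADING_HONORIFIC = re.compile(
    r"^\s*(?:" + "|".join(HONORIFICS) + r")\.?\s+",
    re.IGNORECASE,
)


def strip_leading_honorific(name: str) -> str:
    """Remove one honorific from the start of a name; leave the rest alone."""
    return _LEADING_HONORIFIC.sub("", name, count=1).strip()
```

Because the pattern requires whitespace after the honorific, names that merely begin with the same letters (e.g. "Drake") are left untouched, and a trailing "Dr." is not removed.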
This PR also does not address punctuation removal, which was called for in #7349. That's partially because I simply wasn't sure if I should, as this could result in overhead or required changes to the database structure, depending on the manner in which we do this.
However, one strategy that could use indexing and require no changes to the DB would be to use PostgreSQL's _ single-character wildcard with LIKE/ILIKE. That would allow a search such as William H_ Brewer to match an existing William H. Brewer, though it would not work the other way around.
I don't think it would add that much complexity to enable this using a new operator (e.g. ~_ or something, instead of ~), but I wanted to check before pursuing this at all.
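If it helps the discussion, one way the search pattern could be built is sketched below. `punctuation_to_wildcard` is a hypothetical helper, and replacing every ASCII punctuation character with `_` is an assumption about what "punctuation removal" would mean here, not something this PR implements.

```python
import string


def punctuation_to_wildcard(name: str) -> str:
    """Replace each ASCII punctuation character with LIKE's one-char wildcard."""
    return "".join("_" if ch in string.punctuation else ch for ch in name)
```

A pattern built this way still needs a character in each wildcard position, so William H_ Brewer would match William H. Brewer but not William H Brewer.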
Testing
NOTE: testing, or at least the case-insensitive aspect of it, relies on the one-line change found in internetarchive/infogami#217. Without that merged, tests (against PostgreSQL) relating to case-insensitivity will fail. The unit tests, however, will pass, as they rely on an updated mock_infobase, as frustrating as that may be.
The tests should lay out a fairly comprehensive listing of the cases covered. However, those do rely on mock_site(), which, try as I might, may not fully replicate ILIKE.
The last commit adds a /api/author endpoint solely for testing purposes. It is NOT intended for merging.
The idea there is just to make it easier to test the author-specific part of build_query without having to actually import things -- one can simply see what the result would be.
Sample use:
❯ curl -X POST http://localhost:8080/api/author \
-H "Content-Type: application/json" \
-b ~/cookies.txt \
-d '{
"authors": [{"name": "Mr. ePictEtus"}]
}'
{'type': {'key': '/type/edition'}, 'authors': [<Author: '/authors/OL5A'>]}
Stakeholders
@tfmorris @seabelis @cdrini