openlibrary icon indicating copy to clipboard operation
openlibrary copied to clipboard

Fix librarian work title and author name sorting. Fixes #11430

Open tfmorris opened this issue 1 month ago • 1 comments

Closes #11430. Reverses damage from #11211. Modifies #11188.

fix

Technical

Fixes:

  1. Uses ICUCollationField with an empty locale to do generic multi-lingual collation for work titles. Since only the collation key is stored, the storage requirements are minimal. Moves title article removal from Solr filter to Python Solr updater because no analysis pipeline is allowed for the above.
  2. Uses the same type of collation field for author names, but without any special pre-processing

Related changes:

  1. Incorporated diffs from Solr 9.4->9.9 into solrconfig.xml so that OL config is based on the current default config. Modifies what was done in #11188
  2. Made configuration of analysis-extras consistent across all environments. Also refs #11188.
  3. Removed unused fields name_str and title_suggest. The latter saves 3.4 GB for the current 55 M editions and the former eliminates the duplicate name storage for 15 M authors saving another 300+ MB.

Because the title pre-processing is now done in Python where more context is available, a potential future enhancement would be to only remove articles for the language that the title is in, rather than using a generic multi-lingual regex which will have both false positives and false negatives. Note that language specific collation is likely not an option since it would depend on all titles being in the same language.

Testing

Use standard Solr post-reindexing test suite.

Stakeholders

@mheiman

~Attribution Disclaimer: By proposing this pull request, I affirm to have made a best-effort and exercised my discretion to make sure relevant sections of this code which substantially leverage code suggestions, code generation, or code snippets from sources (e.g. Stack Overflow, GitHub) have been annotated with basic attribution so reviewers & contributors may have confidence and access to the correct context to evaluate and use this code.~

I don't understand what the above means. All code was authored by me personally.

tfmorris avatar Nov 12 '25 20:11 tfmorris

@mekarpeles @cdrini Is there a problem with this PR? If so, can you outline what needs changing? What do I need to do to get it reviewed and merged? It's been open a month and no one has even looked at it yet, which doesn't encourage me to invest in the followup work that I had intended to fix other search problems.

tfmorris avatar Dec 11 '25 20:12 tfmorris

Howdy @tfmorris , apologies for the delay. I unfortunately have a rather large backlog of prs, and have been pulled away fighting a lot of our performance issues as a result of abusive traffic. I'll try to get a code review for you next week 👍 I will note, that smaller, single issue/purpose PRs are the best way to help us get through PRs quickly; they're easier to test and lower risk, so they move through much faster.

cdrini avatar Dec 13 '25 05:12 cdrini

Thanks for the reply.

I will note, that smaller, single issue/purpose PRs are the best way to help us get through PRs quickly; they're easier to test and lower risk, so they move through much faster.

I'm happy to split this up if you provide guidance on how you'd like it divided. There were two bugs introduced in PR #11211 which this fixes. Issue #11430 opened by @mheiman documents one of the two bugs, but I chose to fix both together, since they were both introduced in the same PR. Ironically, that PR got reviewed, approved, and "auto-merged" in under 24 hours.

Let me know how you'd like it split up. All the changes are in separate commits, so it's easy enough to turn this into 6 separate PRs.

tfmorris avatar Dec 13 '25 21:12 tfmorris