Changes to Author names are not updated in Solr's index of Works
from @LeadSongDog commenting on #349
1. For instance, this morning I did some changes at OL35246A, including disinverting his name. As you can see, it still shows the inverted author name for each work listed at https://openlibrary.org/authors/OL35246A/William_Henry_Brown
2. Further, the works shown on the author page do not reflect changes that do show when you click through, e.g. the 1963 "Organic chemisty" [sic] which was by a different author entirely, as is now seen at https://openlibrary.org/works/OL16120915W/Organic_chemistry
3. The edition counts shown on the Author page do not agree with (they lag) the edition counts shown on the listed Work pages. The "What work is this an edition of" dropdown has the same lagging edition count.
4. Searching either "Title" or "All" for "Organic chemisty" still shows the author as William Henry Brown vice the now-corrected Frederick G. Bordwell which shows once you click on the found work.
Currently the code that updates author information in Solr does not trigger any reindexes of Works, which accounts for 1, 2, and 4. Item 3 seems like the inverse of this: Author data needs to be updated when Works/Editions change. [TODO: create an issue for Author edition counts]
Proposal:
If any author field that is indexed against Works is changed on the Author record, all Works of that Author need to be re-indexed (/updated)
The following Author fields are stored against Works in Solr:
<field name="author_key" type="string" multiValued="true"/>
<field name="author_name" type="textgen" multiValued="true"/>
<field name="author_alternative_name" type="textgen" multiValued="true"/>
taken from https://github.com/internetarchive/openlibrary/blob/bdb69c2a3be687f7b4f60fd6c41b03d475a2e331/conf/solr/conf/schema.xml#L480
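A minimal sketch of the check the proposal implies, not the actual Solr updater code. The mapping from Solr work fields to author-record fields (`key`, `name`, `alternate_names`) is an assumption, and the function names are hypothetical:

```python
from __future__ import annotations

# Solr work fields taken from the schema lines above, mapped to the
# author-record field ASSUMED to feed each one (not confirmed by the source).
WORK_FIELDS_FROM_AUTHOR = {
    "author_key": "key",
    "author_name": "name",
    "author_alternative_name": "alternate_names",
}


def changed_work_fields(old_author: dict, new_author: dict) -> list[str]:
    """Return the Solr work fields whose source author field changed."""
    return sorted(
        solr_field
        for solr_field, author_field in WORK_FIELDS_FROM_AUTHOR.items()
        if old_author.get(author_field) != new_author.get(author_field)
    )


def works_to_reindex(old_author: dict, new_author: dict, work_keys: list[str]) -> list[str]:
    """Return every work key of the author if an indexed field changed, else []."""
    if changed_work_fields(old_author, new_author):
        return list(work_keys)
    return []
```

With this shape, an edit that only touches a field not indexed against Works (a bio, say) would not trigger any Work reindexing, which keeps the extra load limited to the edits that actually cause stale search results.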
I've also labeled this high priority since it appears to be a critical bug for Open Library's mission.
@hornc Was a separate issue made for Author edition counts? Also are you or @tfmorris willing to be assignee for this issue? Note, being the assignee doesn't necessarily mean you are responsible for doing the work, just responsible for gathering/providing information to address the issue. From the Wiki:
The assigned owner is not necessarily the person who will fix the issue (it is not necessarily even established, at that point, if or when the issue will be fixed at all), but rather they are the person who will do as much or as little as needed to handle the issue (asking questions, soliciting input, establishing and updating the priority, checking if it is a duplicate, etc).
Once an issue is labeled State: Work In Progress, the owner is the individual doing the work, or leading/coordinating the group that is doing the work.
#628 is related. It sounds like @hornc may have some ideas for where the problems are located, but if he doesn't get around to it first, I'll add this to my pile when I've got some of the more basic Solr issues fixed.
Making this a sub-task of #789. Assigning @cdrini per slack discussions.
Maybe we've missed the point here. Why are author names in the index of works to begin with, rather than just author identifiers?
Because that way they can be returned with the search results in a single query rather than having to do many queries.
@tfmorris I get that it’s faster, but it’s wrong. Patrons should not have to guess at author name spellings, and there are many variations in the data. That is why we have author identifiers. [update] If the author names must be shown in the work records, then they should ALL be there, which means an edit to the A.K.A.s would have to affect all the records for all the works by that author. It's too ugly an option to consider seriously.
@hornc Mind stating where the blocker is, please?
@leadsongdog I'm not sure of the current state, but when I added the blocked tag 2 years ago I believe we had decided that attempting to fix individual Solr indexing issues needed to wait until we had upgraded to a modern version and put some better logging / debugging in place.
I think you have a good point that (one?) Author name should likely not be indexed with the work. This issue should be changed to something like "Solr stores an initial (unnecessary) Author name on Works which will not stay in sync"
@hornc So then the blockers include #3317?
This looks like recent info on the Solr status: https://coda.io/d/Search-Planning-Notes_dO8sGM90quA/Epics-in-Progress_su1I2#_luBNR which I found next to the 'Improving Search' project
Testing whether this is still an issue:
Corrected the casing of the author name at https://openlibrary.org/authors/OL12521441A/Charles_O._Hardy
immediately after making the change:
and it hasn't changed for 2 minutes. I'll monitor if it changes.
Now that Solr updates are more reliable, it seems clear that there is no attempt to re-index affected works after an author change.
That instance is indeed one where both the edition and the work records link to the author record, and where the edition record is not revised (to eliminate the redundant author link) nor is it promptly refreshed in cache.
Coincidentally, this is only one of eight records for that same author found by: https://openlibrary.org/search/authors?q=Hardy+Charles+o*&mode=everything
I have to think we are missing out by not having author pages and work pages show links to (or perhaps a count of) synonymous author/work pages.
Page caching can be an issue, but the JSON Search API shows the same stale author name 13 hours later:
https://openlibrary.org/search.json?q=OL47061593M
and the search result has a timestamp of last_modified_i: 1677582478 or 28 Feb 2023 which makes it clear that the Solr index hasn't been updated.
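For anyone wanting to check the date, `last_modified_i` reads as Unix epoch seconds, so the value above can be converted like this (a small sketch using only the standard library):

```python
from datetime import datetime, timezone

# Value copied from the search.json result quoted above.
last_modified_i = 1677582478

# Interpret as seconds since the Unix epoch, in UTC.
last_modified = datetime.fromtimestamp(last_modified_i, tz=timezone.utc)
print(last_modified.isoformat())  # 2023-02-28T11:07:58+00:00
```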
Of course there is still the problem that there is no UI available to correct authors shown redundantly and sometimes incorrectly in edition records. #3413 and #2625 refer. There might be justification for an edition record to quote the author as named in the imported source, but why does it need an unmaintainable direct link to the author record when the author is already linked from the work record? Do we expect that different William Shakespeares will author different editions of The Tempest?
Howdy! I should note this is effectively a "by design" behavior of our Solr updater. Whoever put this code in place many years ago decided that when an author is edited, none of their works/editions are reindexed into Solr; only the author record itself is reindexed. My guess is that this was a performance optimization, since updating all of an author's works/editions would make a pretty hefty reindex request.
(It might've also been prevented to avoid an infinite update loop, perhaps, since works/editions trigger author reindex. If an author reindex also triggers works/editions reindexing, that could create a bug resulting in an infinite loop of reindexing. Easily enough fixed ofc, but I don't know what the original reasoning was.)
Whether the assumptions that were present then are still valid today, or whether there are any performance issues, is still an open question that needs investigation. That's largely the core task of this issue: investigate enabling work/edition reindexing on author edit, and see whether that causes performance issues.
But the "problem" lies in the Solr updater, and is not related to #3413 or #2625.
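On the infinite-loop concern mentioned above, one way to rule it out is to expand an edit by a single hop only, so that records pulled in as "related" never trigger further expansion. This is a hedged sketch, not how the Solr updater is actually structured; `keys_to_reindex` and the two lookup dicts are hypothetical stand-ins for whatever the updater uses to find an author's works and a work's authors:

```python
from __future__ import annotations


def keys_to_reindex(
    edited_key: str,
    works_by_author: dict[str, list[str]],
    authors_by_work: dict[str, list[str]],
) -> list[str]:
    """One-hop expansion of an edit into the keys that need reindexing.

    An author edit pulls in that author's works; a work edit pulls in that
    work's authors. Related records are NOT expanded again, so an author
    reindex can never bounce back through its works into another round.
    """
    if edited_key.startswith("/authors/"):
        related = works_by_author.get(edited_key, [])
    else:
        related = authors_by_work.get(edited_key, [])
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys([edited_key, *related]))
```

Whether one hop is enough (for example, whether edition records also need to be included) and how large the resulting batches get for prolific authors is exactly the kind of thing the investigation would need to measure.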
I was referring not to the updater per se, but the existence of uncorrectable broken linkages from edition records to author records. Either the edition records should lose the author link when the work link is created or else the author link should be editable so that it can be corrected. Unless I am mistaken, the former involves far less redundant human effort.
Yep! I was noting that the author stored on the edition record does not have an impact on this issue, "Changes to Author names are not updated in Solr's index of Works". Changes to that will not fix this. But otherwise 100% agree a strategy to fix #2625 would be great!
Hopefully @mekarpeles agrees that an accurate representation in the UI of the contents of the database is critically important to users and, since that UI is largely driven by search results, the Solr updater needs to be fixed. Any metering of updates to mitigate performance issues, algorithmic fixes to avoid infinite loops, and the like are just technical details for the engineering team to work out.
@LeadSongDog edition authors don't have anything to do with this.
They are certainly a problem when attempting to merge author records. When work records show one author link and the editions hide another, the merge cannot be completed.
In far too many cases these editions and the respective edition-authors are the waste products of AMZ/BWB/promise imports from defective source records, which only makes the whole thing more frustrating. It is work that never should have been necessary if common sense were common.
