Orphans no longer appear in search results
Problem
I'm not sure when it started (I first noticed in this past week). Orphaned editions do not appear in search results in either mode (solr editions search on/off).
Evidence / Screenshot
Relevant URL(s)
Reproducing the bug
I don't really have an example, because I can't find an orphan. There is a case where if there's only one record with an isbn and you search for the isbn you will navigate directly to the record. If you try to force the record appear in search results, for example, by searching ISBN and some other text string, it will not. This used to be possible.
Context
- Browser (Chrome, Safari, Firefox, etc):
- OS (Windows, Mac, etc):
- Logged in (Y/N): Y
- Environment (prod, dev, local): prod
Notes from this Issue's Lead
Proposal & constraints
Related files
Stakeholders
@cdrini
Instructions for Contributors
- Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to Github, because the pre-commit bot may add commits to your PRs upstream.
@seabelis can you please assign this issue to me?
@AbhinavKRN Before we make progress on this we'll need some more information. Please take a look at other good first issues.
@seabelis (or perhaps @hornc ?) having an example would be super helpful here! I'm also having trouble finding any orphan right now to debug :P Could one of you give a few examples of some orphaned editions?
The rough strategy is:
- Find an orphaned edition
- Confirm that searching for it through various means doesn't work on the prod site (try
key:"/works/OL...M") - Drini: Confirm that searching for it directly via solr also find no matches
- Locally: monitor solr-updater logs
docker compose logs --tail=10 -f solr-updater - Copydocs the orphaned edition into the local environment
- See if the logs show anything suspect. It should say "1 document updated" or similar
- Run the solr update http://localhost:8983/solr/openlibrary/update?commit=true
- Check if it appears in local solr directly (in localhost:8983, check if it appears for
key:"/works/OL...M"(note theM) and for ``key:"/books/OL...M"`)
The trouble is finding an example. I may have to create one.
Here's one https://openlibrary.org/books/OL11447467M/Canadian_Parliamentary_Guide_Parlementaire_Canadien_Spring_1991.
There are just under 2 million of them in the data base. This command will generate a list of them (from the July dump):
gzcat ol_dump_editions_2024-07-31.txt.gz | grep -v '/works/' | cut -f 2
The first and last 10 edition IDs are:
ia:cihm_87923
ia:egwoerterbuchmat13hilg
ia:isbn_9781401912277
ia:lonestar118timbe00wesl
ia:woodturning00barr
OL65M
OL2468M
OL29539M
OL30054M
OL30362M
OL26874667M
OL26874669M
OL26875324M
OL26877007M
OL26879286M
OL26881645M
OL26881648M
OL26882242M
OL26889833M
OL26891755M
The last 10 were entered by hand through the web interface by users in May 2019, so apparently that's the last time that the problem was seen. The early ones date from MARC bot imports in 2008 and were presumably missed by the Work Bot when it first ran.
Ok, I believe I have the root of this, which is that there's a bug preventing them from entering solr during solr-update.
One should be able to repro this by (1) copydocs this record: /books/OL10021046M, then (2) inside the container, run something like: PYTHONPATH=. python ./openlibrary/solr/update.py '/books/OL10021046M' --ol-config '/config/etc/openlibrary.yml'.
The error is likely somewhere around here: https://github.com/internetarchive/openlibrary/blob/9840d5815377d9db707bbb85a966db0cdda1d471/openlibrary/solr/update.py#L95
When I tried the above command on prod, I got this:
2025-01-16 03:53:05 [32] [openlibrary.solr.data_provider] [INFO] preload_documents0 ['/books/OL10021046M']
0.0 (4): SELECT thing.key, data.data from thing, data WHERE data.revision = thing.latest_revision and data.thing_id=thing.id AND thing.key IN ('/books/OL10021046M')
2025-01-16 03:53:05 [32] [openlibrary.solr.data_provider] [INFO] preload_documents0 ['/works/OL10021046M']
0.0 (5): SELECT thing.key, data.data from thing, data WHERE data.revision = thing.latest_revision and data.thing_id=thing.id AND thing.key IN ('/works/OL10021046M')
2025-01-16 03:53:05 [32] [openlibrary.solr.data_provider] [INFO] preload_editions_of_works ['/works/OL10021046M'] ..
2025-01-16 03:53:05 [32] [openlibrary.solr.data_provider] [INFO] preload_documents0 ['/works/OL10021046M']
0.0 (6): SELECT thing.key, data.data from thing, data WHERE data.revision = thing.latest_revision and data.thing_id=thing.id AND thing.key IN ('/works/OL10021046M')
2025-01-16 03:53:05 [32] [openlibrary.solr.data_provider] [WARNING] NOT FOUND /works/OL10021046M
2025-01-16 03:53:05 [32] [openlibrary.solr] [INFO] '/works/OL10021046M' has type '/type/delete'. queuing for deleting it solr.
At some point it switched the key from the edition key, the pseuo-work key we use in solr for orphans -- and that I think caused get_document to fail -- and I think when it fails, it returns a dummy document with just a /type/delete.
@tomrod10 Would you be interested in tackling this one? This is another tricky one which will likely require using the debugger to investigate, like your work the the search quotes issue.
Note this one has a different debug flow, you'll have to go into the web container, then run the solr_updater.py command in the above comment with the --debugger option. You should then be able to attach the vs code debugger.
Let me know if you're interested!
@cdrini Yes, I'm down to take this!
@cdrini
I'm running the following command in the web container:
PYTHONPATH=. python ./openlibrary/solr/update.py '/books/OL10021046M' --ol-config './config/openlibrary.yml' --ol-url
'localhost:8080/books/OL10021046M/
The_Fishguard_to_Bangor_Trunk_Road_(A487)_(Llanwnda_to_South_of_Llanllyfni_Improvement)_Order_1999_H?debug=true'
Then I try to attach the debugger in VS Code, then set some break points and lastly send the query for the book mentioned above (OL10021046M or The Fishguard to Bangor Trunk Road) but nothing happens. No breakpoints are hit.
Breakpoint: https://github.com/internetarchive/openlibrary/blob/9840d5815377d9db707bbb85a966db0cdda1d471/openlibrary/solr/update.py#L95
Also, I noticed that in dev and prod the result appears when I query OL10021046M or The Fishguard to Bangor Trunk Road. I'm almost certain it didn't show up before 🤔
Logs from running the command above in the web container:
Couldn't find statsd_server section in config
0.0 (1): SELECT * FROM thing LIMIT 1
0.0 (2): SELECT * FROM thing WHERE key='/type/type'
0.0 (3): SELECT * FROM thing WHERE key='/type/type'
2025-03-26 01:14:08 [infobase.ol] [INFO] logging initialized
2025-03-26 01:14:08 [infobase.ol] [DEBUG] debug
2025-03-26 01:14:08 [openlibrary.olbase] [INFO] setting up infobase events for Open Library
2025-03-26 01:14:08 [openlibrary] [INFO] Application init
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin openlibrary
2025-03-26 01:14:08 [openlibrary] [INFO] Setting up requests
2025-03-26 01:14:08 [openlibrary] [INFO] Setting up proxy
2025-03-26 01:14:08 [openlibrary] [INFO] No proxy configuration found
2025-03-26 01:14:08 [openlibrary] [INFO] Setting up proxy bypass
2025-03-26 01:14:08 [openlibrary] [INFO] No proxy bypass configuration found
2025-03-26 01:14:08 [openlibrary] [INFO] Requests set up
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin worksearch
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin inside
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin books
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin admin
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin upstream
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin importapi
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin recaptcha
2025-03-26 01:14:08 [openlibrary] [INFO] loading complete.
2025-03-26 01:14:08 [infobase] [INFO] loading plugin openlibrary.olbase
2025-03-26 01:14:08 [infobase.ol] [INFO] logging initialized
2025-03-26 01:14:08 [infobase.ol] [DEBUG] debug
2025-03-26 01:14:08 [openlibrary.olbase] [INFO] setting up infobase events for Open Library
2025-03-26 01:14:08 [openlibrary.solr.data_provider] [INFO] preload_documents0 ['/books/OL10021046M']
2025-03-26 01:14:08 [openlibrary.solr.data_provider] [INFO] preload_documents0 ['/works/OL10021046M']
0.0 (4): SELECT thing.key, data.data from thing, data WHERE data.revision = thing.latest_revision and data.thing_id=thing.id AND thing.key IN ('/works/OL10021046M')
2025-03-26 01:14:08 [openlibrary.solr.data_provider] [INFO] preload_editions_of_works ['/works/OL10021046M'] ..
0.0 (1): SELECT edition.key as edition_key, work.key as work_key FROM thing as edition, thing as work, edition_ref WHERE edition_ref.thing_id=edition.id AND edition_ref.value=work.id AND edition_ref.key_id=(select id from property where name='works' and type=(select id from thing where key='/type/edition')) AND work.key in ('/works/OL10021046M')
2025-03-26 01:14:08 [openlibrary.solr.data_provider] [INFO] preload_documents0 ['/works/OL10021046M']
0.0 (5): SELECT thing.key, data.data from thing, data WHERE data.revision = thing.latest_revision and data.thing_id=thing.id AND thing.key IN ('/works/OL10021046M')
2025-03-26 01:14:08 [openlibrary.solr.data_provider] [WARNING] NOT FOUND /works/OL10021046M
2025-03-26 01:14:08 [openlibrary.solr] [INFO] '/works/OL10021046M' has type '/type/delete'. queuing for deleting it solr.
Ah, because that script is running separate from the main open library app, you'll need to add the --debugger flag to the script; then it should pause and wait for your debugger to attach.
Oh sorry, this script isn't wired up with --debugger. Copy the snippet here into this script's main method:
https://github.com/internetarchive/openlibrary/blob/9840d5815377d9db707bbb85a966db0cdda1d471/scripts/solr_updater.py#L262-L269
@tomrod10 unassigning you, since as you mentioned on slack, you'll be busy with other things for a while :)
@pidgezero-one if you're interested this would be a good one! This has been blocking @seabelis for quite a while.