openlibrary icon indicating copy to clipboard operation
openlibrary copied to clipboard

Orphans no longer appear in search results

Open seabelis opened this issue 1 year ago • 13 comments

Problem

I'm not sure when it started (I first noticed in this past week). Orphaned editions do not appear in search results in either mode (solr editions search on/off).

Evidence / Screenshot

Relevant URL(s)

Reproducing the bug

I don't really have an example, because I can't find an orphan. There is a case where if there's only one record with an isbn and you search for the isbn you will navigate directly to the record. If you try to force the record appear in search results, for example, by searching ISBN and some other text string, it will not. This used to be possible.

Context

  • Browser (Chrome, Safari, Firefox, etc):
  • OS (Windows, Mac, etc):
  • Logged in (Y/N): Y
  • Environment (prod, dev, local): prod

Notes from this Issue's Lead

Proposal & constraints

Related files

Stakeholders

@cdrini


Instructions for Contributors

  • Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to Github, because the pre-commit bot may add commits to your PRs upstream.

seabelis avatar Aug 08 '24 10:08 seabelis

@seabelis can you please assign this issue to me?

AbhinavKRN avatar Aug 12 '24 19:08 AbhinavKRN

@AbhinavKRN Before we make progress on this we'll need some more information. Please take a look at other good first issues.

cdrini avatar Aug 19 '24 13:08 cdrini

@seabelis (or perhaps @hornc ?) having an example would be super helpful here! I'm also having trouble finding any orphan right now to debug :P Could one of you give a few examples of some orphaned editions?

The rough strategy is:

  1. Find an orphaned edition
  2. Confirm that searching for it through various means doesn't work on the prod site (try key:"/works/OL...M")
  3. Drini: Confirm that searching for it directly via solr also find no matches
  4. Locally: monitor solr-updater logs docker compose logs --tail=10 -f solr-updater
  5. Copydocs the orphaned edition into the local environment
  6. See if the logs show anything suspect. It should say "1 document updated" or similar
  7. Run the solr update http://localhost:8983/solr/openlibrary/update?commit=true
  8. Check if it appears in local solr directly (in localhost:8983, check if it appears for key:"/works/OL...M" (note the M) and for ``key:"/books/OL...M"`)

cdrini avatar Aug 19 '24 13:08 cdrini

The trouble is finding an example. I may have to create one.

Here's one https://openlibrary.org/books/OL11447467M/Canadian_Parliamentary_Guide_Parlementaire_Canadien_Spring_1991.

seabelis avatar Oct 17 '24 15:10 seabelis

There are just under 2 million of them in the data base. This command will generate a list of them (from the July dump):

gzcat ol_dump_editions_2024-07-31.txt.gz | grep -v '/works/' | cut -f 2

The first and last 10 edition IDs are:

ia:cihm_87923
ia:egwoerterbuchmat13hilg
ia:isbn_9781401912277
ia:lonestar118timbe00wesl
ia:woodturning00barr
OL65M
OL2468M
OL29539M
OL30054M
OL30362M

OL26874667M
OL26874669M
OL26875324M
OL26877007M
OL26879286M
OL26881645M
OL26881648M
OL26882242M
OL26889833M
OL26891755M

tfmorris avatar Oct 17 '24 19:10 tfmorris

The last 10 were entered by hand through the web interface by users in May 2019, so apparently that's the last time that the problem was seen. The early ones date from MARC bot imports in 2008 and were presumably missed by the Work Bot when it first ran.

tfmorris avatar Oct 23 '24 22:10 tfmorris

Ok, I believe I have the root of this, which is that there's a bug preventing them from entering solr during solr-update.

One should be able to repro this by (1) copydocs this record: /books/OL10021046M, then (2) inside the container, run something like: PYTHONPATH=. python ./openlibrary/solr/update.py '/books/OL10021046M' --ol-config '/config/etc/openlibrary.yml'.

The error is likely somewhere around here: https://github.com/internetarchive/openlibrary/blob/9840d5815377d9db707bbb85a966db0cdda1d471/openlibrary/solr/update.py#L95

When I tried the above command on prod, I got this:

2025-01-16 03:53:05 [32] [openlibrary.solr.data_provider] [INFO] preload_documents0 ['/books/OL10021046M']
0.0 (4): SELECT thing.key, data.data from thing, data WHERE data.revision = thing.latest_revision and data.thing_id=thing.id  AND thing.key IN ('/books/OL10021046M')
2025-01-16 03:53:05 [32] [openlibrary.solr.data_provider] [INFO] preload_documents0 ['/works/OL10021046M']
0.0 (5): SELECT thing.key, data.data from thing, data WHERE data.revision = thing.latest_revision and data.thing_id=thing.id  AND thing.key IN ('/works/OL10021046M')
2025-01-16 03:53:05 [32] [openlibrary.solr.data_provider] [INFO] preload_editions_of_works ['/works/OL10021046M'] ..
2025-01-16 03:53:05 [32] [openlibrary.solr.data_provider] [INFO] preload_documents0 ['/works/OL10021046M']
0.0 (6): SELECT thing.key, data.data from thing, data WHERE data.revision = thing.latest_revision and data.thing_id=thing.id  AND thing.key IN ('/works/OL10021046M')
2025-01-16 03:53:05 [32] [openlibrary.solr.data_provider] [WARNING] NOT FOUND /works/OL10021046M
2025-01-16 03:53:05 [32] [openlibrary.solr] [INFO] '/works/OL10021046M' has type '/type/delete'. queuing for deleting it solr.

At some point it switched the key from the edition key, the pseuo-work key we use in solr for orphans -- and that I think caused get_document to fail -- and I think when it fails, it returns a dummy document with just a /type/delete.

cdrini avatar Jan 16 '25 04:01 cdrini

@tomrod10 Would you be interested in tackling this one? This is another tricky one which will likely require using the debugger to investigate, like your work the the search quotes issue.

Note this one has a different debug flow, you'll have to go into the web container, then run the solr_updater.py command in the above comment with the --debugger option. You should then be able to attach the vs code debugger.

Let me know if you're interested!

cdrini avatar Feb 26 '25 01:02 cdrini

@cdrini Yes, I'm down to take this!

tomrod10 avatar Feb 26 '25 05:02 tomrod10

@cdrini

I'm running the following command in the web container:

PYTHONPATH=. python ./openlibrary/solr/update.py '/books/OL10021046M' --ol-config './config/openlibrary.yml' --ol-url 
'localhost:8080/books/OL10021046M/
The_Fishguard_to_Bangor_Trunk_Road_(A487)_(Llanwnda_to_South_of_Llanllyfni_Improvement)_Order_1999_H?debug=true'

Then I try to attach the debugger in VS Code, then set some break points and lastly send the query for the book mentioned above (OL10021046M or The Fishguard to Bangor Trunk Road) but nothing happens. No breakpoints are hit.

Breakpoint: https://github.com/internetarchive/openlibrary/blob/9840d5815377d9db707bbb85a966db0cdda1d471/openlibrary/solr/update.py#L95

Also, I noticed that in dev and prod the result appears when I query OL10021046M or The Fishguard to Bangor Trunk Road. I'm almost certain it didn't show up before 🤔

Logs from running the command above in the web container:

Couldn't find statsd_server section in config
0.0 (1): SELECT * FROM thing LIMIT 1
0.0 (2): SELECT * FROM thing WHERE key='/type/type'
0.0 (3): SELECT * FROM thing WHERE key='/type/type'
2025-03-26 01:14:08 [infobase.ol] [INFO] logging initialized
2025-03-26 01:14:08 [infobase.ol] [DEBUG] debug
2025-03-26 01:14:08 [openlibrary.olbase] [INFO] setting up infobase events for Open Library
2025-03-26 01:14:08 [openlibrary] [INFO] Application init
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin openlibrary
2025-03-26 01:14:08 [openlibrary] [INFO] Setting up requests
2025-03-26 01:14:08 [openlibrary] [INFO] Setting up proxy
2025-03-26 01:14:08 [openlibrary] [INFO] No proxy configuration found
2025-03-26 01:14:08 [openlibrary] [INFO] Setting up proxy bypass
2025-03-26 01:14:08 [openlibrary] [INFO] No proxy bypass configuration found
2025-03-26 01:14:08 [openlibrary] [INFO] Requests set up
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin worksearch
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin inside
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin books
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin admin
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin upstream
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin importapi
2025-03-26 01:14:08 [openlibrary] [INFO] loading plugin recaptcha
2025-03-26 01:14:08 [openlibrary] [INFO] loading complete.
2025-03-26 01:14:08 [infobase] [INFO] loading plugin openlibrary.olbase
2025-03-26 01:14:08 [infobase.ol] [INFO] logging initialized
2025-03-26 01:14:08 [infobase.ol] [DEBUG] debug
2025-03-26 01:14:08 [openlibrary.olbase] [INFO] setting up infobase events for Open Library
2025-03-26 01:14:08 [openlibrary.solr.data_provider] [INFO] preload_documents0 ['/books/OL10021046M']
2025-03-26 01:14:08 [openlibrary.solr.data_provider] [INFO] preload_documents0 ['/works/OL10021046M']
0.0 (4): SELECT thing.key, data.data from thing, data WHERE data.revision = thing.latest_revision and data.thing_id=thing.id  AND thing.key IN ('/works/OL10021046M')
2025-03-26 01:14:08 [openlibrary.solr.data_provider] [INFO] preload_editions_of_works ['/works/OL10021046M'] ..
0.0 (1): SELECT edition.key as edition_key, work.key as work_key FROM thing as edition, thing as work, edition_ref WHERE edition_ref.thing_id=edition.id   AND edition_ref.value=work.id   AND edition_ref.key_id=(select id from property where name='works' and type=(select id from thing where key='/type/edition'))   AND work.key in ('/works/OL10021046M')
2025-03-26 01:14:08 [openlibrary.solr.data_provider] [INFO] preload_documents0 ['/works/OL10021046M']
0.0 (5): SELECT thing.key, data.data from thing, data WHERE data.revision = thing.latest_revision and data.thing_id=thing.id  AND thing.key IN ('/works/OL10021046M')
2025-03-26 01:14:08 [openlibrary.solr.data_provider] [WARNING] NOT FOUND /works/OL10021046M
2025-03-26 01:14:08 [openlibrary.solr] [INFO] '/works/OL10021046M' has type '/type/delete'. queuing for deleting it solr.

tomrod10 avatar Mar 26 '25 01:03 tomrod10

Ah, because that script is running separate from the main open library app, you'll need to add the --debugger flag to the script; then it should pause and wait for your debugger to attach.

cdrini avatar Mar 26 '25 04:03 cdrini

Oh sorry, this script isn't wired up with --debugger. Copy the snippet here into this script's main method:

https://github.com/internetarchive/openlibrary/blob/9840d5815377d9db707bbb85a966db0cdda1d471/scripts/solr_updater.py#L262-L269

cdrini avatar Mar 26 '25 17:03 cdrini

@tomrod10 unassigning you, since as you mentioned on slack, you'll be busy with other things for a while :)

@pidgezero-one if you're interested this would be a good one! This has been blocking @seabelis for quite a while.

cdrini avatar May 29 '25 16:05 cdrini