
MARC records listed as source records not being used (or used fully?)

Open tfmorris opened this issue 1 year ago • 5 comments

Problem

When investigating edition records with no publishers for #2119, I noticed cases where source_records lists MARC records that contain a publisher in MARC field 260, but the publisher is not being added to the edition record.

This edition: https://openlibrary.org/books/OL10298M?m=history was imported from a rather threadbare Scriblio MARC record: https://openlibrary.org/show-records/marc_records_scriblio_net/part29.dat:6096809:617 A later import claims to have used a much richer Columbia MARC record: https://openlibrary.org/books/OL10298M/Tiricirapuram_maka%CC%84vittuva%CC%84n%CC%B2_..._Amparp_pura%CC%84n%CC%A3am?b=5&a=4&_compare=Comparer&m=diff yet it didn't pull in the publisher from there.

Additionally, the original record elided words from the title, so a search on the full title returns zero hits, but I'm not sure there's a good way to detect and correct for that case.

The second example: https://openlibrary.org/books/OL12026877M also can't be found by title, in this case because it was imported from a threadbare (and incorrect) Amazon record with a typo in it. Despite "importing" from four higher-quality MARC records, all containing the correct title and a fully populated MARC 260 publisher field, neither the missing publisher nor the incorrect title was updated.

Before we try to guess publishers based on ISBN, the high-quality metadata that's already available should be fully exploited.

Reproducing the bug

  1. Go to ...
  2. Do ...
  • Expected behavior:
  • Actual behavior:

Context

  • Browser (Chrome, Safari, Firefox, etc):
  • OS (Windows, Mac, etc):
  • Logged in (Y/N):
  • Environment (prod, dev, local): prod

Breakdown

See this comment: https://github.com/internetarchive/openlibrary/issues/9831#issuecomment-2351781827

Requirements Checklist

Taken from https://github.com/internetarchive/openlibrary/issues/9831#issuecomment-2351781827:

  • [ ] identify no-publishers records with MARC sources
  • [ ] re-import those MARC records (once #9808 has been deployed to production)
  • [ ] the records should be correctly matched, and any existing publisher metadata will be updated using the latest import code

Related files

Stakeholders

  • @hornc

Instructions for Contributors

  • Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue, and each time after pushing code to GitHub, because the pre-commit bot may add commits to your PRs upstream.

tfmorris avatar Aug 30 '24 03:08 tfmorris

Should have included @hornc in the stakeholders - updated.

tfmorris avatar Aug 30 '24 03:08 tfmorris

I think the publishers field would be supplemented now. If not, adding that seems like a straightforward change, and I suspect it would be uncontroversial.

To clarify on the specific suggestion, @tfmorris, do you mean adding publishers to existing records solely in the case when the re-import source is a MARC record (if they're not already being added more broadly)?

Additionally, with respect to changing the title of an existing record, how do others feel about this? For my part, I think I'm willing to defer to a MARC record when it comes to clobbering other records, for the title field at least. @seabelis, @hornc, @cdrini?

One approach to limit the blast radius might be to take into account the original source, when it comes to whether a MARC record should clobber a title field, though that may just make things more confusing to work with.
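To make that idea concrete, here is a minimal sketch of such a policy. It is not existing Open Library code: the function name, the list of "low trust" prefixes, and the assumption that the first entry in source_records is the original source are all hypothetical.

```python
# Hypothetical policy sketch, not existing Open Library code.
# Assumption: source_records entries look like "amazon:...", "marc:...",
# and the first entry is the source the record was originally created from.
LOW_TRUST_PREFIXES = ("amazon:",)  # assumed list, for illustration only


def marc_may_overwrite_title(existing: dict, incoming_source: str) -> bool:
    """Allow a MARC import to replace the title only when the existing
    record was originally created from a low-trust source."""
    sources = existing.get("source_records", [])
    if not incoming_source.startswith("marc:") or not sources:
        return False
    # str.startswith accepts a tuple of prefixes.
    return sources[0].startswith(LOW_TRUST_PREFIXES)
```

Under this sketch, a record first created from an Amazon import could have its title replaced by a MARC import, while a record first created from MARC would keep its title.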

scottbarnes avatar Sep 13 '24 15:09 scottbarnes

It seems like there's a high error rate with matching these MARC imports to existing records. I'd not be in favour of modifying records based on them (but don't we already do that now?).

seabelis avatar Sep 13 '24 15:09 seabelis

#9808 may be responsive to the matching issue, but perhaps not. I am unsure of the full extent of it.

scottbarnes avatar Sep 13 '24 16:09 scottbarnes

I agree that #9808 should make existing record matching considerably better. It fixed a longstanding issue whereby records were frequently matched on title alone (ignoring subtitle and any other metadata). These matches were made before even attempting the more sophisticated threshold matching code that exists and has tests in the codebase.

Publishers from new records should currently be added to matched existing records if the existing publishers field is blank (I had to search for the code I thought/hoped existed):

https://github.com/internetarchive/openlibrary/blob/f64cab54045351216cd22b961691d7946ecc0a14/openlibrary/catalog/add_book/__init__.py#L834-L840

The publishers field was added to this list in Feb 2023, in this commit: https://github.com/internetarchive/openlibrary/commit/f6268b647eb8e783e0ef3f1203153a65e64c9c96

That is after the reported example, where the publisher wasn't added in Aug 2022: https://openlibrary.org/books/OL10298M/Tiricirapuram_maka%CC%84vittuva%CC%84n%CC%B2_S%CC%81ri%CC%84_Mi%CC%84n%CC%B2a%CC%84t%CC%A3cicuntaram_Pil%CC%A3l%CC%A3ai_avarkal%CC%A3_iyar%CC%B2r%CC%B2iya_Tiru_Amparp_pur?b=5&a=4&_compare=Comparer&m=diff

I believe the code does the correct thing now, but only since 2023, so there will be many examples where it has been missed.
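For readers who don't want to follow the link, the "add if blank" behaviour described above can be sketched roughly like this. This is a simplified illustration and not the actual add_book code: the function name and the field list are hypothetical, chosen only to show the idea.

```python
# Simplified illustration, not the actual Open Library import code.
# The field list below is an assumption; only "publishers" is taken
# from the discussion in this thread.
FILL_IF_BLANK = ("publishers", "publish_date", "number_of_pages")


def fill_blank_fields(existing: dict, incoming: dict, fields=FILL_IF_BLANK) -> bool:
    """Copy each listed field from `incoming` into `existing` only when
    `existing` has no value for it. Returns True if anything changed."""
    changed = False
    for field in fields:
        if not existing.get(field) and incoming.get(field):
            existing[field] = incoming[field]
            changed = True
    return changed
```

Note that a rule like this never touches a field that already has a value, so an incorrect title on the existing record would be left alone, which matches what was observed in the second example.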

If we wanted to populate missing publishers:

  • identify no-publishers records with MARC sources
  • re-import those MARC records (once #9808 has been deployed to production)
  • the records should be correctly matched, and any existing publisher metadata will be updated using the latest import code
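The first step in the list above could look something like the sketch below. It assumes edition records are available as plain dicts (for example, parsed from a data dump) and that MARC entries in source_records start with a "marc:" prefix; the helper names are hypothetical.

```python
# Hypothetical helpers for finding re-import candidates.
# Assumption: MARC entries in source_records start with "marc:".


def marc_sources(edition: dict) -> list:
    """Return the MARC source records listed on an edition."""
    return [s for s in edition.get("source_records", []) if s.startswith("marc:")]


def needs_publisher_reimport(edition: dict) -> bool:
    """True when the edition has no publishers but lists at least one MARC source."""
    return not edition.get("publishers") and bool(marc_sources(edition))


def reimport_candidates(editions):
    """Yield (edition key, MARC source records) for every candidate edition."""
    for edition in editions:
        if needs_publisher_reimport(edition):
            yield edition.get("key"), marc_sources(edition)
```

The output of reimport_candidates would then feed the re-import step, which is not sketched here because it depends on the import pipeline itself.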

hornc avatar Sep 15 '24 20:09 hornc