
Some early imports have a `machine_comment` on the edit pointing to a MARC source that is NOT present in the `source_records` metadata

Open hornc opened this issue 1 month ago • 5 comments

Problem

relates to #11469

example: https://openlibrary.org/books/OL3555082M/Lucy

source_records:
- marc:marc_openlibraries_phillipsacademy/PANO_FOR_IA_05072019.mrc:2809759:1310
- bwb:9780393051537
- marc:marc_loc_2016/BooksAll.2016.part29.utf8:173787268:917
- promise:bwb_daily_pallets_2020-11-27
- marc:marc_columbia/Columbia-extract-20221130-007.mrc:397135905:1282
- marc:harvard_bibliographic_metadata/ab.bib.09.20150123.full.mrc:18846330:1424
- marc:harvard_bibliographic_metadata/20220215_023.bib.mrc:563868:2493

Yet in the history view it shows:

"Imported from Scriblio MARC record" => https://openlibrary.org/show-records/marc_records_scriblio_net/part14.dat:24012253:917

To see the full content of the edit history with the machine_comments you need to use the JSON API: https://openlibrary.org/books/OL3555082M.json?m=history
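A minimal sketch of pulling the machine comments out of that history response. It assumes the response is a JSON list of revision objects carrying `revision` and `machine_comment` keys; the helper names `fetch_history` and `machine_comments` are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_history(olid):
    """Fetch the edit history for an edition OLID, e.g. 'OL3555082M'."""
    url = f"https://openlibrary.org/books/{olid}.json?m=history"
    with urlopen(url) as resp:
        return json.load(resp)


def machine_comments(history):
    """Return (revision, machine_comment) pairs for revisions that carry one.

    Assumes each history entry is a dict with optional 'revision' and
    'machine_comment' keys. Every revision is checked, not just the first.
    """
    return [
        (rev.get("revision"), rev["machine_comment"])
        for rev in history
        if rev.get("machine_comment")
    ]
```

Usage would be along the lines of `machine_comments(fetch_history("OL3555082M"))`.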

Ideally:

  • All source records should be reflected in source_records metadata
  • machine_comment should be deprecated / combined or moved to the comment field
  • initial add item comment messages should be helpful
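As a rough illustration of the first point, a MARC reference found in the history could be normalized into the `marc:` form used by `source_records` and merged in without duplicates. This is only a sketch: it assumes the reference is either a `show-records` URL or a bare `name/file:offset:length` locator like the ones quoted above, and the function names are hypothetical.

```python
import re

SHOW_RECORDS_PREFIX = "https://openlibrary.org/show-records/"
# Shape of the MARC locators quoted in this issue: name/file:offset:length
MARC_LOCATOR = re.compile(r"[^\s:]+/[^\s:]+:\d+:\d+")


def to_source_record(ref):
    """Turn a show-records URL or bare locator into a 'marc:' key, else None."""
    locator = ref.strip()
    if locator.startswith(SHOW_RECORDS_PREFIX):
        locator = locator[len(SHOW_RECORDS_PREFIX):]
    if locator.startswith("marc:"):
        locator = locator[len("marc:"):]
    if MARC_LOCATOR.fullmatch(locator):
        return "marc:" + locator
    return None


def merge_source_record(source_records, ref):
    """Return a new list with the normalized reference appended if it is new."""
    key = to_source_record(ref)
    if key is None or key in source_records:
        return list(source_records)
    return list(source_records) + [key]
```

Non-MARC references such as `bwb:9780393051537` are left alone by this sketch, since they do not match the locator shape.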

Impact

This affects archive.org metadata fetch.

I thought the lookup API for fetching MARC records only uses the source_records? I'm not sure how it would have access to the edit history, but then this has been in place for a long time. Have we been missing out on these MARC sources because they are hidden in edit comments and not in the records themselves? ANS: Yes.

Test lookups for this example:

  • https://openlibrary.org/api/books.json?bibkeys=isbn:0393051536
  • https://openlibrary.org/api/books.json?bibkeys=lccn:2002010093

archive.org has its own API for this which shows the MARC records ~TODO get a~ sample response for this example and see whether the scriblio MARC is listed. Perhaps find an example with only a machine_comment MARC and no other source_records to illustrate the worst case.

This is an example archive.org needs to scan: https://openlibrary.org/books/OL13443775M/The_cuckoo_clock_and_The_tapestry_room

The DWWI API response does not include marc_source or marc_data even though there is a machine_comment connecting it to a Boston Public Library MARC: https://openlibrary.org/show-records/bpl_marc/bpl115.mrc:3683740:869

Stakeholders

  • @seabelis
  • @judec
  • @hornc

hornc avatar Nov 27 '25 21:11 hornc

@cdrini Is there a way to get an extract from the edit comments DB for all initial edits to editions that have a machine comment that looks like a source record?

I think the full metadata fix is to go over all of them that were created like this and fill in the initial source record. I can script something to do that, but a list of affected editions and their sources would be helpful.

I'm trying to think of a better way. The most useful ones to fix are edition records without any MARC source_records that have a MARC machine_comment. Those are ones that archive.org would most benefit from having MARC data linked should they be scanned in future.
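The selection described here could look something like the sketch below. It assumes an extract already exists that pairs each edition record with the machine comments from its history, and that a MARC machine comment is a bare `name/file:offset:length` locator; both are assumptions, and `backfill_candidates` is a hypothetical name.

```python
import re

# Assumed shape of a MARC machine comment: name/file:offset:length
MARC_LOCATOR = re.compile(r"\S+/\S+:\d+:\d+")


def has_marc_source(edition):
    """True if the edition already lists at least one 'marc:' source record."""
    return any(s.startswith("marc:") for s in edition.get("source_records") or [])


def backfill_candidates(rows):
    """Yield (edition key, MARC locators) for the highest-priority editions.

    rows is an iterable of (edition_dict, [machine_comment, ...]) pairs.
    Editions that already have a MARC source record are skipped.
    """
    for edition, comments in rows:
        if has_marc_source(edition):
            continue
        marc_refs = [c for c in comments if c and MARC_LOCATOR.fullmatch(c)]
        if marc_refs:
            yield edition["key"], marc_refs
```

Editions with only non-MARC source records still qualify under this filter, which matches the "without any MARC source_records" wording rather than "without any source_records".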

hornc avatar Nov 27 '25 23:11 hornc

It's unfortunate that none of this information is available in the public dumps (and the recent changes API is restricted so that you can't access the full history).

Note that the machine comments aren't always on v1. There are some like https://openlibrary.org/books/OL20000000M with multiple machine_comment records, where the second machine comment gets rendered as "Found a matching MARC record."

From a quick very non-random sampling, I'd guess that the majority of the first 25M+ editions will meet your criteria, assuming you want to include Amazon source records in addition to MARC.

Ideally, each source record, new or old, should be linked from the edit history so that it's easy to see the dates, etc.

A git-style "blame" feature to extend the current version diff would be a super cool addition.

tfmorris avatar Nov 28 '25 02:11 tfmorris

I was trying to avoid having to edit 25M, but the number is still going to be large. Here are some results from an early Nov 2025 editions dump:

| year | min_date | max_date | min_olid | max_olid | src_marc | src_other | src_none |
|---|---|---|---|---|---:|---:|---:|
| UNKN | UNKNOWN | UNKNOWN | OL10005288M | OL10006011M | 0 | 0 | 3,204,513 |
| 2008 | 2008-04-01 | 2008-12-31 | OL1000683M | OL22792692M | 9,859,902 | 2,457,450 | 6,671,818 |
| 2009 | 2009-01-01 | 2009-12-31 | OL22792737M | OL23988174M | 670,357 | 257,990 | 250,231 |
| 2010 | 2010-01-01 | 2010-12-31 | OL23992570M | OL24549453M | 312,676 | 214,930 | 23,602 |
| 2011 | 2011-01-01 | 2011-12-31 | OL24549985M | OL25155213M | 385,726 | 180,584 | 23,932 |
| 2012 | 2012-01-01 | 2012-12-31 | OL25155397M | OL25420712M | 198,453 | 47,573 | 16,929 |
| 2013 | 2013-01-01 | 2013-12-31 | OL25420731M | OL25435637M | 1,943 | 2,338 | 10,070 |
| 2014 | 2014-01-01 | 2014-12-31 | OL25435656M | OL25648729M | 27,077 | 154,858 | 14,995 |
| 2015 | 2015-01-01 | 2015-12-31 | OL25648788M | OL25884180M | 6,085 | 27,172 | 20,650 |
| 2016 | 2016-01-01 | 2016-12-31 | OL25884366M | OL26210169M | 85,963 | 159,716 | 48,457 |
| 2017 | 2017-01-01 | 2017-12-31 | OL26210209M | OL26412268M | 16,841 | 153,495 | 28,319 |
| 2018 | 2018-01-01 | 2018-12-31 | OL26412543M | OL26630349M | 54,696 | 66,276 | 22,958 |
| 2019 | 2019-01-01 | 2019-12-31 | OL26630426M | OL27866185M | 328,488 | 595,189 | 19,195 |
| 2020 | 2020-01-01 | 2020-12-31 | OL27868034M | OL31837154M | 1,439,035 | 1,606,233 | 22,345 |
| 2021 | 2021-01-01 | 2021-12-31 | OL31838531M | OL36392434M | 122,053 | 3,566,494 | 28,798 |
| 2022 | 2022-01-01 | 2022-12-31 | OL36535476M | OL45134460M | 1,593,384 | 6,575,402 | 28,860 |
| 2023 | 2023-01-01 | 2023-12-31 | OL45192907M | OL50527406M | 1,261,829 | 3,893,075 | 35,190 |
| 2024 | 2024-01-01 | 2024-12-31 | OL50527472M | OL57389278M | 1,728,833 | 4,384,774 | 28,761 |
| 2025 | 2025-01-01 | 2025-11-06 | OL57392477M | OL60594687M | 1,640,811 | 657,979 | 29,526 |
| TOTAL | | | | | 19,734,152 | 25,001,528 | 10,529,149 |

The min and max OLIDs are fuzzier than I intended because the grouping only takes the date into account, not the exact time, but they give an indication of when OLIDs were created.
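For anyone wanting to reproduce a tally like this, here is a minimal sketch, not the script actually used. It assumes the tab-separated editions dump layout (type, key, revision, last_modified, JSON), that a record counts as `src_marc` if any of its `source_records` starts with `marc:`, and that the year comes from the record's `created` value, with records lacking one bucketed as `UNKN`.

```python
import json
from collections import Counter


def classify(edition):
    """Bucket an edition by its source_records: src_marc, src_other or src_none."""
    sources = edition.get("source_records") or []
    if not sources:
        return "src_none"
    if any(s.startswith("marc:") for s in sources):
        return "src_marc"
    return "src_other"


def creation_year(edition):
    """Return the 4-digit creation year, or 'UNKN' when 'created' is missing."""
    created = edition.get("created")
    value = created.get("value", "") if isinstance(created, dict) else ""
    year = value[:4]
    return year if len(year) == 4 and year.isdigit() else "UNKN"


def tally(lines):
    """Count editions per (year, bucket) from dump lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t", 4)
        if len(fields) < 5:
            continue  # skip malformed rows
        edition = json.loads(fields[4])
        counts[(creation_year(edition), classify(edition))] += 1
    return counts
```

Passing an open file handle over the decompressed dump to `tally` streams it line by line, so the whole dump never has to fit in memory.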

hornc avatar Dec 01 '25 04:12 hornc

I'm not sure how to interpret this data. Does it say anything about machine_comment records? I didn't think they were included in the editions dump, but only the recent changes API.

tfmorris avatar Dec 01 '25 04:12 tfmorris

@tfmorris No, it doesn't say anything directly about machine_comments. It's just looking at the recorded source_records.

I was checking:

  • my suspicion that recent bulk imports set the source_record on import, so we don't need to worry about those,
  • and also testing whether fixing just the records that have no source_record in metadata would be an efficient way to deal with the problem.

The data shows that there are about 10M records without any source_record from the earliest days of OL which are likely to be those imported with machine_comments. Unknown date records seem to be early ~2008/2009 anonymous bulk imports too, and most I've checked have machine_comments. Some are user added editions across all years, but those barely register compared to the bulk imports.

I'm pleasantly surprised that the problem seems to be limited to 2008 imports, but also surprised that so many of OL's total records are still from 2008 (the 25M number you identified).

It's good to see that about half of the early records are MARC sourced. Many (most?) of the un-sourced records have MARCs in their machine comments, so we can increase that.

I plan to extract the machine comments from just the 10M no_source records, at least to start with, then look at expanding non-MARC sources if I can find a MARC reference.

Records with MARC sources are most valuable for metadata lookups for archive.org scanning and anyone else wanting full MARC, so filling in those gaps is my priority. I'm treating one library-quality MARC as being as good as any other (unless one is obviously corrupted), so filling in records that already have one won't change that outcome. Non-MARC, light bib data tends to be fully captured in the OL record, so linking is less valuable.

hornc avatar Dec 01 '25 05:12 hornc