Some early imports have a `machine_comment` on the edit pointing to a MARC source that is NOT present in the `source_records` metadata
Problem
relates to #11469
example: https://openlibrary.org/books/OL3555082M/Lucy
source_records:
- marc:marc_openlibraries_phillipsacademy/PANO_FOR_IA_05072019.mrc:2809759:1310
- bwb:9780393051537
- marc:marc_loc_2016/BooksAll.2016.part29.utf8:173787268:917
- promise:bwb_daily_pallets_2020-11-27
- marc:marc_columbia/Columbia-extract-20221130-007.mrc:397135905:1282
- marc:harvard_bibliographic_metadata/ab.bib.09.20150123.full.mrc:18846330:1424
- marc:harvard_bibliographic_metadata/20220215_023.bib.mrc:563868:2493
Yet in the history view it shows:
"Imported from Scriblio MARC record" => https://openlibrary.org/show-records/marc_records_scriblio_net/part14.dat:24012253:917
To see the full content of the edit history with the machine_comments you need to use the JSON API: https://openlibrary.org/books/OL3555082M.json?m=history
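A rough Python sketch of how the gap could be checked for a single edition: compare the `machine_comment` values in the history JSON against `source_records`. This assumes the history payload is a JSON list of objects with `revision` and `machine_comment` keys, and that a MARC machine comment holds a bare `<path>:<offset>:<length>` locator (as the show-records link above suggests). The function names and the regular expression are illustrative, not existing Open Library code.

```python
import json
import re
from urllib.request import urlopen

# Assumed locator shape: <collection>/<file>:<offset>:<length>, optional "marc:" prefix.
MARC_LOCATOR = re.compile(r"^(?:marc:)?([^:\s]+/[^:\s]+:\d+:\d+)$")


def fetch_history(olid: str) -> list:
    """Download the edit history through the ?m=history JSON API."""
    with urlopen(f"https://openlibrary.org/books/{olid}.json?m=history") as resp:
        return json.load(resp)


def hidden_marc_sources(history: list, source_records: list) -> list:
    """Return (revision, 'marc:...') pairs that appear only in machine_comments."""
    known = set(source_records)
    hidden = []
    for entry in history:  # scan every revision, not only the first one
        comment = entry.get("machine_comment")
        match = MARC_LOCATOR.match(comment.strip()) if isinstance(comment, str) else None
        if match and "marc:" + match.group(1) not in known:
            hidden.append((entry.get("revision"), "marc:" + match.group(1)))
    return hidden
```

Calling `hidden_marc_sources(fetch_history("OL3555082M"), edition["source_records"])` would then list any MARC locators that exist only in the edit history, provided the payload matches the assumed shape.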
Ideally:
- All source records should be reflected in `source_records` metadata
- `machine_comment` should be deprecated / combined or moved to the `comment` field
- initial add item `comment` messages should be helpful
Impact
This affects archive.org metadata fetch.
I thought the lookup API for fetching MARC records only uses the source_records?
I'm not sure how it would have access to the edit history, but then this has been in place for a long time. Have we been missing out on these MARC sources because they are hidden in edit comments and not in the records themselves? ANS: Yes.
Test lookups for this example:
- https://openlibrary.org/api/books.json?bibkeys=isbn:0393051536
- https://openlibrary.org/api/books.json?bibkeys=lccn:2002010093
archive.org has its own API for this, which shows the MARC records. ~TODO get a~ sample response for this example and see whether the scriblio MARC is listed. Perhaps find an example with only a `machine_comment` MARC and no other `source_records` to illustrate the worst case.
This is an example archive.org needs to scan: https://openlibrary.org/books/OL13443775M/The_cuckoo_clock_and_The_tapestry_room
The DWWI API response does not include `marc_source` or `marc_data`, even though there is a `machine_comment` connecting it to a Boston Public Library MARC: https://openlibrary.org/show-records/bpl_marc/bpl115.mrc:3683740:869
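For reference, the MARC locators quoted in this thread seem to follow a `<path>:<offset>:<length>` pattern, prefixed with `marc:` when they appear in `source_records`. A minimal Python sketch, assuming that pattern holds generally; the helper names `parse_marc_source` and `show_records_url` are hypothetical, not Open Library code:

```python
from typing import NamedTuple, Optional


class MarcLocator(NamedTuple):
    path: str    # e.g. "bpl_marc/bpl115.mrc"
    offset: int  # byte offset of the record within the file
    length: int  # record length in bytes


def parse_marc_source(source_record: str) -> Optional[MarcLocator]:
    """Parse 'marc:<path>:<offset>:<length>'; return None for anything else."""
    if not source_record.startswith("marc:"):
        return None
    try:
        path, offset, length = source_record[len("marc:"):].rsplit(":", 2)
        return MarcLocator(path, int(offset), int(length))
    except ValueError:  # too few parts, or non-numeric offset/length
        return None


def show_records_url(loc: MarcLocator) -> str:
    """Build the show-records link in the form used in this issue."""
    return f"https://openlibrary.org/show-records/{loc.path}:{loc.offset}:{loc.length}"
```

For example, `show_records_url(parse_marc_source("marc:bpl_marc/bpl115.mrc:3683740:869"))` gives back the Boston Public Library link above, while non-MARC entries such as `bwb:9780393051537` parse to `None`.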
Stakeholders
- @seabelis
- @judec
- @hornc
@cdrini Is there a way to get an extract from the edit comments DB for all initial edits to editions that have a machine comment that looks like a source record?
I think the full metadata fix is to go over all of them that were created like this and fill in the initial source record. I can script something to do that, but a list of affected editions and their sources would be helpful.
I'm trying to think of a better way. The most useful ones to fix are edition records without any MARC source_records that have a MARC machine_comment. Those are ones that archive.org would most benefit from having MARC data linked should they be scanned in future.
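The selection rule described here (no MARC in `source_records`, but a MARC-looking `machine_comment`) could be sketched as follows. This is only an illustration in Python: the locator pattern, the function names, and the idea of passing machine comments in as plain strings are all assumptions, since the real extract format is not known yet.

```python
import re

# Assumed locator shape: <collection>/<file>:<offset>:<length>, optional "marc:" prefix.
MARC_LOCATOR = re.compile(r"^(?:marc:)?([^:\s]+/[^:\s]+:\d+:\d+)$")


def has_marc_source(edition: dict) -> bool:
    """True if any existing source_records entry is a MARC reference."""
    return any(s.startswith("marc:") for s in edition.get("source_records") or [])


def marc_backfill_candidates(edition: dict, machine_comments: list) -> list:
    """Return 'marc:...' entries to add, or [] if the edition already has a MARC."""
    if has_marc_source(edition):
        return []  # one library-quality MARC is treated as good enough
    candidates = []
    for comment in machine_comments:
        match = MARC_LOCATOR.match(comment.strip())
        if match:
            entry = "marc:" + match.group(1)
            if entry not in candidates:  # keep order, drop duplicates
                candidates.append(entry)
    return candidates
```

Editions for which this returns a non-empty list are the ones where a backfill would give archive.org a MARC link it cannot currently see.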
It's unfortunate that none of this information is available in the public dumps (and the recent changes API is restricted so that you can't access the full history).
Note that the machine comments aren't always on v1. There are some like https://openlibrary.org/books/OL20000000M with multiple machine_comment records, where the second machine comment gets rendered as "Found a matching MARC record."
From a quick very non-random sampling, I'd guess that the majority of the first 25M+ editions will meet your criteria, assuming you want to include Amazon source records in addition to MARC.
Ideally, each source record, new or old, should be linked from the edit history so that it's easy to see the dates, etc.
A git-style "blame" feature to extend the current version diff would be a super cool addition.
I was trying to avoid having to edit 25M, but the number is still going to be large. Here are some results from an early Nov 2025 edition dump:
| year | min_date | max_date | min_olid | max_olid | src_marc | src_other | src_none |
|---|---|---|---|---|---|---|---|
| UNKN | UNKNOWN | UNKNOWN | OL10005288M | OL10006011M | 0 | 0 | 3,204,513 |
| 2008 | 2008-04-01 | 2008-12-31 | OL1000683M | OL22792692M | 9,859,902 | 2,457,450 | 6,671,818 |
| 2009 | 2009-01-01 | 2009-12-31 | OL22792737M | OL23988174M | 670,357 | 257,990 | 250,231 |
| 2010 | 2010-01-01 | 2010-12-31 | OL23992570M | OL24549453M | 312,676 | 214,930 | 23,602 |
| 2011 | 2011-01-01 | 2011-12-31 | OL24549985M | OL25155213M | 385,726 | 180,584 | 23,932 |
| 2012 | 2012-01-01 | 2012-12-31 | OL25155397M | OL25420712M | 198,453 | 47,573 | 16,929 |
| 2013 | 2013-01-01 | 2013-12-31 | OL25420731M | OL25435637M | 1,943 | 2,338 | 10,070 |
| 2014 | 2014-01-01 | 2014-12-31 | OL25435656M | OL25648729M | 27,077 | 154,858 | 14,995 |
| 2015 | 2015-01-01 | 2015-12-31 | OL25648788M | OL25884180M | 6,085 | 27,172 | 20,650 |
| 2016 | 2016-01-01 | 2016-12-31 | OL25884366M | OL26210169M | 85,963 | 159,716 | 48,457 |
| 2017 | 2017-01-01 | 2017-12-31 | OL26210209M | OL26412268M | 16,841 | 153,495 | 28,319 |
| 2018 | 2018-01-01 | 2018-12-31 | OL26412543M | OL26630349M | 54,696 | 66,276 | 22,958 |
| 2019 | 2019-01-01 | 2019-12-31 | OL26630426M | OL27866185M | 328,488 | 595,189 | 19,195 |
| 2020 | 2020-01-01 | 2020-12-31 | OL27868034M | OL31837154M | 1,439,035 | 1,606,233 | 22,345 |
| 2021 | 2021-01-01 | 2021-12-31 | OL31838531M | OL36392434M | 122,053 | 3,566,494 | 28,798 |
| 2022 | 2022-01-01 | 2022-12-31 | OL36535476M | OL45134460M | 1,593,384 | 6,575,402 | 28,860 |
| 2023 | 2023-01-01 | 2023-12-31 | OL45192907M | OL50527406M | 1,261,829 | 3,893,075 | 35,190 |
| 2024 | 2024-01-01 | 2024-12-31 | OL50527472M | OL57389278M | 1,728,833 | 4,384,774 | 28,761 |
| 2025 | 2025-01-01 | 2025-11-06 | OL57392477M | OL60594687M | 1,640,811 | 657,979 | 29,526 |
| TOTAL | | | | | 19,734,152 | 25,001,528 | 10,529,149 |
The min and max OLIDs are fuzzier than I intended because the query doesn't take into account the exact time, just the date, but they give an indication of when OLIDs were created.
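A tally like the one in the table could be reproduced roughly as below. This is a sketch, not the query that was actually run: it assumes the editions dump is tab-separated with the JSON record in the fifth column, that the year comes from the record's `created.value` timestamp (with `UNKN` when it is missing), and that the three buckets are mutually exclusive, with `src_marc` meaning at least one `marc:` entry.

```python
import json
from collections import Counter


def classify(edition: dict) -> str:
    """Bucket an edition by what its source_records contain."""
    sources = edition.get("source_records") or []
    if not sources:
        return "src_none"
    if any(s.startswith("marc:") for s in sources):
        return "src_marc"
    return "src_other"


def year_of(edition: dict) -> str:
    """Creation year as a string, or 'UNKN' when no created date is recorded."""
    value = (edition.get("created") or {}).get("value", "")
    return value[:4] if value[:4].isdigit() and len(value) >= 4 else "UNKN"


def tally(lines) -> Counter:
    """Count editions per (year, bucket) from dump lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t", 4)
        if len(fields) != 5 or fields[0] != "/type/edition":
            continue  # skip malformed rows and non-edition records
        edition = json.loads(fields[4])
        counts[(year_of(edition), classify(edition))] += 1
    return counts
```

Passing an open dump file to `tally` streams it line by line, so the whole dump never has to fit in memory.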
I'm not sure how to interpret this data. Does it say anything about machine_comment records? I didn't think they were included in the editions dump, but only the recent changes API.
@tfmorris No, it doesn't say anything directly about machine_comments. It's just looking at the recorded source_records.
I was checking:
- my suspicion that recent bulk imports set the `source_record` on import, so we don't need to worry about those,
- and also testing whether fixing just the records that have no `source_record` in metadata would be an efficient way to deal with the problem.
The data shows that there are about 10M records without any source_record from the earliest days of OL which are likely to be those imported with machine_comments. Unknown date records seem to be early ~2008/2009 anonymous bulk imports too, and most I've checked have machine_comments. Some are user added editions across all years, but those barely register compared to the bulk imports.
I'm pleasantly surprised that the problem seems to be limited to 2008 imports, but also surprised that so many of OL's total records are still from 2008 (the 25M number you identified).
It's good to see that about half of the early records are MARC sourced. Many (most?) of the un-sourced records have MARCs in their machine comments, so we can increase that.
I plan to extract the machine comments from just the 10M no_source records, at least to start with, then look at expanding non-MARC sources if I can find a MARC reference.
Records with MARC sources are most valuable for metadata lookups for archive.org scanning and anyone else wanting full MARC, so filling in those gaps is my priority. I'm treating one library-quality MARC as good as any other (unless one is obviously corrupted), so filling in records that already have one won't change that outcome. Non-MARC, light bib data tends to be fully captured in the OL record, so linking is less valuable.