openlibrary Promise item imports need to augment metadata by any ASIN/ISBN10 if only title + ASIN is provided

Problem

https://openlibrary.org/books/OL51751249M/25_Melodic_and_Progressive_Studies was imported 12 hours ago.

It has no date, author, or publisher. This is exactly the kind of record that causes problems with imports (as reported in #8977) and with scanning metadata, and it should have been fixed by #8903 / #9030.

This record has an ISBN-10, and Amazon has additional metadata (date, author, and publisher) associated with that ISBN: 0825699770 https://www.amazon.com/dp/0825699770

Why weren't these added?

Here is another example: https://openlibrary.org/books/OL51751279M/The_Adventures_of_Tom_Sawyer(Unabridged)

No date and no author, but metadata is available: https://www.amazon.com/dp/7500144237

ImportBot's recent history contains many newly created 'publisher unknown' records.

Ah... I see the problem, the changes in the last PR only applied to non-ISBN ASINs.

The important aspect is to look up missing metadata by ASIN if there is only an ASIN in the promise item.

We need a modification of #8903 where the record is augmented if the import has insufficient metadata to be useful for matching future imports or matching for scanned metadata.
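As a rough illustration of the check being asked for, the sketch below decides whether a promise item should be augmented. It assumes the item is a plain dict whose keys are `authors`, `publish_date`, `publishers`, and `asin`; those key names, and the helper names `is_isbn10` and `needs_augmentation`, are hypothetical and are not taken from #8903. An ASIN is treated as usable for lookup when it is shaped like an ISBN-10 and its check digit validates.

```python
import re

# ISBN-10: nine digits followed by a digit or X (the check character).
ISBN10_RE = re.compile(r"^[0-9]{9}[0-9X]$")


def is_isbn10(value: str) -> bool:
    """Return True if value is a well-formed ISBN-10 with a valid check digit."""
    value = value.strip().upper()
    if not ISBN10_RE.match(value):
        return False
    # Weighted sum with weights 10..1 must be divisible by 11; X counts as 10.
    total = sum(
        (10 if ch == "X" else int(ch)) * weight
        for ch, weight in zip(value, range(10, 0, -1))
    )
    return total % 11 == 0


def needs_augmentation(item: dict) -> bool:
    """True when author, date, or publisher is missing and the ASIN is an ISBN-10."""
    # Field names here are assumptions made for this sketch.
    missing = [f for f in ("authors", "publish_date", "publishers") if not item.get(f)]
    return bool(missing) and is_isbn10(item.get("asin", ""))
```

Both ISBNs from the examples above (0825699770 and 7500144237) pass the check-digit test, so title-only items carrying them would be flagged for augmentation under this sketch.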

Notes from this Issue's Lead

Proposal & constraints

As an Open Librarian, in order to ensure the catalog is useful for unambiguous matching to support all basic search/match based use cases of:

material discovery by patrons,
re-imports,
scanned item metadata population

I require that only 'differentiable' and 'sane' records are imported.

Additionally, I'd also like every imported record to eventually be 'complete'.

Definitions:

To be 'complete', a record:

MUST have a title.
MUST have an author or statement of responsibility IF one exists
MUST have a date IF one exists
SHOULD have a publisher
SHOULD have one or more strong identifiers IF they are known
MAY have one or more other identifiers
MAY have other metadata, depending on the source

A 'differentiable' record has enough metadata to identify which edition it represents. It must be either 'complete', or have a 'strong identifier' which could be used to reliably locate metadata to make it complete.
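The two definitions above can be sketched as predicates. This is a minimal sketch under stated assumptions: records are plain dicts, the field names (`title`, `authors`, `publish_date`, `isbn_10`, `isbn_13`, `lccn`, `oclc_numbers`) are assumed for illustration, and the "IF one exists" conditions are approximated by simply requiring the field, since code cannot tell whether an author or date exists for an edition.

```python
# Fields treated as required for 'complete' in this sketch (an approximation
# of the MUST rules; "IF one exists" cannot be checked mechanically).
REQUIRED_FIELDS = ("title", "authors", "publish_date")

# Fields treated as strong identifiers in this sketch (assumed names).
STRONG_ID_FIELDS = ("isbn_10", "isbn_13", "lccn", "oclc_numbers")


def is_complete(rec: dict) -> bool:
    """True if every required field is present and non-empty."""
    return all(rec.get(field) for field in REQUIRED_FIELDS)


def has_strong_identifier(rec: dict) -> bool:
    """True if at least one strong identifier field is populated."""
    return any(rec.get(field) for field in STRONG_ID_FIELDS)


def is_differentiable(rec: dict) -> bool:
    """Complete, or carrying an identifier that could be used to complete it."""
    return is_complete(rec) or has_strong_identifier(rec)
```

Under this sketch a title-only record with an ISBN-10 is differentiable but not complete, which is the case this issue wants augmented rather than imported as-is.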

A 'strong identifier' is one such as an ISBN, LCCN, or OCN which can generally be relied upon to locate all possible metadata for that edition. (YMMV: edition vs. work level representation and other practical issues may need to be considered here. ISBN is a clear good example, with multiple sources and check digit reliability.)

Bookseller-only identifiers (such as B0* ASINs) are not generally reliable enough for this purpose. They frequently do not persist over time. A bookseller identifier may happen to locate enough metadata to complete a record, but if it can't be done now, it's unlikely to give better results in future. Whether an ASIN is directly from Amazon or a 3rd party seller may have some bearing on how useful that ASIN is, but in general they are still less reliable than a strong identifier.

A 'sane' record is one where every populated field also passes a basic sanity check that it contains an accurate value. It is not good enough that each of the fields merely be populated. Records that combine all metadata into the title field (author + date + other notes or extraneous content) should fail. Unrealistic dates such as 0000, 1000, 9000, 9999, etc. will also fail.
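For the date part of the 'sane' definition, a check could look roughly like the following. It is only a sketch: the function name `has_sane_year`, the set of suspect years (taken from the examples above), and the rule that a year more than one year in the future is rejected are all assumptions, not existing Open Library behaviour.

```python
import re
from datetime import date

# Placeholder years named in the definition above.
SUSPECT_YEARS = {0, 1000, 9000, 9999}

# A standalone four-digit run, e.g. the 1900 in "January 1, 1900".
YEAR_RE = re.compile(r"(?<![0-9])([0-9]{4})(?![0-9])")


def has_sane_year(publish_date: str, today=None) -> bool:
    """True if publish_date contains a four-digit year that is plausible."""
    match = YEAR_RE.search(publish_date)
    if not match:
        return False
    year = int(match.group(1))
    # Assumption: allow at most one year ahead for pre-announced titles.
    latest = (today or date.today()).year + 1
    return year not in SUSPECT_YEARS and year <= latest
```

The conjoined-title case is harder to detect mechanically and is not covered by this sketch.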

Related files

Import candidate Completeness and id type:

Suggested workflow: import-flow drawio

Stakeholders

@scottbarnes @mekarpeles

Jun 16 '24 13:06 hornc

can I work on this issue? @hornc @mekarpeles

Jun 18 '24 09:06 hollermay

@hollermay, thank you for your interest in this issue. I think this issue will probably be tackled by a staff member, and we need to finalize what we want to do first, so I am going to change the labels for the issue for now. It may be worth looking at Good First Issues to see if any of those interest you.

Jun 21 '24 16:06 scottbarnes

Per a discussion with @hornc and @judec, we will want to keep statistics about when BWB has incomplete data, and when AMZ has incomplete data. At this moment it's unclear whether this would be a separate PR, but I am writing this here to capture the need.
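One possible shape for those statistics, as a sketch only: count missing fields per source with a `collections.Counter`. The source labels (`"bwb"`, `"amz"`), the tracked field names, and the helper `record_incomplete` are hypothetical, and where such counters would actually be stored or reported is left open.

```python
from collections import Counter

# Fields whose absence is counted (assumed names for this sketch).
TRACKED_FIELDS = ("authors", "publish_date", "publishers")


def record_incomplete(stats: Counter, source: str, rec: dict) -> None:
    """Increment one counter per (source, missing field) pair."""
    for field in TRACKED_FIELDS:
        if not rec.get(field):
            stats[(source, field)] += 1


stats = Counter()
record_incomplete(stats, "bwb", {"title": "T"})
record_incomplete(stats, "amz", {"title": "T", "authors": ["A"], "publish_date": "2003"})
```

After these two calls, `stats` holds one count for each of the three fields missing from the BWB record and one count for the publisher missing from the AMZ record.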

Jul 16 '24 20:07 scottbarnes

#6555 should be revisited after this is completed.

Jul 18 '24 17:07 scottbarnes

Per a discussion with @hornc and @judec, we will want to keep statistics about when BWB has incomplete data, and when AMZ has incomplete data. At this moment it's unclear whether this would be a separate PR, but I am writing this here to capture the need.

@scottbarnes, Would it be useful to also track when they have bad data? This would seem imho to be as often as not, though that is purely subjective.

Jul 18 '24 20:07 LeadSongDog

@LeadSongDog, yes, that makes sense to me, though I confess off the top of my head I am not sure of great ways to do that, largely because I am unsure of how to mechanically determine a particular metadata field has 'bad' data, unless you're thinking, for example, when the publish_date is 'obviously' wrong (January 1, 1900, etc.).

Can you expand a bit more on what metrics one might use?

Jul 18 '24 20:07 scottbarnes

Title conjoined with author and/or date is a common case. Conjoined multiple authors is another (often with only surnames).

I would be reluctant to see any author record created solely on the basis of a transient AMZ/BWB record.

Pubdate of Jan 1 of any year is almost always an artifact (what publisher works New Year’s Day?)

Co-authors must be alive at the same time: if not, the earlier should be shown as the author and the later a contributor of some kind.

When an author is named but not readily differentiable from others of the same name, that should be reflected, as for “John Smith undifferentiated”, to avoid automatic attribution to the wrong author.
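Two of the heuristics above are simple enough to sketch: flagging a January 1 publication date as a likely artifact, and checking that two co-authors' lifespans overlap. The function names and the `(birth_year, death_year)` tuple representation are assumptions for illustration; a `None` death year is treated as "still living or unknown".

```python
from datetime import date


def is_jan_first(published: date) -> bool:
    """True if the date falls on January 1, which is usually a placeholder."""
    return published.month == 1 and published.day == 1


def lifespans_overlap(a: tuple, b: tuple) -> bool:
    """True if two (birth_year, death_year) spans share at least one year."""
    a_birth, a_death = a
    b_birth, b_death = b
    # An open-ended span (death_year is None) extends indefinitely.
    a_end = float("inf") if a_death is None else a_death
    b_end = float("inf") if b_death is None else b_death
    return a_birth <= b_end and b_birth <= a_end
```

A failed overlap check would only flag the record for review; deciding which name is the author and which is a contributor still needs the rule described above.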

Jul 18 '24 21:07 LeadSongDog

@mekarpeles having clear test cases to demonstrate that this does the correct thing in all cases would have been very helpful. Can you please add them @scottbarnes? It looks like I got confused about adding source_record as a required field, but that's all the more reason to have clear test cases for each of the relevant cases we are trying to catch.

It looks like we will have to get 'sane' data examples of dates and authors defined on another PR. I think dates are handled, but again demonstrating and documenting the cases in tests will make it a lot clearer.

Aug 14 '24 20:08 hornc

@hornc, yes, I will add the tests. I agree it would help to have them more clear.

Aug 14 '24 20:08 scottbarnes

Additional considerations: In cases where name-only authors (sans disambiguation) are attributed in the source record, it would be helpful to check for title, subject, or publisher matches among books by any of the possible synonymous authors. The default behaviour of simply creating a new author record just creates more merging work downstream.

An obvious sanity check would ensure that the work is created during the author’s lifespan, but I frequently see exceptions that need to be corrected, so I infer it is not being checked.
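Both suggestions can be sketched in a few lines. Everything here is hypothetical: the helper names, the `known_terms` list attached to each candidate author (standing in for publishers and subjects of their existing books), and the rule that a match is accepted only when exactly one candidate overlaps. Title matching is omitted for brevity.

```python
def within_lifespan(work_year: int, birth_year: int, death_year=None) -> bool:
    """True if work_year is not before birth and not after a known death year."""
    if work_year < birth_year:
        return False
    return death_year is None or work_year <= death_year


def pick_author(rec: dict, candidates: list):
    """Return the key of the only candidate sharing a publisher or subject, else None."""
    rec_terms = {t.lower() for t in rec.get("publishers", []) + rec.get("subjects", [])}
    matches = [
        c["key"]
        for c in candidates
        if rec_terms & {t.lower() for t in c.get("known_terms", [])}
    ]
    # Ambiguous (several matches) or no match: leave the decision to a human.
    return matches[0] if len(matches) == 1 else None
```

Returning `None` when the result is ambiguous is a deliberate choice in this sketch, since a wrong automatic attribution is harder to undo than a deferred one.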

Aug 14 '24 21:08 LeadSongDog