openlibrary Non-book retail display items are being imported

Problem

Sometimes records with the requisite fields (title, authors, publishers, publish_date, and source_records) are nonetheless not books, and the title contains strong evidence of this. Consider the following recent imports:

https://openlibrary.org/works/OL39115273W/Rose_Madder_24cc_Dumpbin
https://openlibrary.org/works/OL39112422W/Bag_of_Bones_18_Copy_Bin
https://openlibrary.org/works/OL39113050W/Wizard_and_Glass_Poster
https://openlibrary.org/works/OL39112851W/Hammicks_2_for_%C3%9A10_48cc_King_Bin
https://openlibrary.org/works/OL39112866W/Song_of_Susannah_18_Copy_Hardback_Bin

All meet the minimum criteria for import, and none are books. "Bin", "dumpbin", "x copy", and "poster" are all terms for non-book display items.

Reproducing the bug

No response

Context

No response

Notes from this Issue's Lead

Proposal & constraints

It might be possible to use a regex, or some other match against the end of the title field, to see whether it ends in " dumpbin", " bin", or " poster" (note the leading space), but we'd want to make sure we don't block legitimate books (false positives).
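A minimal sketch of that end-of-title check, for discussion only. The names `NON_BOOK_SUFFIXES`, `NON_BOOK_TITLE_RE`, and `looks_like_non_book` are hypothetical and do not exist in the Open Library code base; the term list is just the three terms named above.

```python
import re

# Hypothetical term list, taken from the terms named in this issue.
NON_BOOK_SUFFIXES = ("dumpbin", "bin", "poster")

# Require whitespace before the term (the "leading space"), so that titles
# ending in "Robin" or "Cabin" are not caught by the "bin" suffix.
NON_BOOK_TITLE_RE = re.compile(
    r"\s(?:%s)\s*$" % "|".join(re.escape(s) for s in NON_BOOK_SUFFIXES),
    re.IGNORECASE,
)


def looks_like_non_book(title: str) -> bool:
    """Return True if the title ends in one of the non-book display terms."""
    return NON_BOOK_TITLE_RE.search(title) is not None
```

Note that a genuine book whose title ends in one of these words, e.g. "The Recycling Bin", would still match, which is why the false-positive rate needs checking against real data first.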

Perhaps as a test it would be possible to parse the Works dump, available at https://openlibrary.org/developers/dumps, and do a basic analysis to see what would be blocked from import by this basic title search.
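A rough sketch of that dump analysis, assuming the dump is a gzipped, tab-separated file with the JSON record in the fifth column and the record key in the second. The function names and the file path passed to `scan_dump` are hypothetical, and the regex is only the basic title search described above.

```python
import gzip
import json
import re
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# Basic end-of-title search; the term list is an assumption from this issue.
TITLE_RE = re.compile(r"\s(dumpbin|bin|poster)\s*$", re.IGNORECASE)


def iter_titles(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (key, title) pairs from dump lines, skipping malformed rows."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 5:
            continue
        try:
            record = json.loads(parts[4])
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        title = record.get("title")
        if isinstance(title, str):
            yield parts[1], title


def would_block(lines: Iterable[str]) -> Tuple[Counter, List[Tuple[str, str]]]:
    """Count matches per term and collect the matching (key, title) pairs."""
    counts: Counter = Counter()
    matches: List[Tuple[str, str]] = []
    for key, title in iter_titles(lines):
        found = TITLE_RE.search(title)
        if found:
            counts[found.group(1).lower()] += 1
            matches.append((key, title))
    return counts, matches


def scan_dump(path: str) -> Tuple[Counter, List[Tuple[str, str]]]:
    """Run the analysis over a gzipped dump file at a caller-supplied path."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return would_block(handle)
```

Reading through the collected matches by hand should show how many real books a title-only rule would have blocked.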

Directions for importing books locally can be found at https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing, but as a threshold matter it likely makes sense to test out the proposed solution using the data dump before implementing any solution in the Open Library code base.

Related files

Stakeholders

@seabelis

Instructions for Contributors

Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to Github, because the pre-commit bot may add commits to your PRs upstream.

Aug 17 '24 18:08 scottbarnes

"prepack" is another term to potentially block.

Aug 18 '24 07:08 seabelis

In the source data which is being imported there is some code to check the format:

https://github.com/internetarchive/openlibrary/blob/0293aca7ff55134807572a7e256987de537292be/scripts/partner_batch_imports.py#L112-L116

and

https://github.com/internetarchive/openlibrary/blob/0293aca7ff55134807572a7e256987de537292be/scripts/partner_batch_imports.py#L162-L164

Is there any way these records can be traced back to the source data to find the primary_format used? There may be some other codes that should be added to that non-book list.

Aug 18 '24 21:08 hornc

I think the solution may be extending the https://github.com/internetarchive/openlibrary/blob/master/scripts/partner_batch_imports.py#L237 quality checks for the partner imports, which run monthly around the 15th.

You should be able to find the data via archive.org items referenced in olsystem etl and ol-home0:/1/

Aug 19 '24 19:08 mekarpeles

From looking at one data file, there were 264 titles containing "Dumpbin". Most of them were marked as 'Trade Paper', and 3 were marked as e-books, so that kills my idea of having a format accept-list rather than the current NONBOOK reject list.

The data is not accurately annotated to be clear about what kind of item is being imported, which is disappointing.
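For anyone wanting to repeat this kind of check on another data file, here is a small sketch of the tally. It assumes each source record has already been parsed into a dict; the `title` and `primary_format` keys and the function name `format_counts_for_term` are assumptions for illustration, not the actual layout of the partner data.

```python
from collections import Counter
from typing import Iterable, Mapping


def format_counts_for_term(records: Iterable[Mapping], term: str) -> Counter:
    """Count the declared formats of records whose title contains `term`."""
    needle = term.lower()
    counts: Counter = Counter()
    for record in records:
        title = str(record.get("title", ""))
        if needle in title.lower():
            # Hypothetical key; records without it are tallied as "unknown".
            counts[record.get("primary_format", "unknown")] += 1
    return counts
```

Calling `format_counts_for_term(records, "dumpbin")` and then `.most_common()` on the result gives a format breakdown of the kind described above.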

Sep 06 '24 01:09 hornc