Non-book retail display items are being imported
Problem
Sometimes records with the requisite fields (title, authors, publishers, publish_date, and source_records) are nonetheless not books, and the title contains strong evidence of this. Consider the following recent imports:
- https://openlibrary.org/works/OL39115273W/Rose_Madder_24cc_Dumpbin
- https://openlibrary.org/works/OL39112422W/Bag_of_Bones_18_Copy_Bin
- https://openlibrary.org/works/OL39113050W/Wizard_and_Glass_Poster
- https://openlibrary.org/works/OL39112851W/Hammicks_2_for_%C3%9A10_48cc_King_Bin
- https://openlibrary.org/works/OL39112866W/Song_of_Susannah_18_Copy_Hardback_Bin
All meet the minimum criteria for import, yet none are books. "Bin", "dumpbin", "x copy", and "poster" are all terms for non-book display items.
Reproducing the bug
No response
Context
No response
Notes from this Issue's Lead
Proposal & constraints
It might be possible to use a regex or otherwise match the end of the title field to see if it ends in " dumpbin", " bin", or " poster" (note the leading space), but we'd want to ensure we don't produce false positives that block legitimate books.
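A minimal sketch of that end-of-title match, for discussion only. The name `looks_like_display_item` and the term list are assumptions taken from the examples in this issue, not existing Open Library code:

```python
import re

# Hypothetical term list, drawn from the example titles in this issue.
# The leading \s requires whitespace before the term, so "Cabin" or
# "Robin" do not match on their trailing "bin".
NONBOOK_TITLE_SUFFIX = re.compile(r"\s(?:dumpbin|bin|poster)$", re.IGNORECASE)


def looks_like_display_item(title: str) -> bool:
    """Return True if the title ends in one of the non-book display terms."""
    return bool(NONBOOK_TITLE_SUFFIX.search(title.strip()))
```

Note that a genuine book whose title happens to end in " Bin" or " Poster" would also match, which is the false-positive risk mentioned above.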
Perhaps as a test it would be possible to parse the Works dump, available at https://openlibrary.org/developers/dumps, and do a basic analysis to see what would be blocked from import by this basic title search.
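One way that analysis could look is sketched below. It assumes the dump is a gzipped, tab-separated file whose fifth column is the record JSON. The function names, the term list, and the file path passed to `summarize` are all hypothetical:

```python
import gzip
import json
import re
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Hypothetical term list from this issue; the capture group records which term matched.
SUFFIX = re.compile(r"\s(dumpbin|bin|poster)$", re.IGNORECASE)


def flagged_titles(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (key, title, matched_term) for each dump line whose title matches."""
    for line in lines:
        # Assumed layout: type, key, revision, last_modified, JSON.
        fields = line.rstrip("\n").split("\t", 4)
        if len(fields) < 5:
            continue
        try:
            record = json.loads(fields[4])
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        title = record.get("title")
        if not isinstance(title, str):
            continue
        match = SUFFIX.search(title.strip())
        if match:
            yield record.get("key", fields[1]), title, match.group(1).lower()


def summarize(path: str) -> Counter:
    """Print every flagged work and return a count per matched term."""
    counts: Counter = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        for key, title, term in flagged_titles(dump):
            counts[term] += 1
            print(key, title, sep="\t")
    return counts
```

Reading the printed titles by eye should give a rough sense of how many legitimate books the pattern would block.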
Directions for importing books locally can be found at https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing, but as a threshold matter it likely makes sense to test out the proposed solution using the data dump before implementing any solution in the Open Library code base.
Related files
Stakeholders
@seabelis
Instructions for Contributors
- Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to GitHub, because the pre-commit bot may add commits to your PRs upstream.
"prepack" is another term to potentially block.
For the source data being imported, there is some code that checks the format:
https://github.com/internetarchive/openlibrary/blob/0293aca7ff55134807572a7e256987de537292be/scripts/partner_batch_imports.py#L112-L116
and
https://github.com/internetarchive/openlibrary/blob/0293aca7ff55134807572a7e256987de537292be/scripts/partner_batch_imports.py#L162-L164
Is there any way to trace these records back to the source data and find the primary_format used? There may be some other codes that should be added to that non-book list.
I think the solution may be to extend the https://github.com/internetarchive/openlibrary/blob/master/scripts/partner_batch_imports.py#L237 quality checks for the partner imports, which run monthly around the 15th.
You should be able to find the data via archive.org items referenced in olsystem etl and ol-home0:/1/
From looking at one data file, there were 264 titles containing "Dumpbin". Most of them were marked as 'Trade Paper', and 3 were marked as e-books, so that kills my idea of having a format accept-list rather than the current NONBOOK reject list.
The data is not accurately annotated to be clear about what kind of item is being imported, which is disappointing.
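For anyone who wants to repeat that tally on other data files, here is a rough sketch. It assumes each record has already been parsed into a dict. The `title` and `primary_format` keys and the function name `formats_for_term` are illustrative assumptions, not the actual layout of the partner data:

```python
from collections import Counter
from typing import Dict, Iterable


def formats_for_term(records: Iterable[Dict[str, str]], term: str) -> Counter:
    """Count the declared format of every record whose title contains term."""
    needle = term.lower()
    return Counter(
        # Hypothetical keys; adjust to however the source file is parsed.
        rec.get("primary_format", "unknown")
        for rec in records
        if needle in rec.get("title", "").lower()
    )
```

Calling `formats_for_term(records, "dumpbin")` returns a `Counter` keyed by format, which makes it easy to see whether any single format reliably identifies these items.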