Have a list of work titles that will avoid conflation on edition title mismatch
Proposal
There are certain works that imported editions/works should not be automatically joined to. For example:
- Generic titles like "Bill" / "Law" (although I'm not sure we import these anymore)
- Certain series names, like "The Diary of a Wimpy Kid"
- Certain biographical books like "Picasso"
(Links welcome, please; I was having trouble finding the exact examples I've seen in the past.)
There aren't a ton of works in this category, and although a more systemic solution to the problem would be ideal, a temporary solution that is just a list making the resolution process stricter for these works would be a boon.
We noticed while investigating some old code in https://github.com/internetarchive/openlibrary/pull/10336#discussion_r1916885894 that there was a code path there that tried to do something like this, but which due to a logic bug never actually did anything at all. The idea behind it is sound/useful though.
Generally there are problems with titles containing the words "works", "novels", "stories", "plays", "poems", "selected", and "selections". These are frequently collected works where the collections vary. Editor names are important in these cases.
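As a rough illustration of the keyword idea above, here is a minimal sketch that flags titles containing any of those words as whole words. The names `COLLECTION_WORDS` and `looks_like_collection` are hypothetical and do not exist in the Open Library codebase; the word list is only the one from this comment, so it is English-only.

```python
import re

# Words from the comment above. The constant and helper names are
# hypothetical, not existing Open Library identifiers.
COLLECTION_WORDS = (
    "works", "novels", "stories", "plays", "poems", "selected", "selections",
)

# Whole-word, case-insensitive match so that "Fireworks" is not flagged.
_COLLECTION_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, COLLECTION_WORDS)) + r")\b",
    re.IGNORECASE,
)


def looks_like_collection(title: str) -> bool:
    """Return True if the title contains a collection-style word."""
    return _COLLECTION_RE.search(title) is not None
```

A title flagged this way would then need something beyond the title, such as editor names, before being matched to an existing work.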
Hello, I'm Shwetas, a final-year BTech-CSE student. I have strong experience in Java, JavaScript, and full-stack development. I’ve worked on open-source projects and built SaaS platforms, and I believe I can contribute effectively to this project. I would be grateful for the opportunity to be assigned to this issue. Thanks!
Here are some example titles:
- "Careers in Focus" https://openlibrary.org/works/OL31756013W/Careers_in_Focus
- "Eyewitness" https://openlibrary.org/works/OL7960560W/Eyewitness
- "Picasso" https://openlibrary.org/works/OL145191W/Picasso
- "Leonardo" https://openlibrary.org/works/OL695408W/Leonardo
- "Study Guide" https://openlibrary.org/works/OL21868175W/Study_Guide
- "Peppa Pig" https://openlibrary.org/search?q=%22peppa+pig%22&mode=everything
@scottbarnes could you please assign me that issue?
@shwetd19, I apologize for my tardy response. If you're still interested, can you help us with the breakdown by:
- Clarifying what you believe the task to be.
- Identifying which files are related to the issue.
- Asking any questions you may have about the goal or requirements of this issue.
- Proposing a solution or approach (based on the suggestions Drini gave).
Thank you @scottbarnes for following up! Yes, I'm still interested in contributing to this issue. Let me break down my understanding:
Task Understanding:
- The goal is to implement a system to prevent incorrect work conflation for specific titles that are prone to mismatches
- This involves creating a list/mechanism to handle cases like generic titles (e.g., "Careers in Focus"), series names, and biographical books (e.g., "Picasso", "Leonardo") where stricter matching criteria should be applied
- As noted by @seabelis, this would also need to handle titles containing words like "works", "novels", "stories", "plays", "poems", "selected", and "selections"
Related Files: Before proposing specific files, I have a few questions:
- Where is the current work conflation logic implemented?
- Is there existing code from #10336 that attempted something similar that I should review?
- Should this be implemented as a configuration file (like a blocklist) or directly in the matching logic?
Questions:
- Should we consider editor names as a required matching criterion for these special cases?
- What would be the preferred format for storing this list of titles/patterns?
- Is there a specific threshold or criteria for adding titles to this list?
Proposed Approach:
- Create a configuration file to store the list of titles/patterns that require strict matching
- Modify the work conflation logic to check against this list
- For matched titles, implement stricter comparison rules including:
  - Exact title matching instead of fuzzy matching
  - Consider additional metadata like editor names
- Handle series names separately
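To make the proposed approach more concrete, here is a minimal sketch of a stricter comparison for listed titles. Everything in it is an assumption for illustration: `Record`, `STRICT_TITLES`, and `may_join_work` are hypothetical names, the record shape is far simpler than real Open Library editions and works, and the rule "require at least one shared contributor" is just one possible strictness criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical, simplified stand-in for an edition or work."""
    title: str
    contributors: tuple[str, ...] = ()  # authors and/or editors


def normalize(text: str) -> str:
    """Case-insensitive comparison key with collapsed whitespace."""
    return " ".join(text.casefold().split())


# Hypothetical strict list, seeded with titles mentioned in this thread.
STRICT_TITLES = {normalize(t) for t in ("Picasso", "Leonardo", "Study Guide")}


def may_join_work(incoming: Record, work: Record) -> bool:
    """Decide whether an incoming record may be attached to a work."""
    title = normalize(incoming.title)
    if title != normalize(work.title):
        return False
    if title not in STRICT_TITLES:
        # Ordinary title: in this sketch, title equality is enough.
        return True
    # Strict title: additionally require a shared contributor.
    incoming_names = {normalize(n) for n in incoming.contributors}
    work_names = {normalize(n) for n in work.contributors}
    return bool(incoming_names & work_names)
```

Note that under this rule a strict title with no contributor data on either side never matches, which errs toward creating a new work rather than conflating.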
Well, I have no excuse, @shwetd19. In any event, putting my failures aside:
- To the extent there is conflation logic, https://github.com/internetarchive/openlibrary/blob/master/openlibrary/catalog/add_book/__init__.py is probably the best place to start. Specifically, `load()` is where the different functions are called to try to determine if we should be making a new edition: https://github.com/internetarchive/openlibrary/blob/8ee01197ccb6c28c6f466e267a5cdf4269eaeff1/openlibrary/catalog/add_book/__init__.py#L1014-L1023
- I think you're right that the deleted code in #10336 could serve as an inspiration: https://github.com/internetarchive/openlibrary/pull/10336/files. Likely there can just be a list of these titles and some sort of case-insensitive match, and then when there is a match, the import logic shouldn't match the title and work. @cdrini enumerated some examples here, which could be used to create unit tests ensuring the code doesn't put all these editions under the same work when they should be distinct works, each with its own editions: https://github.com/internetarchive/openlibrary/issues/10342#issuecomment-2622453443
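For illustration only, the "list plus case-insensitive match" idea might look roughly like the sketch below. None of these names exist in `add_book`; `CONFLATION_PRONE_TITLES`, `title_key`, `is_conflation_prone`, and `find_work_by_title` are hypothetical, and the dictionary lookup is a stand-in for whatever the real import logic does to find a candidate work.

```python
import unicodedata
from typing import Optional

# Example titles taken from earlier in this thread.
CONFLATION_PRONE_TITLES = (
    "Careers in Focus", "Eyewitness", "Picasso",
    "Leonardo", "Study Guide", "Peppa Pig",
)


def title_key(title: str) -> str:
    """Build a case-insensitive, whitespace-normalized comparison key."""
    normalized = unicodedata.normalize("NFKC", title)
    return " ".join(normalized.casefold().split())


_PRONE_KEYS = frozenset(title_key(t) for t in CONFLATION_PRONE_TITLES)


def is_conflation_prone(title: str) -> bool:
    """Return True if the whole title is on the list (exact match only)."""
    return title_key(title) in _PRONE_KEYS


def find_work_by_title(title: str, works_by_key: dict[str, str]) -> Optional[str]:
    """Hypothetical title-only lookup that refuses to match listed titles.

    Returns an existing work key, or None to signal "create a new work".
    """
    if is_conflation_prone(title):
        return None
    return works_by_key.get(title_key(title))
```

Because the match is on the whole title, "Peppa Pig: The Movie" would not be caught by the "Peppa Pig" entry; whether prefix or series matching is wanted is one of the knobs that the tests could help settle.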
To respond more directly to your questions: although this will be far from perfect, and for the moment will only work for the languages whose terms we add to the list, hopefully we can just get by with a list and some matching. As for thresholds and requirements, I think this is something that may take some experimentation. Perhaps the best approach is to create the tests in advance using the examples from @cdrini's comment (mentioned above), ensuring the tests only pass if the relevant editions and works are created or matched as expected. Then the question can be what knobs and levers to manipulate to make that happen.
I will assign this to you, @shwetd19, though almost certainly I have waited too long. But if you're still interested, please say something all the same, and if not, please also let me know how I have failed you when you had the time to work on this. :)
For anyone raring to go, this issue is once again open.