openlibrary
Imports via ISBN searches create duplicate works
When a user searches by ISBN and the edition is not already present in the catalog, the edition is imported. However, this frequently creates a new work when an appropriate work already exists.
Evidence / Screenshot (if possible)
Relevant url?
- https://openlibrary.org/books/OL28122354M/The_Hound_of_the_Baskervilles (imported edition)
- https://openlibrary.org/works/OL20779126W/The_Hound_of_the_Baskervilles (created work)

This import should have been associated with https://openlibrary.org/works/OL262454W/The_Hound_of_the_Baskervilles
Steps to Reproduce
- Search for an ISBN that does not exist in the Open Library catalog, but exists on Amazon and whose book already has a corresponding work on Open Library.
- If imported, check to see if a new work (duplicate) was created or if the edition is associated with the existing work. (I've experienced both cases.)
- Actual: Sometimes the import creates a new work when an appropriate work already exists.
- Expected: If an appropriate work exists, the import should not create a new work.
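The check in step 2 can also be done against the edition's JSON record rather than by eye. A minimal sketch, assuming edition records (e.g. what `/books/<olid>.json` returns) carry a `works` list of `{"key": "/works/..."}` entries; the helper names are hypothetical and the sample record is hand-built from the URLs above, not fetched:

```python
def work_keys(edition: dict) -> list:
    """Return the work keys an edition record points at (assumed 'works' field)."""
    return [w["key"] for w in edition.get("works", []) if "key" in w]


def attached_to_wrong_work(edition: dict, expected_work_key: str) -> bool:
    """True when the edition is linked to some work, but not the expected one."""
    keys = work_keys(edition)
    return bool(keys) and expected_work_key not in keys


# Hand-built sample mirroring the example in this issue (not a live API response).
imported_edition = {
    "key": "/books/OL28122354M",
    "title": "The Hound of the Baskervilles",
    "works": [{"key": "/works/OL20779126W"}],
}

print(attached_to_wrong_work(imported_edition, "/works/OL262454W"))
```

For the edition above this reports a mismatch, which is the "Actual" case; an edition whose `works` list contains `/works/OL262454W` would be the "Expected" case.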
Details
- Logged in (Y/N)?
- Browser type/version?
- Operating system?
- Environment (prod/dev/local)? prod
Proposal & Constraints
Related files
Stakeholders
@mekarpeles @cdrini
@seabelis I think we need an example that remains unmerged :)
I think this just happened to me, so I've purposely not merged them (hopefully they stay that way while you investigate):
- a) https://openlibrary.org/works/OL20791728W/Lily_and_the_octopus
- b) https://openlibrary.org/works/OL20040818W/Lily_and_the_octopus
some hopefully helpful details:
- 1) these are both duplicates (i'm guessing they came into being via my goodreads csv import, where i had ISBNs `1501126229` and `9781501126222` present [1])
- 2) they're both correctly attached to this author: https://openlibrary.org/authors/OL7611067A/Steven_Rowley
- 3) `works/OL20791728W` is the one where my data shows up, so I'm guessing that's perhaps the duplicate generated by my CSV.
- 4) interestingly it's surprisingly difficult to spot/fix/understand duplicate works [2]
So questions this raised for me, but honestly these are all tangent to fixing the importer bug (so feel free to ask me to file separate meta-issue to discuss or something):
- 4a) is there a way to detect duplicates?
- 4b) when someone does find a dupe, how do you report/fix it? Currently the top search result i found was this FAQ (https://openlibrary.org/help/faq?v=102#delete-book), which says to email [email protected], but i'm guessing that's not the intended/scalable way to merge duplicates.
- 4c) is there an author-page bug here causing both works not to list?
  Concretely: if you click the author's link from either work, you can't find your way to the other work from the author's page (despite point 2 above). Once you land on the author's page, only `works/OL20040818W` shows.
- 4d) the work object is confusingly identifying itself as if it's a particular edition; steps to repro:
  - i) land on an edition, like `books/OL27220852M`
  - ii) (good) note: the banner across the top: "An edition of **Lily and the octopus** (2016)"
  - iii) (good) note: text i bolded above is a hyperlink to the work `works/OL20040818W`
  - iv) click that hyperlink to see the work itself
  - v) (good) note: you're now on the work page (see URL `/work/...`)
  - vi) (bad/bug) the banner is still present, implying you're viewing an edition
- 4e) the author seems to be ignored when import creates a new edition? even though my import row had defined the author, the author wasn't populated in the newly-created edition (and as the update in my next comment below shows, this might cause similar/orthogonal duplicate-author problems)
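On 4a, one possible heuristic (not something I know OL to do today) is to group works by a normalized title plus their set of author keys and flag any group with more than one work. A rough sketch; the flat `title` / `author_keys` dict shape and the function names are my own simplification, and the third sample row uses a placeholder author key:

```python
import re
from collections import defaultdict


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def find_duplicate_works(works: list) -> list:
    """Group works sharing a normalized title and author set; return groups of 2+ keys."""
    groups = defaultdict(list)
    for work in works:
        signature = (normalize_title(work["title"]), frozenset(work["author_keys"]))
        groups[signature].append(work["key"])
    return [sorted(keys) for keys in groups.values() if len(keys) > 1]


# Simplified sample records; only the first two come from this thread.
sample_works = [
    {"key": "/works/OL20791728W", "title": "Lily and the octopus",
     "author_keys": ["/authors/OL7611067A"]},
    {"key": "/works/OL20040818W", "title": "Lily and the Octopus",
     "author_keys": ["/authors/OL7611067A"]},
    {"key": "/works/OL262454W", "title": "The Hound of the Baskervilles",
     "author_keys": ["/authors/PLACEHOLDER"]},  # placeholder author key
]

print(find_duplicate_works(sample_works))
```

This only catches duplicates that share an author record, so it would miss the duplicate-author case from 4e; matching on normalized author names instead of keys would be a looser variant.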
1: here's the relevant snippet of my CSV:
Book Id,Title,Author,Author l-f,Additional Authors,ISBN,ISBN13,My Rating,Average Rating,Publisher,Binding,Number of Pages,Year Published,Original Publication Year,Date Read,Date Added,Bookshelves,Bookshelves with positions,Exclusive Shelf,My Review,Spoiler,Private Notes,Read Count,Recommended For,Recommended By,Owned Copies,Original Purchase Date,Original Purchase Location,Condition,Condition Description,BCID
27276262,Lily and the Octopus,Steven Rowley,"Rowley, Steven",,"=""1501126229""","=""9781501126222""",5,3.71,Simon Schuster,Hardcover,307,2016,2016,2018/04/16,2018/02/25,own-hardcopy,own-hardcopy (#3),read,"I read over half the book in just a day, without realizing it. Beware: you will bawl.",,,1,,,1,,,unspecified,,
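Note that the Goodreads export wraps both ISBN columns in a spreadsheet-style `="..."` guard, so whatever reads this CSV has to unwrap them before any ISBN lookup. A small sketch of that unwrapping with the standard `csv` module; `clean_isbn` and `isbns_from_goodreads_csv` are hypothetical names, and the sample is cut down to four of the columns above:

```python
import csv
import io


def clean_isbn(raw: str) -> str:
    """Strip the ="..." wrapper Goodreads puts around ISBN cells, plus any hyphens."""
    value = raw.strip()
    if value.startswith('="') and value.endswith('"'):
        value = value[2:-1]
    return value.replace("-", "").strip()


def isbns_from_goodreads_csv(text: str) -> list:
    """Return (isbn10, isbn13) pairs for every row of a Goodreads export."""
    reader = csv.DictReader(io.StringIO(text))
    return [(clean_isbn(row["ISBN"]), clean_isbn(row["ISBN13"])) for row in reader]


# Reduced version of the row above, keeping only the columns used here.
SAMPLE = '''Title,Author,ISBN,ISBN13
Lily and the Octopus,Steven Rowley,"=""1501126229""","=""9781501126222"""
'''

print(isbns_from_goodreads_csv(SAMPLE))
```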
2: I was only able to do it because I knew I should have had "already read" data and a rating attached, but I wasn't seeing it. Finding the other edition required manually paging through my profile's data until i found a link to the book, and indeed it had a different work ID.
i'm guessing search index updated and now there's actually four duplicates of the same work:
also in the sidebar you can see three duplicates of the same author:
screenshot of results for the query "Lily and the octopus" (explicitly overriding the "ebook" default mode since the point here is to understand OL's works data): https://openlibrary.org/search?q=Lily+and+the+octopus&mode=everything here:
okay last update, in case i'm just causing noise for veteran OL folks :)
i'm going to leave this for mek and others familiar with OL to sort out, but as a newbie to OL (ie: possibly very wrong) here's what i've understood so far:
problem: import creates new work for any unrecognized isbn (ie: edition)
I think the problem is that the import logic searches against an ISBN, and if that edition doesn't exist in OL already, the API returns a newly created edition and correspondingly a newly created work. it's the newly created work which is perhaps the only bug?
solution: make best-effort to match against works by non-ID means (eg: title,author, etc.)
if that^ understanding is right, then one fix could be: a bit of new UX and wiring into the (solr?) search; eg:
- 1) update the import logic, when it encounters an unrecognized ISBN, to first do a general search against `(title,author)` (as happens under the hood by users, eg search above).
- 2) If there's a match against said search, include that in the import UI but let users see that we're doing two things:
  - a) we're proposing creating a whole new edition
  - b) we're proposing that this is actually part of `/works/1234`

by adding (2b) user can untick the row if it's wrong, and/or we can add a step (3) that lets the user correct it inline with some new UX.
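To make that flow concrete, here's a rough sketch of the decision logic only. Everything in it is hypothetical: `ImportProposal`, `propose_import`, and the two lookup callables are stand-ins i made up, backed by in-memory dicts rather than the real catalog or solr.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ImportProposal:
    isbn: str
    create_edition: bool       # (2a) a new edition would be created
    work_key: Optional[str]    # (2b) existing work to attach to; None means "new work"


def propose_import(
    isbn: str,
    title: str,
    author: str,
    work_for_isbn: Callable[[str], Optional[str]],
    search_works: Callable[[str, str], List[str]],
) -> ImportProposal:
    """Look up by ISBN first; for an unrecognized ISBN, fall back to (title, author)."""
    known_work = work_for_isbn(isbn)
    if known_work is not None:
        return ImportProposal(isbn, create_edition=False, work_key=known_work)
    candidates = search_works(title, author)
    best = candidates[0] if candidates else None
    return ImportProposal(isbn, create_edition=True, work_key=best)


# In-memory stand-ins for the catalog and the search index (assumptions).
EDITION_WORKS = {}
WORK_INDEX = {("lily and the octopus", "steven rowley"): ["/works/OL20040818W"]}


def work_for_isbn(isbn: str) -> Optional[str]:
    return EDITION_WORKS.get(isbn)


def search_works(title: str, author: str) -> List[str]:
    return WORK_INDEX.get((title.lower(), author.lower()), [])


proposal = propose_import(
    "9781501126222", "Lily and the Octopus", "Steven Rowley",
    work_for_isbn, search_works,
)
print(proposal)
```

The proposal is returned rather than applied so the import UI can show both parts (2a and 2b) and let the user untick or correct the suggested work before anything is written.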
alternative: maybe just don't call `/isbn/1234.json` anymore in the UI and do an advanced/field-specific search like this instead (and if that returns no matches, query again without the ISBN)?
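A sketch of what those two queries could look like, in the order they'd be tried. I'm assuming the `search.json` endpoint accepts solr-style `title:`, `author:` and `isbn:` fields inside `q`, which i haven't verified; `build_search_urls` is a made-up helper that only builds the URLs and doesn't send anything:

```python
from urllib.parse import urlencode

SEARCH_URL = "https://openlibrary.org/search.json"


def build_search_urls(title: str, author: str, isbn: str = "") -> list:
    """Return search URLs to try in order: with the ISBN first, then without it."""
    base_query = f'title:"{title}" author:"{author}"'
    queries = []
    if isbn:
        queries.append(f"{base_query} isbn:{isbn}")
    queries.append(base_query)  # fallback when the ISBN-constrained query has no matches
    return [SEARCH_URL + "?" + urlencode({"q": query}) for query in queries]


for url in build_search_urls("Lily and the Octopus", "Steven Rowley", "9781501126222"):
    print(url)
```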