Dealing with Duplicate Authors

Open TomatDividedBy0 opened this issue 3 years ago • 18 comments

Currently an author can be registered separately as Soren Kierkegaard, Søren Kierkegaard, and Sören Kierkegaard, all of which are treated separately and will have books split among them.

Giving users the ability to mark Soren Kierkegaard and Sören Kierkegaard as "aliases" of Søren Kierkegaard would make the cataloguing a lot cleaner.
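To illustrate why the three spellings end up as separate records, here is a minimal sketch of a name-folding key that maps all of them to the same string. The `fold_name` helper and its mapping table are hypothetical and are not part of BookWyrm. Note that "ø" has no Unicode decomposition, so it needs an explicit mapping on top of the usual NFKD step. Such a key could only suggest possible aliases; it cannot prove that two records describe the same person.

```python
import unicodedata

# Letters such as "ø" have no Unicode decomposition, so NFKD alone leaves them
# untouched. This table is illustrative only and far from complete.
_EXTRA_FOLDS = str.maketrans({"ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE", "ß": "ss"})


def fold_name(name: str) -> str:
    """Return a comparison key with diacritics removed, case folded, spaces collapsed."""
    mapped = name.translate(_EXTRA_FOLDS)
    decomposed = unicodedata.normalize("NFKD", mapped)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


names = ["Soren Kierkegaard", "Søren Kierkegaard", "Sören Kierkegaard"]
keys = {fold_name(n) for n in names}
print(keys)
```

All three spellings collapse to one key, which is the kind of hint a user could then confirm or reject when marking aliases.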

TomatDividedBy0 avatar May 21 '21 15:05 TomatDividedBy0

With apologies for chiming in on this issue unprompted, I’d like to make the observation that BookWyrm is already halfway there: it is possible to add aliases to an author – they just aren’t hooked up to anything (example – there are author entries for the aliases on that page, but no way to tell).

What is missing is a canonical author representation that integrates all aliases, so that any version of the name points to the same author page. I have no idea how much that would complicate the model, so I will refrain from judging how hard this would be to implement, but I will add that there is also already a UX workflow for confirming this kind of linking (the “is this a known author” dialog BookWyrm presents when editing a book’s author), so the change in the model is essentially all that is needed AFAICS.

kopischke avatar Oct 01 '21 16:10 kopischke

Ideally we'd hook into the ISNI database to dedupe authors, but this is one of the hard problems of bibliographic data science, so I'm not sure how feasible that is for BookWyrm's workflow.

hughrun avatar Oct 28 '21 01:10 hughrun

If there's a viable way to legitimately access the ISNI data, I think it could definitely be incorporated (and super valuable). I couldn't find any information on whether they sanction programmatic access or offer API keys, though, and my experience with things in the general OCLC sphere is that they aren't very accessible to people outside of institutions.

mouse-reeve avatar Oct 28 '21 02:10 mouse-reeve

I'm happy to play around with this and report back!

The docs are pretty obscure but as best I can tell we just use an SRU request to a completely open API. Unfortunately it returns XML with an XSLT stylesheet but, well, that's OCLC and the library metadata world for you. For example:

http://isni.oclc.org/sru/?query=pica.nw+%3D+%22Soren+Kierkegaard%22&operation=searchRetrieve&recordSchema=isni-b

There also appears to be a mirror at isni.oclc.nl
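As a rough sketch, the example request above can be assembled with the standard library. The endpoint, the `pica.nw` index, and the `isni-b` schema are taken from the URL in this comment; the function name `build_isni_query_url` is made up for illustration, and nothing here is the code from the draft PR. The sketch only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the example URL above.
ISNI_SRU_ENDPOINT = "http://isni.oclc.org/sru/"


def build_isni_query_url(author_name: str) -> str:
    """Build an SRU searchRetrieve URL for a name search (hypothetical helper)."""
    params = {
        # Assumes the name contains no double quotes; escaping is not handled here.
        "query": f'pica.nw = "{author_name}"',
        "operation": "searchRetrieve",
        "recordSchema": "isni-b",
    }
    return f"{ISNI_SRU_ENDPOINT}?{urlencode(params)}"


url = build_isni_query_url("Soren Kierkegaard")
print(url)
```

The XML response could then be fetched with `urllib.request.urlopen` and parsed with `xml.etree.ElementTree`; the structure of the returned records is not shown here because it is not described in this thread.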

hughrun avatar Oct 28 '21 04:10 hughrun

Ok as you can see from the draft PR, I have a proof of concept for this, though at present it doesn't do anything particularly useful other than display data in the book editing UI:

søren

We can search for authors in the ISNI database with a free GET request, no API keys required. Then we can display their brief description/bio with a link to their ISNI page. If readers can select the correct author from this list, that will go a long way to reducing duplications, and as @kopischke noted we're already most of the way there: author records already have an isni field and an aliases field, and we can fill or enrich those values from ISNI.

hughrun avatar Oct 29 '21 05:10 hughrun

Don't worry about the weird encoding, I worked it out: utf8
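For readers wondering what the "weird encoding" looks like: the string "søren" is what the UTF-8 bytes of "søren" become when they are decoded as Latin-1. Whether that is exactly what happened in the draft PR is an assumption; the snippet below only demonstrates the general effect and how it reverses.

```python
original = "søren"

# UTF-8 encodes "ø" as two bytes (0xC3 0xB8).
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as Latin-1 turns each byte into its own character.
misread = utf8_bytes.decode("latin-1")

# Reversing the wrong decode and decoding as UTF-8 restores the name.
repaired = misread.encode("latin-1").decode("utf-8")

print(misread, repaired)
```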

hughrun avatar Oct 29 '21 09:10 hughrun

#1581 should reduce the frequency of this problem but doesn't actually provide a way to merge already-existing records. I'll have more of a think about that: it's probably something we'd want admins to do rather than just anyone, but it needs to be easy (like a checkbox yes/no), and probably should use some kind of scheduled or run-on-demand background task.

hughrun avatar Nov 01 '21 10:11 hughrun

> #1581 should reduce the frequency of this problem but doesn't actually provide a way to merge already-existing records.

Just spitballing here, but one thing we might be able to leverage is the existing alias system, i.e. wherever aliases overlap author names, queue the entries for possible merging.

> it's probably something we'd want admins to do rather than just anyone

The question about who should be able to do this is an interesting one that I feel might go beyond this specific issue. As of now, due to its small size and devoted community, spam, vandalism and ill-intentioned manipulation of data are not an issue on BookWyrm, but if Mastodon (or Goodreads, for that matter) is any reference, that will change once it gains more traction.

kopischke avatar Nov 01 '21 10:11 kopischke

@kopischke yep, the existing name-or-alias query is pretty good. The problem is that names are not unique! So without a truly unique identifier you really need a human to eyeball each potential match at some point. My comment you quote was probably a bit unclear: what I mean is that there is no way to merge records automatically that is guaranteed to be correct.

We definitely need another piece of functionality to manually merge them on the basis of some auto-generated helpful hints regarding potential matches - but I don't think the place for that is where I was working this time (the "edit book" workflow).

I realise this may look like a backwards way to come at the problem but I figured it's easier to clean up the old mess if you've at least partially stemmed the flow of new mess coming in.

hughrun avatar Nov 01 '21 10:11 hughrun

> The problem is that names are not unique! So without a truly unique identifier you really need a human to eyeball each potential match at some point.

@hughrun totally agree, we can’t and shouldn’t do that automatically. My point was that once we have a system in place to attach an ISNI ID to a BookWyrm Author entity, which this PR provides, we’ll have a starting point for manual merging. By identifying Authors that have no ISNI ID attached but have an alias matching one or more of those that do, we should get a pretty good base for identifying merge candidates. And yes, that queue should absolutely be reviewed manually – there’s simply too much context to mind (and possibly research to do) for anything else.
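The candidate queue described here could look roughly like the sketch below. It uses plain dataclasses rather than BookWyrm's actual Django models; the `Author` class, the `merge_candidates` function, and the placeholder ISNI value are all hypothetical. The output is a list of pairs for a human to review, not a list of merges to perform.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Author:
    """Simplified stand-in for an author record (not BookWyrm's real model)."""
    id: int
    name: str
    aliases: List[str] = field(default_factory=list)
    isni: Optional[str] = None


def known_names(author: Author) -> Set[str]:
    """All names the record answers to, case-folded for comparison."""
    return {n.casefold() for n in [author.name, *author.aliases]}


def merge_candidates(authors: List[Author]) -> List[Tuple[int, int]]:
    """Pair each author lacking an ISNI with ISNI-bearing authors sharing a name or alias."""
    identified = [a for a in authors if a.isni]
    unidentified = [a for a in authors if not a.isni]
    queue = []
    for candidate in unidentified:
        for target in identified:
            if known_names(candidate) & known_names(target):
                queue.append((candidate.id, target.id))
    return queue


authors = [
    # The ISNI below is a made-up placeholder, not a real identifier.
    Author(1, "Søren Kierkegaard", ["Soren Kierkegaard", "Sören Kierkegaard"], isni="PLACEHOLDER-ISNI"),
    Author(2, "Soren Kierkegaard"),
    Author(3, "Rick Wayne"),
]
queue = merge_candidates(authors)
print(queue)
```

Each `(candidate_id, target_id)` pair would still need manual confirmation, since a shared name does not guarantee a shared identity.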

> I realise this may look like a backwards way to come at the problem but I figured it's easier to clean up the old mess if you've at least partially stemmed the flow of new mess coming in.

Again, couldn’t agree more, and I think it actually is the right way around. I was just trying to get the ball rolling for the step after that, which admittedly might be a bit premature.

kopischke avatar Nov 01 '21 10:11 kopischke

> > #1581 should reduce the frequency of this problem but doesn't actually provide a way to merge already-existing records.
>
> Just spitballing here, but one thing we might be able to leverage is the existing alias system, i.e. wherever aliases overlap author names, queue the entries for possible merging.
>
> > it's probably something we'd want admins to do rather than just anyone
>
> The question about who should be able to do this is an interesting one I feel might go beyond this specific issue. As of now, due to its small size and devoted community, spam, vandalism and ill-intentioned manipulation of data are not an issue on BookWyrm, but if Mastodon (or Goodreads, for that matter) is any reference, that will change once it gains more traction.

If trust is an issue, there are ways to manage that while still getting the benefits of crowdsourcing the archival. Some long-term ideas I can spitball:

  • A publicly viewable audit log with a list of changes made and a function to report/undo changes
  • The ability to propose an author-merge which can then be forwarded to a moderator for further consideration.

This does make me wonder, though: how does the federated model currently handle authors/books, given that you have a common pool of data across multiple instances?

TomatDividedBy0 avatar Nov 02 '21 18:11 TomatDividedBy0

I'm not sure if this has been considered, but I also see duplicate authors with the exact same spelling which, based on the works associated with them, are obviously meant to be the same person. It'd be nice to be able to merge them.

But let's not do this automatically! There are at least two distinct authors called "Rick Wayne", for example! These need to be kept separate. It'd be similarly good to have an option to, essentially, say "this book is actually by a different author of the same name" - that's possible by editing the book, removing the author, re-adding the author, and then choosing "this is a new author". A bit cumbersome, but at least it already works.

jfinkhaeuser avatar Aug 11 '22 07:08 jfinkhaeuser

I do plan to add some automatic merging, but it will never happen based on author name -- entries would only be merged if they shared a much more reliable unique identifier like the same ISNI or Wikipedia entry. And I agree that automatic merging doesn't fully address the problem that manual merging would.
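A minimal sketch of that rule, assuming authors are available as simple `(id, name, isni)` tuples: records are grouped only by a shared ISNI, and the name is never consulted. The function name and the identifier values are made up for illustration, and this is not the planned implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def auto_merge_groups(
    authors: Iterable[Tuple[int, str, Optional[str]]]
) -> Dict[str, List[int]]:
    """Group author ids by ISNI; only groups with more than one record are returned."""
    by_isni = defaultdict(list)
    for author_id, _name, isni in authors:
        if isni:
            # ISNIs are often displayed in space-separated blocks; compare without spaces.
            by_isni[isni.replace(" ", "")].append(author_id)
    return {isni: ids for isni, ids in by_isni.items() if len(ids) > 1}


authors = [
    # All identifiers below are made-up placeholders.
    (1, "Søren Kierkegaard", "0000 0000 0000 0001"),
    (2, "Soren Kierkegaard", "0000000000000001"),
    (3, "Rick Wayne", "0000 0000 0000 0002"),
    (4, "Rick Wayne", "0000 0000 0000 0003"),
    (5, "Andre Norton", None),
]
groups = auto_merge_groups(authors)
print(groups)
```

The two differently spelled Kierkegaard records are grouped because they share an identifier, while the two "Rick Wayne" records stay separate despite the identical name.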

mouse-reeve avatar Aug 11 '22 13:08 mouse-reeve

Another related suggestion: there is, for example, Andre Norton, for whom a duplicate entry exists on OpenLibrary. This was imported into BookWyrm, so that some works are attributed to "Andre Norton (duplicate)" (I edited those, there weren't too many).

In addition to the aforementioned aliases, and in light of such issues, it might be good to treat authors a bit like the work/edition split. That is, have a canonical author profile to which other profiles can be linked as duplicates.

When editing an author profile, I imagine there are two options, both of which have a use:

  • Add duplicate profiles (UI similar to how authors are added to editions); adding even a single duplicate makes the profile a canonical profile.
  • Display similarly named profiles; selecting one makes the current profile a duplicate of the selected one.

Furthermore:

  • In the above as well as when editing or adding editions, prefer canonical profiles. That should help de-duplicate profiles.
  • When visiting any of the profiles, list the profiles from related (canonical and their duplicates) authors. It may be worth putting these into different sections, e.g. locally added first, then canonical, then duplicates or some such.
  • Still add the ability to manually merge profiles. This helps fix mistakes.

The rationale here is that it'd be perfectly possible to keep distinct spellings as distinct profiles, if it's a language thing that keeps producing these things - but if it's just users being inattentive or import sources having issues, you can converge on a saner data set over time.
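The canonical/duplicate idea above can be sketched as a profile that optionally points at its canonical profile, similar in spirit to the work/edition split. Everything below (`AuthorProfile`, `canonical_id`, `resolve_canonical`, `related_profiles`) is hypothetical naming for illustration and does not reflect BookWyrm's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class AuthorProfile:
    id: int
    name: str
    # None means this profile is canonical (or not linked to anything yet).
    canonical_id: Optional[int] = None


def resolve_canonical(profiles: Dict[int, AuthorProfile], profile_id: int) -> AuthorProfile:
    """Follow canonical_id links until a profile without one is reached."""
    seen = set()
    current = profiles[profile_id]
    while current.canonical_id is not None:
        if current.id in seen:
            raise ValueError(f"cycle in duplicate links at profile {current.id}")
        seen.add(current.id)
        current = profiles[current.canonical_id]
    return current


def related_profiles(
    profiles: Dict[int, AuthorProfile], profile_id: int
) -> Tuple[AuthorProfile, List[AuthorProfile]]:
    """Return the canonical profile and every duplicate that resolves to it."""
    canonical = resolve_canonical(profiles, profile_id)
    duplicates = [
        p for p in profiles.values()
        if p.id != canonical.id and resolve_canonical(profiles, p.id).id == canonical.id
    ]
    return canonical, duplicates


profiles = {
    1: AuthorProfile(1, "Andre Norton"),
    2: AuthorProfile(2, "Andre Norton (duplicate)", canonical_id=1),
    3: AuthorProfile(3, "Rick Wayne"),
}
canonical, duplicates = related_profiles(profiles, 2)
print(canonical.name, [d.name for d in duplicates])
```

Visiting either Andre Norton profile resolves to the same canonical entry, and distinct spellings could remain as linked duplicates rather than being destroyed by a merge.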

jfinkhaeuser avatar Sep 27 '22 08:09 jfinkhaeuser