malformed genre entries (example for scifi), break search cta from homepage

Open jzacsh opened this issue 4 years ago • 1 comments

~470 works are incorrectly listed under a nonsense genre, "Fiction, science fiction, space opera" (those are three different genres). See here: https://openlibrary.org/subjects/fiction_science_fiction_space_opera

Evidence / Screenshot (if possible)

Screenshot 2021-01-24 at 13 57 38

Relevant url?

https://openlibrary.org/subjects/fiction_science_fiction_space_opera

Steps to Reproduce

  1. visit home page
  2. click science fiction [14,712 books] icon in the big "browse by subject" banner
    • you'll land here: https://openlibrary.org/subjects/science_fiction#sort=date_published&ebooks=true
  3. search for an author you know writes scifi, eg "Nnedi Okorafor"
    • you'll land here: https://openlibrary.org/search?subject_facet=Science+fiction&q=nnedi+okorafor
  4. bug: no results - start debugging, since you're surprised:
    • a. transform to general search (without subject_facet in GET params)
    • b. expected: you found Nnedi's works!
    • c. click a work, and click its "Edit" button (eg: the "Binti" work)
      • you'll land here: https://openlibrary.org/books/OL28911537M/Binti/edit
    • d. click "work details" tab
    • e. inspect Subject keywords? form-field (instruction says Please separate with commas. For example: cheese, Roman Empire, psychology)
    • f. root cause: value of form-entry is "Fiction, science fiction, general"
    • g. see if other works have this problem, or just this work had a bad entry?
      • answer: large problem - some 400 works have this bug: https://openlibrary.org/subjects/fiction_science_fiction_space_opera
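
The difference between the faceted search in step 3 and the general search in step 4a can be sketched as follows. This is only an illustration of the two URLs quoted above; the helper name `search_url` is hypothetical and not part of the Open Library codebase.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://openlibrary.org/search"


def search_url(query, subject_facet=None):
    """Build a search URL, optionally restricted to one subject facet.

    Hypothetical helper: it only reproduces the URL shapes seen in the
    reproduction steps and does not call the site.
    """
    params = {}
    if subject_facet is not None:
        # The facet is matched against whole subject strings, so a work whose
        # only subject is the single string "Fiction, science fiction, general"
        # would not be expected to match the facet "Science fiction".
        params["subject_facet"] = subject_facet
    params["q"] = query
    return f"{SEARCH_URL}?{urlencode(params)}"


faceted = search_url("nnedi okorafor", subject_facet="Science fiction")
general = search_url("nnedi okorafor")
print(faceted)
print(general)
```

Dropping the `subject_facet` parameter is the "transform to general search" of step 4a.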

note: the same malformed subject appears on Consider Phlebas, but there it is offset by a correct genre having been added beside the nonsense genre (so the book shows up in facet searches as expected from the home page).

Expected/Actual

root cause of the bug is step 4f being wrong (and this is true of many books; see 4g):

  • Actual: something created a subject "Fiction, science fiction, space opera"
  • Expected: this was a data-entry mistake and should have been three separate subjects: Fiction, science fiction, space opera (or maybe "Fiction", "science fiction", "space opera" - the point here is not to quote the entire string)

Details

  • Environment (prod/dev/local)? prod

Proposal & Constraints

  1. instance fix: mass-fix of all books in this category (update them all to delete this category and insert 3 new categories of Fiction, science fiction, space opera)
    • 1a) look for other instances (eg: maybe do a database search for all works with a subject field containing quotes?) and fix those just the same.
  2. systemic fix: I'd guess it's unlikely all 400 books got this bug from a single user's bad import csv - my guess is there's something broader perpetuating the bug (eg: an autocomplete that other users click in a dropdown, not knowing the selection is malformed data).
    • the systemic fix would be to make sure both import-logic as well as single-entry form-validation warn and try to help mitigate these kinds of errors (commas nested inside quotes) and either automatically strip the quotes or help users determine if an internal comma is really desired.
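
A minimal sketch of the form-validation idea in proposal 2, assuming the subject field is a comma-separated string in which double quotes protect internal commas. The function names are hypothetical, and this is not how the Open Library edit form is actually implemented.

```python
import csv


def parse_subject_field(raw):
    """Split a comma-separated subject field, honoring double-quoted entries."""
    # skipinitialspace lets a quoted entry follow ", " rather than only ",".
    rows = list(csv.reader([raw], skipinitialspace=True))
    fields = rows[0] if rows else []
    return [field.strip() for field in fields if field.strip()]


def subjects_needing_review(subjects):
    """Return subjects with an internal comma, which a form could warn about."""
    return [subject for subject in subjects if "," in subject]


raw = '"Fiction, science fiction, general", Space opera'
subjects = parse_subject_field(raw)
suspects = subjects_needing_review(subjects)
print(subjects)
print(suspects)
```

A form or importer could show the `suspects` list to the user and ask whether each internal comma is really intended, rather than silently stripping the quotes.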

jzacsh avatar Jan 24 '21 20:01 jzacsh

Many subject tags are imported with the record and not in our direct control. There are over 2,000 items with this tag alone. This is too many to fix manually. Maybe someone would like to try to remove this programmatically?

seabelis avatar Jul 01 '24 13:07 seabelis

@seabelis how about proposal 1a, or 2: do those sound possible to you?

While I'm not available to work on this any time soon, I'd guess the most useful help for the next contributor would be pointers to where solutions 1, 1a, and 2 might plausibly start in this codebase, or which proposals the team prefers or dislikes.


> Maybe someone would like to try to remove this programmatically?

Edit: Also I should point out that proposal 1 doesn't have to be a mass removal (in fact that might leave the buggy search experience still intact for many books), but could be a re-insertion of the intended/fixed values.
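
The re-insertion idea could look roughly like the sketch below. It assumes a work's subjects are available as a plain list of strings and that splitting on commas yields the intended values; both are assumptions, and `split_malformed_subject` is a hypothetical name rather than an existing Open Library function.

```python
def split_malformed_subject(subjects, malformed):
    """Replace one malformed subject with its comma-separated parts.

    Existing spellings on the work are preferred over the split parts, and
    duplicates are dropped case-insensitively.
    """
    parts = [part.strip() for part in malformed.split(",") if part.strip()]
    # Spellings already present on the work, keyed case-insensitively.
    existing = {s.casefold(): s for s in subjects if s != malformed}
    result = []
    seen = set()
    for subject in subjects:
        replacements = parts if subject == malformed else [subject]
        for item in replacements:
            key = item.casefold()
            if key not in seen:
                seen.add(key)
                result.append(existing.get(key, item))
    return result


before = [
    "Fiction, science fiction, space opera",
    "Science fiction",
    "Life on other planets",
]
after = split_malformed_subject(before, "Fiction, science fiction, space opera")
print(after)
```

Here the work keeps its existing "Science fiction" spelling, which matters if facet matching is case-sensitive, and gains "Fiction" and "space opera" in place of the malformed string.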

jzacsh avatar Jul 01 '24 14:07 jzacsh

Oh, interestingly: #7904 seems to be a newer (2 years later) rethink of the data structure involved here. I'd guess it's important that whatever is proposed here be coordinated closely with those folks.

jzacsh avatar Jul 01 '24 15:07 jzacsh

This subject was imported from Better World Books which is infamous for providing garbage metadata, but we've been unable to convince the powers that be to stop importing from it. ~Obviously having subjects with embedded commas is incompatible with using commas as the delimiter in the subject data entry field, so they would need to be escaped in some way, but~ (EDIT: subjects containing commas are quoted in the editing field) It is likely that it was originally intended to be the hierarchical genre "Fiction / Science Fiction / Space Opera" as you can see from the Library of Congress hierarchy here: https://id.loc.gov/authorities/genreForms/gf2014026551.html You can also see it in textual form rather than "broader" links at the bottom of this MARC record: https://openlibrary.org/show-records/marc_loc_2016/BooksAll.2016.part41.utf8:166410102:1707

You can see all the different ways that "space opera" is spelled on OpenLibrary with different hierarchy delimiters here: https://openlibrary.org/search/subjects?q=space+opera

My feature request (#2819) to make subjects first class objects instead of strings was an attempt to bring some order to this as well as allow links to things like LCSH, FAST, and Wikidata. It would also support internationalization for things like "Novelas del espacio".

The best fix would be to stop importing from BWB, but failing that all the bad metadata should be filtered out (which is probably effectively the same thing).

tfmorris avatar Jul 15 '24 15:07 tfmorris

Closing as a duplicate of #7904.

seabelis avatar Jan 13 '25 10:01 seabelis