
Patch & remediate book titles with incorrect HTML encoding (e.g. Ukrainian)


Problem

[Screenshot: search results where book titles display raw HTML character references instead of Ukrainian text]

Reproducing the bug

  1. Go to https://openlibrary.org/search?q=language%3Aukr+-edition.annas_archive%3A*&mode=everything&sort=new&page=40
  • Expected behavior: book titles render as Ukrainian (Cyrillic) text.
  • Actual behavior: many titles display raw HTML character references (e.g. &#1055;) instead of the intended characters.

Context

  • Browser (Chrome, Safari, Firefox, etc):
  • OS (Windows, Mac, etc):
  • Logged in (Y/N):
  • Environment (prod, dev, local): prod

Breakdown

Requirements Checklist

  • [ ]

Related files

Stakeholders


Instructions for Contributors

  • Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue, and each time after pushing code to GitHub, because the pre-commit bot may add commits to your PRs upstream.

bicolino34 — May 28 '25

It looks like one of the sources may be BetterWorldBooks' encoding:

https://openlibrary.org/books/OL36372351M/1055_1110_1089_1085_1103_1093_1074_1072_1083_1080?m=history
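The digits in that URL slug read like decimal HTML character references. As a minimal illustration (not from the issue; the stored title below is a hypothetical reconstruction inferred from the slug), decoding them yields readable Ukrainian:

```python
import html

# Hypothetical reconstruction of the stored title, inferred from the slug
# 1055_1110_1089_1085_1103_1093_1074_1072_1083_1080 in the URL above.
stored_title = "&#1055;&#1110;&#1089;&#1085;&#1103; &#1093;&#1074;&#1072;&#1083;&#1080;"

print(html.unescape(stored_title))  # -> Пісня хвали
```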

Next steps: in any case, we should:

  1. patch our importer so the data is fixed at the source of truth going forward (that should be a new issue; see the normalisation sketch after this list), and also
  2. remediate problematic titles as a separate effort with a one-time pass over the data dump.
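A minimal sketch of what step 1 could look like, assuming a Python helper applied to incoming titles; the function name and placement are illustrative, not the actual Open Library importer code:

```python
import html
import re

# Pattern for named, decimal, and hex HTML character references
# (the same shape used to find corrupted titles later in this thread).
ENTITY = re.compile(r"&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});")

def normalize_title(title: str) -> str:
    """Unescape repeatedly, so double-encoded input (e.g. &amp;#1055;) is also fixed."""
    while ENTITY.search(title):
        unescaped = html.unescape(title)
        if unescaped == title:  # entity-like text that is not actually an entity
            break
        title = unescaped
    return title

assert normalize_title("&#1055;&#1110;&#1089;&#1085;&#1103;") == "Пісня"
```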

mekarpeles — Jun 02 '25

@bicolino34 can you possibly create issues for these two tasks? If you have the time, that would be a big help.

mekarpeles — Jun 02 '25

This query should get us close to what we need: https://openlibrary.org/search?q=language%3Aukr+title%3A%22%23%22&mode=everything

cdrini — Jun 02 '25

> This query should get us close to what we need: https://openlibrary.org/search?q=language%3Aukr+title%3A%22%23%22&mode=everything

I'm not sure why this is cast as a Ukrainian problem. That search currently returns 258 results, while removing the language filter produces over 200,000 hits, and Asian titles demonstrate this problem frequently.

This one-liner:

gzcat ol_dump_editions_2025-11-06.txt.gz | cut -f 5 | jq -r '[.key,.title,.subtitle?] | @tsv'  | grep -E '&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});' | gzip > edition-html-entities.tsv.gz

will produce a list of 148.5K edition titles which need fixing (and there are another 72K work titles and 15K author names which need fixing too).
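For the remediation side, a hedged sketch that reads the TSV produced by the pipeline above and prints proposed fixes for review before any bulk edit; the file name and column order follow the one-liner, everything else is illustrative:

```python
import gzip
import html

with gzip.open("edition-html-entities.tsv.gz", "rt", encoding="utf-8") as fh:
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        key, title = fields[0], fields[1]
        fixed = html.unescape(title)
        if fixed != title:
            # key, original title, proposed title
            print(f"{key}\t{title}\t{fixed}")
```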

The theory that BWB is the source of this trashy metadata is a good one. Only 480 of the corrupted edition records don't include a BWB source record, so this one bookseller is the source of virtually ALL of this invalid, low-quality metadata.
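For anyone who wants to reproduce that breakdown, here is a hedged sketch under two assumptions: the edition dump keeps its JSON in the fifth tab-separated column (as the one-liner above relies on), and BetterWorldBooks imports carry a `bwb:` prefix in `source_records`:

```python
import gzip
import json
import re

ENTITY = re.compile(r"&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});")
with_bwb = without_bwb = 0

with gzip.open("ol_dump_editions_2025-11-06.txt.gz", "rt", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line.rstrip("\n").split("\t")[4])
        if not ENTITY.search(record.get("title", "")):
            continue
        sources = record.get("source_records") or []
        if any(s.startswith("bwb:") for s in sources):
            with_bwb += 1
        else:
            without_bwb += 1

print(f"corrupted titles with a BWB source record:    {with_bwb}")
print(f"corrupted titles without a BWB source record: {without_bwb}")
```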

Note that because these errors will have defeated the matching algorithms, many of these records will be duplicates too.

tfmorris — Nov 14 '25