
Patch & remediate book titles with incorrect HTML encoding (e.g. Ukrainian)


Problem

[Screenshot: search results where book titles display raw HTML character references instead of Ukrainian text]

Reproducing the bug

  1. Go to https://openlibrary.org/search?q=language%3Aukr+-edition.annas_archive%3A*&mode=everything&sort=new&page=40
  • Expected behavior: book titles render as Ukrainian (Cyrillic) text.
  • Actual behavior: many titles display raw HTML character references (e.g. &#1055;) instead of the intended characters.

Context

  • Browser (Chrome, Safari, Firefox, etc):
  • OS (Windows, Mac, etc):
  • Logged in (Y/N):
  • Environment (prod, dev, local): prod

Breakdown

Requirements Checklist

  • [ ]

Related files

Stakeholders


Instructions for Contributors

  • Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue, and each time after pushing code to GitHub, because the pre-commit bot may add commits to your PRs upstream.

bicolino34 — May 28 '25

It looks like one of the sources may be BetterWorldBooks' encoding:

https://openlibrary.org/books/OL36372351M/1055_1110_1089_1085_1103_1093_1074_1072_1083_1080?m=history
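The digits in that URL slug read like decimal HTML character references. As a minimal illustration (not from the issue; the stored title below is a hypothetical reconstruction inferred from the slug), decoding them yields readable Ukrainian:

```python
import html

# Hypothetical reconstruction of the stored title, inferred from the slug
# 1055_1110_1089_1085_1103_1093_1074_1072_1083_1080 in the URL above.
stored_title = "&#1055;&#1110;&#1089;&#1085;&#1103; &#1093;&#1074;&#1072;&#1083;&#1080;"

print(html.unescape(stored_title))  # -> Пісня хвали
```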

Next steps: in any case, we should:

  1. patch our importer so the data is fixed at the source of truth going forward (that should be a new issue; see the normalisation sketch after this list), and also
  2. remediate problematic titles as a separate effort with a one-time pass over the data dump.
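A minimal sketch of what step 1 could look like, assuming a Python helper applied to incoming titles; the function name and placement are illustrative, not the actual Open Library importer code:

```python
import html
import re

# Pattern for named, decimal, and hex HTML character references
# (the same shape used to find corrupted titles later in this thread).
ENTITY = re.compile(r"&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});")

def normalize_title(title: str) -> str:
    """Unescape repeatedly, so double-encoded input (e.g. &amp;#1055;) is also fixed."""
    while ENTITY.search(title):
        unescaped = html.unescape(title)
        if unescaped == title:  # entity-like text that is not actually an entity
            break
        title = unescaped
    return title

assert normalize_title("&#1055;&#1110;&#1089;&#1085;&#1103;") == "Пісня"
```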

mekarpeles — Jun 02 '25

@bicolino34 can you possibly create issues for these two tasks? If you have the time, that would be a big help.

mekarpeles — Jun 02 '25

This query should get us close to what we need: https://openlibrary.org/search?q=language%3Aukr+title%3A%22%23%22&mode=everything

cdrini — Jun 02 '25

> This query should get us close to what we need: https://openlibrary.org/search?q=language%3Aukr+title%3A%22%23%22&mode=everything

I'm not sure why this is cast as a Ukrainian problem. That search currently returns 258 results, while removing the language filter produces over 200,000 hits, and Asian titles demonstrate this problem frequently.

This one-liner:

gzcat ol_dump_editions_2025-11-06.txt.gz | cut -f 5 | jq -r '[.key,.title,.subtitle?] | @tsv'  | grep -E '&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});' | gzip > edition-html-entities.tsv.gz

will produce a list of 148.5K edition titles which need fixing (and there are another 72K work titles and 15K author names which need fixing too).
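For the remediation side, a hedged sketch that reads the TSV produced by the pipeline above and prints proposed fixes for review before any bulk edit; the file name and column order follow the one-liner, everything else is illustrative:

```python
import gzip
import html

with gzip.open("edition-html-entities.tsv.gz", "rt", encoding="utf-8") as fh:
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        key, title = fields[0], fields[1]
        fixed = html.unescape(title)
        if fixed != title:
            # key, original title, proposed title
            print(f"{key}\t{title}\t{fixed}")
```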

The theory that BWB is the source of this trashy metadata is a good one. Only 480 of the corrupted edition records don't include a BWB source record, so this one bookseller is the source of virtually ALL of this invalid, low-quality metadata.
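For anyone who wants to reproduce that breakdown, here is a hedged sketch under two assumptions: the edition dump keeps its JSON in the fifth tab-separated column (as the one-liner above relies on), and BetterWorldBooks imports carry a `bwb:` prefix in `source_records`:

```python
import gzip
import json
import re

ENTITY = re.compile(r"&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});")
with_bwb = without_bwb = 0

with gzip.open("ol_dump_editions_2025-11-06.txt.gz", "rt", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line.rstrip("\n").split("\t")[4])
        if not ENTITY.search(record.get("title", "")):
            continue
        sources = record.get("source_records") or []
        if any(s.startswith("bwb:") for s in sources):
            with_bwb += 1
        else:
            without_bwb += 1

print(f"corrupted titles with a BWB source record:    {with_bwb}")
print(f"corrupted titles without a BWB source record: {without_bwb}")
```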

Note that because these errors will have defeated the matching algorithms, many of these records will be duplicates too.

tfmorris — Nov 14 '25