Patch & remediate book titles with incorrect HTML encoding (e.g. Ukrainian)
### Problem

### Reproducing the bug
- Go to https://openlibrary.org/search?q=language%3Aukr+-edition.annas_archive%3A*&mode=everything&sort=new&page=40
- Expected behavior: Ukrainian titles display as readable Cyrillic text.
- Actual behavior: many titles display raw HTML character references (e.g. `&#1055;&#1110;&#1089;…`) instead of the decoded text.
### Context
- Browser (Chrome, Safari, Firefox, etc):
- OS (Windows, Mac, etc):
- Logged in (Y/N):
- Environment (prod, dev, local): prod
### Breakdown

#### Requirements Checklist
- [ ]
#### Related files

#### Stakeholders

### Instructions for Contributors
- Please run these commands to ensure your repository is up to date before creating a new branch for this issue, and again each time after pushing code to GitHub, because the pre-commit bot may add commits to your PRs upstream.
It looks like one of the sources of the bad encoding may be BetterWorldBooks:
https://openlibrary.org/books/OL36372351M/1055_1110_1089_1085_1103_1093_1074_1072_1083_1080?m=history
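For reference, the slug in that URL is a run of HTML numeric character references with the `&`, `#`, and `;` stripped. A minimal sketch of what went wrong, in Python; the entity string below is reconstructed from the slug, so any spaces in the real title are lost:

```python
import html

# Title as stored: raw HTML numeric character references instead of the
# decoded text. (Reconstructed from the URL slug above.)
stored = "&#1055;&#1110;&#1089;&#1085;&#1103;&#1093;&#1074;&#1072;&#1083;&#1080;"

# html.unescape decodes named, decimal, and hex character references.
print(html.unescape(stored))  # -> the intended Cyrillic title
```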
Next steps: in any case, we should:
- patch our importer at the source of truth moving forward (that should be a new issue; see the sketch after this list), and
- remediate problematic titles as a separate effort with a one-time pass over the data dump.
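A minimal sketch of what the importer patch could look like, assuming a normalization hook can be added where source metadata enters the pipeline (`normalize_title` is hypothetical, not an existing Open Library function):

```python
import html
import re

# Matches named (&amp;), decimal (&#1055;), and hex (&#x41F;) references.
ENTITY = re.compile(r'&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});')

def normalize_title(raw: str) -> str:
    """Decode HTML character references left behind by a source feed."""
    if raw and ENTITY.search(raw):
        return html.unescape(raw)
    return raw
```

e.g. `normalize_title('&#1055;&#1110;&#1089;&#1085;&#1103;')` returns `'Пісня'`.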
@bicolino34 can you possibly create issues for these two tasks? If you have the time, that would be a big help.
This query should get us close to what we need: https://openlibrary.org/search?q=language%3Aukr+title%3A%22%23%22&mode=everything
I'm not sure why this is cast as a Ukrainian problem. That search currently returns 258 results, while removing the language filter produces over 200,000 hits, and Asian titles demonstrate this problem frequently.
This one-liner:

```sh
gzcat ol_dump_editions_2025-11-06.txt.gz | cut -f 5 | jq -r '[.key,.title,.subtitle?] | @tsv' | grep -E '&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});' | gzip > edition-html-entities.tsv.gz
```

will produce a list of 148.5K edition titles which need fixing (and there are another 72K work titles and 15K author names which need fixing too).
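A follow-up sketch, assuming the TSV layout produced by the jq filter above (key, title, optional subtitle), that turns that list into proposed corrections:

```python
import gzip
import html

# Read the TSV produced above and emit key / broken title / proposed fix.
with gzip.open('edition-html-entities.tsv.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        key, title, *_ = line.rstrip('\n').split('\t')
        fixed = html.unescape(title)
        if fixed != title:
            print(key, title, fixed, sep='\t')
```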
The theory that BWB is the source of this trashy metadata is a good one: only 480 of the corrupted edition records don't include a BWB source record, so this one bookseller is the source of virtually ALL of this invalid, low-quality metadata.
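For anyone who wants to re-derive that count from the dump, a sketch assuming BWB entries in `source_records` carry the usual `bwb:` prefix (this checks titles only, so the totals may differ slightly from the subtitle-inclusive one-liner above):

```python
import gzip
import json
import re

ENTITY = re.compile(r'&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});')

with_bwb = without_bwb = 0
with gzip.open('ol_dump_editions_2025-11-06.txt.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        # Column 5 of the dump is the edition JSON (matches `cut -f 5`).
        record = json.loads(line.split('\t')[4])
        if ENTITY.search(record.get('title') or ''):
            sources = record.get('source_records') or []
            if any(s.startswith('bwb:') for s in sources):
                with_bwb += 1
            else:
                without_bwb += 1

print(f'with BWB source: {with_bwb}, without: {without_bwb}')
```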
Note that because these errors will have defeated the matching algorithms, many of these records will be duplicates too.