bookreader
bookreader copied to clipboard
Text layer renders oddly for this book
Some of this is just bad OCR, but we could do a better job handling things perhaps?
Evidence / Screenshot (if possible)

https://archive.org/details/bim_eighteenth-century_the-dutch-fortune-teller_booker-john-student-in_1750/page/n5/mode/2up
Note:
- The big blue box in the lower left corner. That is an auto-generated whitespace between two words, that are bad OCR from the decorative frame.
- Note how some lines have very thin or off-spaced text layer highlights. This is likely due to text baseline being unspecified in the djvu xml. But specified in the hocr.
Context
- PDF text layer rendering code that IA uses: https://github.com/internetarchive/archive-pdf-tools/blob/master/internetarchivepdf/pdfrenderer.py#L34
- bookreader text layer rendering code: https://github.com/internetarchive/bookreader/blob/21c1606f871eb5a82aacd40e704f69b8d48f482f/src/plugins/plugin.text_selection.js#L186
- The djvu XML for the book above: https://ia801507.us.archive.org/34/items/bim_eighteenth-century_the-dutch-fortune-teller_booker-john-student-in_1750/bim_eighteenth-century_the-dutch-fortune-teller_booker-john-student-in_1750_djvu.xml
- hocr for the book above: view-source:https://ia801507.us.archive.org/34/items/bim_eighteenth-century_the-dutch-fortune-teller_booker-john-student-in_1750/bim_eighteenth-century_the-dutch-fortune-teller_booker-john-student-in_1750_hocr.html
Proposal & Constraints
- Add a confidence threshold? Maybe >10?
- Switch to reading the hocr html file and use the baseline?
- Tweak the whitespace generating code to avoid big rects somehow?
Stakeholders @MerlijnWajer
- See also JIRA: https://webarchive.jira.com/browse/WEBDEV-5395