Text layer renders oddly for this book

Open cdrini opened this issue 3 years ago • 0 comments

Some of this is just bad OCR, but perhaps we could handle these cases better?

Evidence / Screenshot (if possible)

https://archive.org/details/bim_eighteenth-century_the-dutch-fortune-teller_booker-john-student-in_1750/page/n5/mode/2up

Note:

The big blue box in the lower left corner is auto-generated whitespace between two words; both words are bad OCR from the decorative frame.
Some lines have very thin or misaligned text layer highlights. This is likely because the text baseline is unspecified in the djvu XML, even though it is specified in the hOCR.
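
To illustrate the baseline point, here is a rough Python sketch of how a line's baseline could be read from an hOCR `title` attribute. It is not taken from archive-pdf-tools or bookreader. The function names are hypothetical, and it assumes the common two-coefficient form `baseline <slope> <offset>`, measured from the bottom-left corner of the line's `bbox` in image coordinates (y grows downward). When no baseline is present it falls back to the bottom of the bbox, which may be roughly what happens when only the djvu XML is available.

```python
def parse_hocr_title(title):
    """Split an hOCR title attribute into {property name: [string args]}."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


def baseline_endpoints(title):
    """Return the (x, y) baseline endpoints of a text line, in page pixels.

    Assumes a linear baseline ("baseline <slope> <offset>"); higher-order
    polynomials are not handled in this sketch.
    """
    props = parse_hocr_title(title)
    x0, y0, x1, y1 = (float(v) for v in props["bbox"])
    if "baseline" not in props:
        # No baseline given: fall back to the bottom edge of the bbox.
        return (x0, y1), (x1, y1)
    slope, offset = (float(v) for v in props["baseline"])
    # The baseline is expressed relative to the bottom-left of the bbox.
    left = (x0, y1 + offset)
    right = (x1, y1 + offset + slope * (x1 - x0))
    return left, right


if __name__ == "__main__":
    # Made-up example values, not taken from the book linked above.
    print(baseline_endpoints("bbox 100 200 500 260; baseline 0.01 -12"))
    print(baseline_endpoints("bbox 100 200 500 260"))
```

With a baseline available, the highlight for each word could be positioned against the baseline rather than against the word's own bbox, which might avoid the very thin highlights seen on some lines.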

Context

PDF text layer rendering code that IA uses: https://github.com/internetarchive/archive-pdf-tools/blob/master/internetarchivepdf/pdfrenderer.py#L34
bookreader text layer rendering code: https://github.com/internetarchive/bookreader/blob/21c1606f871eb5a82aacd40e704f69b8d48f482f/src/plugins/plugin.text_selection.js#L186
The djvu XML for the book above: https://ia801507.us.archive.org/34/items/bim_eighteenth-century_the-dutch-fortune-teller_booker-john-student-in_1750/bim_eighteenth-century_the-dutch-fortune-teller_booker-john-student-in_1750_djvu.xml
hocr for the book above: view-source:https://ia801507.us.archive.org/34/items/bim_eighteenth-century_the-dutch-fortune-teller_booker-john-student-in_1750/bim_eighteenth-century_the-dutch-fortune-teller_booker-john-student-in_1750_hocr.html

Proposal & Constraints

Stakeholders @MerlijnWajer

Aug 04 '22 17:08 cdrini