Text layer renders oddly for this book

Open cdrini opened this issue 3 years ago • 0 comments

Some of this is just bad OCR, but perhaps we could handle it more gracefully?

Evidence / Screenshot (if possible)

image

https://archive.org/details/bim_eighteenth-century_the-dutch-fortune-teller_booker-john-student-in_1750/page/n5/mode/2up

Note:

  • The big blue box in the lower left corner is auto-generated whitespace between two words, both of which are bad OCR from the decorative frame.
  • Some lines have very thin or badly offset text layer highlights. This is likely because the text baseline is unspecified in the djvu XML, although it is specified in the hocr.
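
To illustrate the baseline point, here is a minimal sketch of reading the `bbox` and `baseline` properties from an hOCR `title` attribute. It assumes the common two-value form `baseline slope offset`, measured from the bottom-left corner of the bounding box with y growing downward. The function names `parse_hocr_title` and `baseline_y` are hypothetical and do not come from bookreader or archive-pdf-tools.

```python
def parse_hocr_title(title):
    """Split an hOCR title attribute into {property name: [values]}."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if not tokens:
            continue
        name, raw_values = tokens[0], tokens[1:]
        try:
            props[name] = [float(v) for v in raw_values]
        except ValueError:
            # Non-numeric properties (e.g. quoted file names) stay as strings.
            props[name] = raw_values
    return props


def baseline_y(props, x):
    """Y coordinate (page pixels, y grows downward) of the baseline at page x."""
    left, top, right, bottom = props["bbox"]
    baseline = props.get("baseline", [0.0, 0.0])
    if len(baseline) != 2:
        # Assumption: only linear baselines are handled; otherwise use the
        # bottom edge of the bounding box.
        baseline = [0.0, 0.0]
    slope, offset = baseline
    return bottom + slope * (x - left) + offset


# Hypothetical line element, not taken from the book linked above.
line = parse_hocr_title("bbox 100 200 500 260; baseline 0.01 -12")
print(baseline_y(line, 100), baseline_y(line, 500))
```

When no baseline is present, as seems to be the case for the djvu XML here, the sketch falls back to the bottom of the bounding box, which may be part of why highlights look thin or offset.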

Context

  • PDF text layer rendering code that IA uses: https://github.com/internetarchive/archive-pdf-tools/blob/master/internetarchivepdf/pdfrenderer.py#L34
  • bookreader text layer rendering code: https://github.com/internetarchive/bookreader/blob/21c1606f871eb5a82aacd40e704f69b8d48f482f/src/plugins/plugin.text_selection.js#L186
  • The djvu XML for the book above: https://ia801507.us.archive.org/34/items/bim_eighteenth-century_the-dutch-fortune-teller_booker-john-student-in_1750/bim_eighteenth-century_the-dutch-fortune-teller_booker-john-student-in_1750_djvu.xml
  • hocr for the book above: view-source:https://ia801507.us.archive.org/34/items/bim_eighteenth-century_the-dutch-fortune-teller_booker-john-student-in_1750/bim_eighteenth-century_the-dutch-fortune-teller_booker-john-student-in_1750_hocr.html

Proposal & Constraints

  • Add a confidence threshold? Maybe >10?
  • Switch to reading the hocr html file and use the baseline?
  • Tweak the whitespace generating code to avoid big rects somehow?
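
A rough sketch of how the first and third ideas might combine: drop words at or below a confidence threshold, then clamp any generated whitespace rect to a maximum width. Everything here is an assumption for illustration, including the `Word` structure, a 0-100 confidence scale, and clamping the gap to a multiple of the line height. This is not how bookreader or archive-pdf-tools currently build the text layer.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: float
    right: float
    conf: float  # OCR confidence, assumed to be on a 0-100 scale


def filter_words(words, min_conf=10):
    """Keep only words whose confidence is above min_conf."""
    return [w for w in words if w.conf > min_conf]


def space_rects(words, line_height, max_gap_em=1.0):
    """Return (left, right) spans for the whitespace between adjacent words.

    Gaps wider than max_gap_em * line_height are clamped to that width, so a
    stray word far from its neighbour cannot produce a huge rect.
    """
    ordered = sorted(words, key=lambda w: w.left)
    max_gap = max_gap_em * line_height
    rects = []
    for prev, nxt in zip(ordered, ordered[1:]):
        gap = nxt.left - prev.right
        if gap <= 0:
            continue  # overlapping or touching words need no whitespace rect
        rects.append((prev.right, prev.right + min(gap, max_gap)))
    return rects


# Hypothetical line: the "~" stands in for noise OCR'd from a decorative frame.
words = [Word("The", 0, 30, 95), Word("~", 40, 45, 3), Word("Dutch", 400, 460, 90)]
kept = filter_words(words, min_conf=10)
print(space_rects(kept, line_height=20))
```

Clamping rather than skipping keeps a selectable space between the words, so copied text still has a separator. Whether that is the right trade-off for BookReader is an open question.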

Stakeholders @MerlijnWajer


  • See also JIRA: https://webarchive.jira.com/browse/WEBDEV-5395

cdrini, Aug 04 '22 17:08