Strange behaviour parsing PDF File

Open ondrejbartas opened this issue 8 years ago • 1 comments

Hi,

I have this weird error: [screenshot, 2017-05-02 at 14:52:07]

I am getting this result with: x = File.open('~/billapp.pdf', 'rb')

I am attaching that PDF here: billapp.pdf

It works fine with other PDFs, but not with this one :(

ondrejbartas avatar May 02 '17 12:05 ondrejbartas

Sorry I didn't get around to looking into this in 2017 😞

I just had a proper look and confirmed this issue is still happening in v2.8.0, and that evince can extract the text correctly. It's surprising because the file metadata claims it was created by prawn, and usually pdf-reader can handle prawn generated files just fine.

The root issue appears to be this conditional: https://github.com/yob/pdf-reader/blob/951f9c2659ce3b25c7731d79d54a2ce4ae3bc8e4/lib/pdf/reader/font.rb#L54-L60

The fonts in this file have ToUnicode cmaps so we defer all unicode conversion to them. However, the CMaps only have a handful of mappings defined in them. I'm not sure if the CMaps should have some default mappings in them, or maybe we should be falling back to the encoding dict for glyphs not explicitly listed in the CMap 🤔
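To make the second option concrete, here is a minimal sketch of the fallback idea. It is not pdf-reader's actual code: SPARSE_CMAP, ENCODING and to_utf8 are hypothetical names made up for illustration, and the identity ASCII table is only a stand-in for a real font encoding dict.

```ruby
# Hypothetical sketch, NOT pdf-reader's real implementation.

# A sparse ToUnicode CMap: only a handful of character codes are mapped.
# Here code 0x41 maps to U+0391 (GREEK CAPITAL LETTER ALPHA).
SPARSE_CMAP = { 0x41 => [0x0391] }.freeze

# Stand-in for the font's encoding dict (assumed: identity for printable ASCII).
ENCODING = (0x20..0x7E).to_h { |code| [code, code] }.freeze

# Convert character codes to a UTF-8 string. The CMap wins when it has an
# entry; otherwise we fall back to the encoding, and finally to U+FFFD.
def to_utf8(codes, cmap: SPARSE_CMAP, encoding: ENCODING)
  codes.flat_map { |code|
    cmap[code] || [encoding.fetch(code, 0xFFFD)]
  }.pack("U*")
end

puts to_utf8("ABC".bytes)
```

With this approach, only the first code is taken from the CMap; the other two come from the encoding fallback rather than being lost. Whether that is the right behaviour for real-world files is exactly the open question above.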

yob avatar Jan 13 '22 00:01 yob