pdf-reader
Whitespaces removed with certain fonts
Given the PDF file sample.pdf, which has a few lines of text in different fonts, when I try to extract the text of the first page with
require 'pdf-reader'
file = File.open('./tmp/sample.pdf', 'rb') # open in binary mode; PDFs are not plain text
reader = PDF::Reader.new(file)
puts reader.pages.first.text
I get
Spaces with font Courier bold
Spaces with font Courier normal
Spaces with font Times-Roman bold
Spaces with font Times-Roman normal
Spaces with font Helvetica bold
Spaces with font Helvetica normal
SpaceswithfontLatobold
SpaceswithfontLatonormal
Notice that for the text in Lato, the whitespace has been removed.
I was expecting the whitespace to be preserved:
Spaces with font Lato bold
Spaces with font Lato normal
Is this because Lato's space glyph is not wide enough for the criteria in PDF::Reader::TextRun#+?
https://github.com/yob/pdf-reader/blob/d931456e372cc029209b4d8b321496396a7d35df/lib/pdf/reader/text_run.rb#L67
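To make the question concrete, here is a rough sketch of the kind of gap-based merge I suspect is happening. It is not pdf-reader's actual code: the Run struct, the merge_runs helper, the 0.2 ratio and the coordinates are all assumptions made up for illustration. The idea is that two adjacent runs are joined without a space when the horizontal gap between them is smaller than some fraction of the font size, so a font with a narrow space glyph would lose its spaces.

```ruby
# Hypothetical stand-in for a positioned run of text (not pdf-reader's class).
Run = Struct.new(:x, :width, :font_size, :text) do
  # Right-hand edge of the run.
  def endx
    x + width
  end
end

# Assumed threshold: gaps narrower than this fraction of the font size
# are treated as "no space". The real criteria may differ.
GAP_RATIO = 0.2

# Merge two runs on the same line, inserting a space only if the gap is wide enough.
def merge_runs(left, right, gap_ratio: GAP_RATIO)
  gap = right.x - left.endx
  joiner = gap < left.font_size * gap_ratio ? "" : " "
  Run.new(left.x, right.endx - left.x, left.font_size, left.text + joiner + right.text)
end

# Made-up coordinates: a 3pt gap vs. a 1.5pt gap at font size 10.
wide   = merge_runs(Run.new(0, 30, 10, "Spaces"), Run.new(33, 20, 10, "with"))
narrow = merge_runs(Run.new(0, 30, 10, "Spaces"), Run.new(31.5, 20, 10, "with"))

puts wide.text
puts narrow.text
```

With these invented numbers the 3pt gap keeps its space while the 1.5pt gap does not, which matches the pattern I am seeing with Lato versus the other fonts.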