Space between words ignored in particular PDF file
We're using pdf-reader to extract text from PDFs, but it seems that in the case of this PDF, the space between the words "Yes" and "management" is ignored, meaning the text for this PDF comes out as "Yesmanagement".
I tried using @tdenovan's fork, which is supposed to help with spaces, but that didn't fix this particular problem.
https://dl.dropboxusercontent.com/u/94239898/Resume-gretet-fdfsdf.pdf
@joedrew Did you ever find a solution to this problem? I'm running into the same thing now.
@JackWCollins Not from what I recall, no.
Thanks for reporting this issue, and apologies for taking so long to review it. I've checked the sample file against current master, and confirmed the spacing issue still exists.
There are definitely some weaknesses in the layout algorithm in the PageLayout class, and it's interesting to see the tweaks made by @tdenovan to improve it. Unfortunately I'm short on time at the moment, but I'll happily review PRs if anyone wants to take a stab at improving that class.
I think the issue may be caused by the following code: https://github.com/yob/pdf-reader/blob/6cc5c6663eb83c20ce5c43c06b3db716829ba969/lib/pdf/reader/page_text_receiver.rb#L117
Removing the SPACE check fixes the missing spaces on the second line of the first paragraph for the following sample PDF file: https://royalegroupnyc.com/wp-content/uploads/seating_areas/sample_pdf.pdf
However, the PDF uses text strings terminated by spaces for both lines of the paragraph, so there is clearly something else at play here. Note that the first line of the paragraph has larger spaces between words; this may be relevant.
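Here is a minimal sketch of one way the two observations could fit together. It is a simplified model and not pdf-reader's actual code: it assumes that space glyphs are dropped before layout and that the layout step re-inserts a space only when the horizontal gap between neighbouring runs exceeds a threshold. `Run`, `join_runs`, `build_runs`, the glyph widths and the threshold are all hypothetical names and values chosen for illustration.

```ruby
# Simplified model of gap-based space inference. NOT pdf-reader's code.
# Run, join_runs and build_runs are hypothetical helpers.
Run = Struct.new(:x, :width, :text) do
  # Right-hand edge of the run.
  def endx
    x + width
  end
end

# Join the runs of a single line, inserting a space whenever the gap
# between two neighbouring runs is wider than `threshold` (assumed value).
def join_runs(runs, threshold:)
  sorted = runs.sort_by(&:x)
  out = +""
  sorted.each_with_index do |run, i|
    if i > 0
      gap = run.x - sorted[i - 1].endx
      out << " " if gap > threshold
    end
    out << run.text
  end
  out
end

# Lay out "Yes management" one glyph at a time. Every letter is 5 units
# wide; the space glyph is `space_width` units wide. When keep_spaces is
# false the space still advances x but produces no run, which mimics a
# SPACE check that skips space glyphs.
def build_runs(keep_spaces:, space_width:)
  x = 0.0
  runs = []
  "Yes management".each_char do |ch|
    width = ch == " " ? space_width : 5.0
    runs << Run.new(x, width, ch) if keep_spaces || ch != " "
    x += width
  end
  runs
end

narrow_dropped = join_runs(build_runs(keep_spaces: false, space_width: 1.0), threshold: 2.0)
wide_dropped   = join_runs(build_runs(keep_spaces: false, space_width: 3.0), threshold: 2.0)
narrow_kept    = join_runs(build_runs(keep_spaces: true,  space_width: 1.0), threshold: 2.0)

puts narrow_dropped # Yesmanagement
puts wide_dropped   # Yes management
puts narrow_kept    # Yes management
```

In this model a narrow space that is dropped leaves a gap too small to be detected, while a wide space survives being dropped. That would be consistent with the first line (larger spaces) rendering correctly and the second line losing its spaces, but it is only a hypothesis and has not been checked against the sample PDF.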