Space between words ignored in particular PDF file

Open joedrew opened this issue 11 years ago • 4 comments

We're using pdf-reader to extract text from PDFs, but it seems that in the case of this PDF, the space between the words "Yes" and "management" is ignored, meaning the text for this PDF comes out as "Yesmanagement".

I tried using @tdenovan's fork, which is supposed to help with spaces, but that didn't fix this particular problem.

https://dl.dropboxusercontent.com/u/94239898/Resume-gretet-fdfsdf.pdf

joedrew avatar Dec 04 '14 18:12 joedrew

@joedrew Did you ever find a solution to this problem? I'm running into the same thing now.

JackWCollins avatar Nov 13 '15 16:11 JackWCollins

@JackWCollins Not from what I recall, no.

joedrew avatar Nov 13 '15 16:11 joedrew

Thanks for reporting this issue, and apologies for taking so long to review it. I've checked the sample file against current master, and confirmed the spacing issue still exists.

There are definitely some weaknesses in the layout algorithm in the PageLayout class, and it's interesting to see the tweaks made by @tdenovan to improve it. Unfortunately I'm short on time at the moment, but I'll happily review PRs if anyone wants to take a stab at improving that class.

yob avatar Feb 14 '17 14:02 yob

I think the issue may be caused by the following code: https://github.com/yob/pdf-reader/blob/6cc5c6663eb83c20ce5c43c06b3db716829ba969/lib/pdf/reader/page_text_receiver.rb#L117

Removing the SPACE check fixes the missing spaces on the second line of the first paragraph of the following sample PDF file: https://royalegroupnyc.com/wp-content/uploads/seating_areas/sample_pdf.pdf

However, the PDF uses text strings terminated by spaces for both lines of that paragraph, so there is clearly something else at play here. Note that the first line of the paragraph has larger spaces between words; this may be relevant.
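
One possible explanation is sketched below as a toy model, not as pdf-reader's actual code. It assumes that the SPACE check drops space glyphs while still advancing the text position, and that a later layout step re-inserts a space only where the gap between neighbouring glyphs exceeds some threshold. Every name here (`Run`, `build_runs`, `join_runs`, `glyphs_for`, `gap_threshold`) and all the widths are hypothetical and chosen only for illustration.

```ruby
# Hypothetical, simplified model of gap-based space recovery.
# None of these names or numbers come from pdf-reader.
Run = Struct.new(:x, :width, :text)

SPACE = " "

# Lay glyphs out left to right. When drop_spaces is true, space glyphs
# are not recorded, but they still advance x and so leave a gap behind.
def build_runs(glyphs, drop_spaces:)
  x = 0.0
  runs = []
  glyphs.each do |text, width|
    runs << Run.new(x, width, text) unless drop_spaces && text == SPACE
    x += width
  end
  runs
end

# Rebuild the line, inserting a space wherever the gap between two
# consecutive non-space runs is larger than gap_threshold.
def join_runs(runs, gap_threshold:)
  out = +""
  prev = nil
  runs.each do |run|
    if prev && run.text != SPACE && prev.text != SPACE
      gap = run.x - (prev.x + prev.width)
      out << SPACE if gap > gap_threshold
    end
    out << run.text
    prev = run
  end
  out
end

# Turn a string into [character, width] pairs with made-up widths.
def glyphs_for(string, char_width:, space_width:)
  string.each_char.map { |c| [c, c == SPACE ? space_width : char_width] }
end

wide  = glyphs_for("Yes management", char_width: 5.0, space_width: 3.0)
tight = glyphs_for("Yes management", char_width: 5.0, space_width: 1.0)

puts join_runs(build_runs(wide,  drop_spaces: true),  gap_threshold: 2.0)
puts join_runs(build_runs(tight, drop_spaces: true),  gap_threshold: 2.0)
puts join_runs(build_runs(tight, drop_spaces: false), gap_threshold: 2.0)
```

In this model the wide-spaced line keeps its space, the tight-spaced line collapses to "Yesmanagement" when space glyphs are dropped, and keeping the space glyphs restores it. That would be consistent with the two lines behaving differently even though both contain explicit spaces, but it remains a guess until checked against the real layout code.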

sebbASF avatar Oct 17 '21 21:10 sebbASF