This gem is not able to extract the line near a PDF page break

Open shibli786 opened this issue 8 years ago • 1 comments

Issue 1: This gem is sometimes unable to extract the line just before a PDF page break. I have attached the PDF file. Extract the text and check page 2: the last line (just before the page break) is missing from the output. The line is:

  **"TA PIMCO Total Return- Service Class 3 5/1/2002 -0.36 3.72 0.92 1.68 0.62 3.13 3.22"**

Issue 2: If the text contains a subscript, it is sometimes appended directly to the word and sometimes placed on a new line (\n). Please extract the content of abc.pdf and check the resulting text file.

Nov 11 '17 19:11 shibli786

Issue one seems to have been resolved - I can't reproduce it on the latest release (v2.2.1).

Issue two will be harder to address in a consistent way.

In this particular PDF, the superscript numbers are regular numbers printed in a smaller font (not Unicode superscript code points). That makes it hard to reliably identify them as superscripts.

With a bit of tweaking to the page layout algorithm it'd probably be possible to have them rendered on the same line as the text they're associated with, but they'd appear as full-height, normal numbers.
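To illustrate the kind of tweak I mean, here's a rough sketch in plain Ruby. It is not the gem's actual layout code: the `Run` struct, the `group_into_lines` method and the `tolerance` value are all made up for this example. The idea is to assign each text run to the nearest line whose baseline is within a fraction of that line's font size, so a small raised number joins the word it follows instead of starting a new line.

```ruby
# Hypothetical text run: a string with its position and font size.
# This is NOT a pdf-reader class; it only stands in for whatever
# positioned-text structure a layout algorithm works with.
Run = Struct.new(:text, :x, :y, :font_size)

# Group runs into lines of text. A run joins the nearest existing line when
# its baseline is within `tolerance * font size of that line`; otherwise it
# starts a new line. The tolerance of 0.6 is an assumed value, not a tuned one.
def group_into_lines(runs, tolerance: 0.6)
  lines = []
  # Place larger runs first so body text defines each line's baseline.
  runs.sort_by { |r| -r.font_size }.each do |run|
    nearest = lines.min_by { |l| (l[:y] - run.y).abs }
    if nearest && (nearest[:y] - run.y).abs <= tolerance * nearest[:font_size]
      nearest[:runs] << run
    else
      lines << { y: run.y, font_size: run.font_size, runs: [run] }
    end
  end
  # PDF y coordinates grow upward, so the top line has the largest y.
  lines.sort_by { |l| -l[:y] }.map do |l|
    l[:runs].sort_by(&:x).map(&:text).join(" ")
  end
end

runs = [
  Run.new("Service", 10, 700, 10),
  Run.new("Class", 60, 700, 10),
  Run.new("3", 90, 704, 6), # small, raised number (the "superscript")
  Run.new("5/1/2002", 100, 700, 10),
  Run.new("Next", 10, 688, 10)
]
puts group_into_lines(runs)
```

Note that the "3" comes out as an ordinary full-height digit on the same line as its neighbours, which is the limitation described above: nothing in the output marks it as a superscript.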

Oct 26 '19 12:10 yob