
completely different output for table data (2.7.0 vs 2.8.0)

Open andus4n opened this issue 6 months ago • 4 comments

  • PHP Version: 7.4 / 8.2 (same output)
  • PDFParser Version: 2.7.0 vs 2.8.0

Description:

I've been using this library for more than a year, and until version 2.8.0 I didn't have a single issue with it. After updating to 2.8.0, I'm getting completely different output for the same PDF file. Unfortunately, this output can't be parsed to extract the data I'm interested in.

PDF input

c0

Expected output & actual output

2.7.0 (this is ok and can easily be parsed)

c1

2.8.0 (this can't be parsed)

c2

Code

file_put_contents('./test2.dat', (new \Smalot\PdfParser\Parser())->parseFile('./invoice.pdf')->getText());

andus4n avatar Feb 09 '24 18:02 andus4n

Never mind, I found a way to make it work with 2.8.0... but this should still be investigated.

andus4n avatar Feb 10 '24 08:02 andus4n

CC @GreyWyvern you may be interested in this.

k00ni avatar Feb 12 '24 07:02 k00ni

Unfortunately this results from the new algorithm in 2.8.0 being more exact about spacing and line-feeds. It helps make normally extracted text from paragraphs better, but the "logic" of text in tables suffers. :|

You can see that 2.8.0 puts newlines in the output exactly where it sees them. When the document moves the cursor back up to the line above to start the next cell over, it interprets that move as a place where a newline should be added, too.

It's only because 2.7.0 was very lenient with spacing (the newlines in the cells are not enough to trigger a newline in the output) that the resulting text appears more "logical". I'm not sure how this would be fixed definitively, but we could:

  • Offer a user setting that makes detection of newlines more like 2.7.0, although this would affect text outside of tables as well.
  • Could we possibly detect if we're in a table? If so, we could change the spacing rules for text encountered in there. This is probably a long-shot though.
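
To make the second idea a bit more concrete, here is a minimal sketch of treating table text by position instead of by stream order. It is not pdfparser code: `groupFragmentsIntoRows()`, the `x`/`y`/`text` fragment shape and the `$yTolerance` value are all assumptions made for illustration, and it presumes fragment coordinates are available from some positional extraction step. Fragments whose vertical positions differ by no more than the tolerance are treated as one row, so a cursor move "back up" to the next cell no longer produces a newline.

```php
<?php
// Hypothetical helper, not part of pdfparser.
// Each fragment is assumed to look like ['x' => float, 'y' => float, 'text' => string].
function groupFragmentsIntoRows(array $fragments, float $yTolerance = 2.0): array
{
    // PDF y coordinates grow upward, so sort top-to-bottom by descending y.
    usort($fragments, fn (array $a, array $b): int => $b['y'] <=> $a['y']);

    $rows = [];
    $rowY = null;
    foreach ($fragments as $fragment) {
        // Start a new row only when the vertical jump exceeds the tolerance.
        if ($rowY === null || abs($rowY - $fragment['y']) > $yTolerance) {
            $rows[] = [];
            $rowY = $fragment['y'];
        }
        $rows[count($rows) - 1][] = $fragment;
    }

    // Within each row, order cells left-to-right and join them with tabs.
    return array_map(function (array $row): string {
        usort($row, fn (array $a, array $b): int => $a['x'] <=> $b['x']);
        return implode("\t", array_column($row, 'text'));
    }, $rows);
}

// Made-up fragments in "stream order": the first column top to bottom,
// then the cursor jumps back up for the second column.
$fragments = [
    ['x' => 50.0,  'y' => 700.0, 'text' => 'Item'],
    ['x' => 50.0,  'y' => 686.0, 'text' => 'Widget'],
    ['x' => 200.0, 'y' => 700.5, 'text' => 'Price'],
    ['x' => 200.0, 'y' => 686.0, 'text' => '9.99'],
];
$rows = groupFragmentsIntoRows($fragments);
```

This only shows the row-grouping step. Deciding whether a region is a table at all, and handling cells that wrap over several lines, would still need extra heuristics.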

GreyWyvern avatar Feb 13 '24 16:02 GreyWyvern

> Could we possibly detect if we're in a table? If so, we could change the spacing rules for text encountered in there. This is probably a long-shot though.

This sounds pretty good, but is it even possible with PDFs? Also, I have a suggestion (a little off-topic): it'd be great if you could implement some kind of line-by-line stream (like a generator) for getText, so that the whole text isn't loaded into memory at once.
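
Roughly what I have in mind, as a sketch only: `iterateLines()` is a made-up helper, not an existing pdfparser method, and the `$pages` array just stands in for per-page text. It yields one line at a time, so the caller never has to hold the joined text of the whole document.

```php
<?php
// Hypothetical helper, not part of pdfparser: lazily yield lines
// from any iterable of page texts (an array or another generator).
function iterateLines(iterable $pageTexts): Generator
{
    foreach ($pageTexts as $pageText) {
        // \R matches any newline sequence (\n, \r\n, \r).
        foreach (preg_split('/\R/', $pageText) as $line) {
            yield $line;
        }
    }
}

// Stand-in data; in real use each element would be the text of one page.
$pages = ["Invoice 42\nItem\tPrice", "Widget\t9.99\nTotal\t9.99"];

foreach (iterateLines($pages) as $line) {
    echo $line, PHP_EOL;
}
```

With the library, the page texts could presumably come from looping over `$pdf->getPages()` and calling `getText()` on each page. As far as I can tell the document would still be fully parsed by `parseFile()`, so this would only avoid building one big string, not the parsing cost itself.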

andus4n avatar Feb 13 '24 19:02 andus4n