pdfparser completely different output for table data (2.7.0 vs 2.8.0)

PHP Version: 7.4 / 8.2 (same output)
PDFParser Version: 2.7.0 vs 2.8.0

Description:

I've been using this library for more than a year, and until version 2.8.0 I didn't have a single issue with it. After updating to 2.8.0, I get completely different output for the same PDF file. Unfortunately, this output can't be parsed to extract the data I'm interested in.

PDF input

Expected output & actual output

2.7.0 (this is ok and can easily be parsed)

2.8.0 (this can't be parsed)

Code

file_put_contents('./test2.dat', (new \Smalot\PdfParser\Parser())->parseFile('./invoice.pdf')->getText());

Feb 09 '24 18:02 andus4n

Never mind, I found some logic that makes it work with 2.8.0, but this should still be investigated.

Feb 10 '24 08:02 andus4n

CC @GreyWyvern you may be interested in this.

Feb 12 '24 07:02 k00ni

Unfortunately this results from the new algorithm in 2.8.0 being more exact about spacing and line-feeds. It helps make normally extracted text from paragraphs better, but the "logic" of text in tables suffers. :|

You can see that 2.8.0 puts newlines in the output exactly where it sees them. When the document moves the cursor back up to the line above, into the next cell over, it also interprets that move as a place where a newline should be added.

It's only because 2.7.0 was very lenient with spacing (the newlines in the cells are not enough to trigger a newline in the output) that the resulting text appears more "logical". I'm not sure how this would be fixed definitively, but we could:

- Offer a user setting that makes detection of newlines more like 2.7.0. However, this would affect text outside of tables as well.
- Could we possibly detect if we're in a table? If so, we could change the spacing rules for text encountered in there. This is probably a long shot though.

Feb 13 '24 16:02 GreyWyvern
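
As a possible workaround on the caller's side (not a fix in the library), table rows can sometimes be rebuilt from positioned text instead of from getText(). The sketch below assumes each fragment has the shape [[a, b, c, d, x, y], text], which is what Page::getDataTm() is documented to return. The function name groupFragmentsIntoRows and the $tolerance parameter are hypothetical, and a cell that wraps over several lines will still end up spread across several rows.

```php
<?php
/**
 * Group positioned text fragments into rows by their Y coordinate.
 *
 * Assumption: each fragment looks like [[a, b, c, d, x, y], text].
 * $tolerance (hypothetical) is the largest Y difference that still
 * counts as "the same row".
 */
function groupFragmentsIntoRows(array $fragments, float $tolerance = 2.0): array
{
    $rows = [];
    foreach ($fragments as [$matrix, $text]) {
        $x = (float) $matrix[4];
        $y = (float) $matrix[5];
        $placed = false;
        foreach ($rows as &$row) {
            // Compare against the Y of the first fragment seen in this row.
            if (abs($row['y'] - $y) <= $tolerance) {
                $row['cells'][] = ['x' => $x, 'text' => $text];
                $placed = true;
                break;
            }
        }
        unset($row);
        if (!$placed) {
            $rows[] = ['y' => $y, 'cells' => [['x' => $x, 'text' => $text]]];
        }
    }

    // PDF Y coordinates grow upwards, so descending Y is top to bottom.
    usort($rows, fn (array $a, array $b) => $b['y'] <=> $a['y']);

    $lines = [];
    foreach ($rows as $row) {
        // Within a row, order the cells left to right.
        usort($row['cells'], fn (array $a, array $b) => $a['x'] <=> $b['x']);
        $lines[] = implode("\t", array_column($row['cells'], 'text'));
    }
    return $lines;
}

// Usage with the library (untested against the PDF from this issue):
// $pdf = (new \Smalot\PdfParser\Parser())->parseFile('./invoice.pdf');
// foreach ($pdf->getPages() as $page) {
//     print_r(groupFragmentsIntoRows($page->getDataTm()));
// }
```

Each returned string is one visual row with its cells separated by tabs, so the result does not depend on where getText() decides to place newlines.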

> Could we possibly detect if we're in a table? If so, we could change the spacing rules for text encountered in there. This is probably a long shot though.

This sounds pretty good, but is it even possible with PDFs? Also, I have a slightly off-topic suggestion: it would be great if you could implement some kind of line-by-line stream (like a generator) for getText, so that not all of the text is loaded into memory at once.

Feb 13 '24 19:02 andus4n
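
Until something like that exists in the library, the generator idea can be approximated in user code. This is only a sketch: iterateLines is a hypothetical helper, not part of pdfparser, and it accepts any iterable of objects that expose getText(), such as the pages returned by Document::getPages(). Note that parseFile() still parses the whole document up front, so this only avoids building one large text string for all pages.

```php
<?php
/**
 * Yield text one line at a time, page by page.
 *
 * $pages is any iterable of objects with a getText() method
 * (assumption: the array returned by Document::getPages() qualifies).
 */
function iterateLines(iterable $pages): \Generator
{
    foreach ($pages as $page) {
        // Only one page's text is held as a string at any time.
        foreach (preg_split('/\R/', $page->getText()) as $line) {
            yield $line;
        }
    }
}

// Usage with the library:
// $pdf = (new \Smalot\PdfParser\Parser())->parseFile('./invoice.pdf');
// foreach (iterateLines($pdf->getPages()) as $line) {
//     // process $line here
// }
```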
