pdfparser
completely different output for table data (2.7.0 vs 2.8.0)
- PHP Version: 7.4 / 8.2 (same output)
- PDFParser Version: 2.7.0 vs 2.8.0
Description:
I've been using this library for more than a year, and until version 2.8.0 I didn't have a single issue with it. After updating to 2.8.0, I get completely different output for the same PDF file. Unfortunately, this output can't be parsed to extract the data I'm interested in.
PDF input
Expected output & actual output
2.7.0 (this is ok and can easily be parsed)
2.8.0 (this can't be parsed)
Code
file_put_contents('./test2.dat', (new \Smalot\PdfParser\Parser())->parseFile('./invoice.pdf')->getText());
Never mind, I found a way to make it work with 2.8.0, but this should still be investigated.
CC @GreyWyvern you may be interested in this.
Unfortunately this results from the new algorithm in 2.8.0 being more exact about spacing and line-feeds. It helps make normally extracted text from paragraphs better, but the "logic" of text in tables suffers. :|
You can see that 2.8.0 puts newlines in the output exactly where it sees them. When the document moves the cursor back up to the line above, in the next cell over, it interprets that move as a place for a newline too.
It's only because 2.7.0 was very lenient with spacing (the newlines in the cells are not enough to trigger a newline in the output) that the resulting text appears more "logical". I'm not sure how this would be fixed definitively, but we could:
- Offer a user setting that makes detection of newlines more like 2.7.0. However, this would affect text outside of tables as well.
- Could we possibly detect if we're in a table? If so, we could change the spacing rules for text encountered in there. This is probably a long-shot though.
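To make the difference concrete, here is a minimal sketch of the two behaviours described above. It is not the library's code: `joinFragments()`, the fragment array shape, and the threshold values are all hypothetical, chosen only to illustrate how a strict newline rule and a lenient one treat the same table cells when the cursor jumps back up for the next cell.

```php
<?php
/**
 * Join positioned text fragments in stream order (hypothetical helper,
 * not part of PdfParser). A newline is emitted only when the vertical
 * move between two consecutive fragments exceeds $newlineThreshold;
 * otherwise the fragments are separated by a single space.
 *
 * @param array<int, array{y: float, text: string}> $fragments
 */
function joinFragments(array $fragments, float $newlineThreshold): string
{
    $out = '';
    $prevY = null;

    foreach ($fragments as $fragment) {
        if ($prevY !== null) {
            // Distance moved vertically, regardless of direction (up or down).
            $dy = abs($fragment['y'] - $prevY);
            $out .= $dy > $newlineThreshold ? "\n" : ' ';
        }
        $out .= $fragment['text'];
        $prevY = $fragment['y'];
    }

    return $out;
}

// Two table cells, each holding two short lines. The stream finishes the
// first cell, then moves the cursor back UP to start the second cell.
$cells = [
    ['y' => 700.0, 'text' => 'Item'],
    ['y' => 690.0, 'text' => 'details'],
    ['y' => 700.0, 'text' => 'Price'],   // cursor moved back up
    ['y' => 690.0, 'text' => '10.00'],
];

// Strict rule: any vertical movement counts as a newline.
$strict = joinFragments($cells, 0.0);

// Lenient rule: small vertical moves are treated as the same line.
$lenient = joinFragments($cells, 12.0);
```

With the strict rule every fragment lands on its own line; with the lenient rule the four fragments stay on one line. The same threshold applies to all text, which is why a single setting like this would also change how ordinary paragraphs come out.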
> Could we possibly detect if we're in a table? If so, we could change the spacing rules for text encountered in there. This is probably a long-shot though.
This sounds pretty good, but is it even possible with PDFs?
Also, I have a suggestion (a little off-topic): it'd be great if you could implement some kind of line-by-line stream (like a generator) for getText, so that all of the text isn't loaded into memory at once.
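As a rough illustration of that suggestion, the sketch below wraps page-level text extraction in a generator. `iterateLines()` is a hypothetical helper, not an existing PdfParser function; it only assumes each page object exposes a `getText()` method returning a string. Note the limit of this approach: `parseFile()` still parses the whole document up front, so this only avoids building one large string for the full text.

```php
<?php
/**
 * Yield text one line at a time, page by page (hypothetical helper).
 *
 * @param iterable $pages Objects exposing getText(): string
 */
function iterateLines(iterable $pages): \Generator
{
    foreach ($pages as $page) {
        // \R matches any line-break sequence (\n, \r\n, \r, ...).
        foreach (preg_split('/\R/', $page->getText()) as $line) {
            yield $line;
        }
    }
}

// Possible usage with the library, assuming Document::getPages() returns
// the page objects:
//
// $pdf = (new \Smalot\PdfParser\Parser())->parseFile('./invoice.pdf');
// foreach (iterateLines($pdf->getPages()) as $line) {
//     // handle one line at a time
// }
```

Only one page's text is held as a string at any time, and the caller can stop iterating early without extracting text from the remaining pages.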