
Text layout is unpredictable

Open DaLiV opened this issue 6 months ago • 4 comments

Often you have a PDF that was generated by some customer app, with a non-obvious order of objects. If you need to parse the text in "page-position order", you can get crap data from tables.

Try extracting data from the attached PDF: Pdf-Layout-Issue.pdf

DaLiV avatar Feb 01 '24 22:02 DaLiV

Please be more specific. What PHP version? What version of PDFParser? What do you mean by "crap"?

k00ni avatar Feb 02 '24 06:02 k00ni

I do get what @DaLiV is saying here. This is not really a bug, but more of a consequence of the way PdfParser currently handles PDF document streams.

A PDF document stream can include positioning commands between commands that print text. 19 times out of 20 the software creating the document will order the stream so that the upper left-most text is first, followed line-by-line down and left to right, until the bottom of the document is reached. However, it's entirely possible for a document to print some text (Item 1) in the upper left, send a position command to move to the very bottom of the document, print some more text (Item 2), then send another position command that moves it back up to the middle of the document to print more text (Item 3). In such a case, from a visual, human perspective, "Item 1" is first, then "Item 3", then "Item 2". But from a PDF document stream perspective, the order is "Item 1", "Item 2", "Item 3".

If that sounds complicated, an easier way to show it is to open the OP's example PDF and try to select text starting at "+++BEGIN+++". It will not be intuitive because the order the text visually appears on the page is not the same order as it appears in the document stream.

PdfParser builds the output string by adding text to the end of it. When a position command moves the cursor up and/or to the left, this would require insertion of text into the string PdfParser has already built. Right now there is no way for PdfParser to know where to make that insertion. It really only knows: "did I move right more than X units? if so, add a space or tab", and "did I move down more than Y units? if so add a newline". There is no accounting for up and left.
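To make the limitation concrete, here is a minimal sketch of an append-only builder of the kind described above. It is not PdfParser's actual code: `appendOnlyText()`, the `[x, y, text]` input shape, and the gap thresholds are all assumptions made for illustration, and `y` is measured from the top of the page to keep the arithmetic simple. Text width is ignored.

```php
<?php
// Hypothetical sketch, NOT PdfParser's real implementation: an append-only
// text builder that only reacts to rightward and downward cursor moves.

/**
 * @param array $ops  list of [x, y, text] in document-stream order
 *                    (y measured from the top of the page, an assumption)
 */
function appendOnlyText(array $ops, float $gapX = 5.0, float $gapY = 5.0): string
{
    $out = '';
    $prevX = null;
    $prevY = null;
    foreach ($ops as [$x, $y, $text]) {
        if ($prevY !== null) {
            if ($y - $prevY > $gapY) {
                $out .= "\n";   // moved down: start a new line
            } elseif ($x - $prevX > $gapX) {
                $out .= ' ';    // moved right: add a space
            }
            // Moves up or to the left fall through: the text is simply appended.
        }
        $out .= $text;
        $prevX = $x;
        $prevY = $y;
    }
    return $out;
}

$ops = [
    [10, 10, 'Item 1'],   // upper left
    [10, 700, 'Item 2'],  // jump to the bottom of the page
    [10, 350, 'Item 3'],  // jump back up to the middle
];
echo appendOnlyText($ops), "\n";
```

With this input, "Item 3" is glued directly after "Item 2" even though it sits above it on the page, because the upward move has no rule.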

To "fix" this would require a fundamental change in the way PdfParser collects the text from a PDF. Instead of building a string by adding items to the end, text would be slotted into a big matrix at the positions indicated by the document stream. Then when all done, a function would step through the matrix and then build the string.
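A rough sketch of that matrix idea, under the same assumptions as before: `gridText()` and `$rowHeight` are hypothetical names, the input is a list of `[x, y, text]` with `y` measured from the top of the page, and nothing here exists in PdfParser today. Each piece of text is slotted into a row bucket by position, and the string is built only at the end.

```php
<?php
// Hypothetical sketch of the "matrix" approach, not an existing PdfParser feature.

/**
 * @param array $ops  list of [x, y, text] in document-stream order
 *                    (y measured from the top of the page, an assumption)
 */
function gridText(array $ops, float $rowHeight = 10.0): string
{
    $grid = [];
    foreach ($ops as [$x, $y, $text]) {
        $row = (int) floor($y / $rowHeight);  // bucket y into a row index
        $grid[$row][] = [$x, $text];          // slot the text into that row
    }

    ksort($grid, SORT_NUMERIC);               // walk rows top to bottom
    $lines = [];
    foreach ($grid as $cells) {
        usort($cells, fn ($a, $b) => $a[0] <=> $b[0]);  // left to right
        $lines[] = implode(' ', array_column($cells, 1));
    }
    return implode("\n", $lines);
}

$ops = [
    [10, 10, 'Item 1'],   // upper left
    [10, 700, 'Item 2'],  // bottom of the page
    [10, 350, 'Item 3'],  // middle of the page
];
echo gridText($ops), "\n";
```

Here the three items come out in visual order (1, 3, 2 in stream terms) regardless of the order in which the stream printed them. A fixed row height is the simplest possible bucketing; real documents with mixed font sizes would likely need something smarter.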

Is doing that worth it? I don't really know. Some people I'm sure would find it valuable. However, it's also important to consider that if you Ctrl+A and copy the text from the sample document in Adobe Acrobat then paste it into a text editor, you get the same order of text as the current PdfParser gives. So Adobe is handling it the same way as PdfParser.

GreyWyvern avatar Feb 02 '24 15:02 GreyWyvern

Exactly. The order of objects in a PDF is not the same as the visual order on the page. That sample was slightly exaggerated (I have no right to share the actual commercial documents).

But we have tables produced by software with different sub-positioning, and the data parsed from them comes out of order. When such a document is parsed, the strings appear in "random order", and when that data is needed for analysis it becomes unusable.

Visually, PDF viewers show the text as "DocHe ade rP art TableH ea der Tab leD ata", but the smalot parser returns it mixed up, with some header objects even landing in the middle of the data-grid area, for example: "DocHe rP ea ade Tab der leD art ata TableH ". All the letters are there, but reordered like this, which makes the output unreadable and unparsable.

Poppler's utilities (pdftotext -layout ...) do quite accurate transformations of these documents and produce text in mostly correct order (I have not checked every variation, but every document I tried came out close to "WYSIWYG" order). However, poppler is not web-related and not intended for web-app usage ...

DaLiV avatar Feb 02 '24 21:02 DaLiV

Additionally, I tried to work around this issue with getDataTm:

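// $page is a \Smalot\PdfParser\Page, e.g. one entry of $pdf->getPages()
$strData = $page->getDataTm();
$pg = [];
foreach ($strData as $obj) {
    // $obj[0] is the text matrix; elements 4 and 5 are the x and y offsets
    $x = intval($obj[0][4]);
    $y = intval($obj[0][5]);
    $txt = trim($obj[1]);
    $pg[$y][$x] = $txt;
}
// PDF y grows upward, so sort rows descending to read top to bottom
krsort($pg, SORT_NUMERIC);
foreach ($pg as $i => $pgln) {
    // within each row, sort the columns left to right
    ksort($pg[$i], SORT_NUMERIC);
}
var_dump($pg);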

The table header, which is visible and directly selectable left to right on every page of the PDF, reads "Ln Item no Description Inv qty U/M Sales price Amount". I searched for it in the parsed data and got the following results.

In 2.7.0 the table header comes out as follows (page 1 and page 2 respectively), which is quite usable:

array(7) { [14]=>string(2) "Ln" [42]=>string(11) "Item number" [105]=>string(11) "Description" [347]=>string(7) "Inv qty" [400]=>string(3) "U/M" [432]=>string(11) "Sales price" [539]=>string(6) "Amount" }
array(4) { [14]=>string(10) "Ln Item no" [105]=>string(11) "Description" [347]=>string(23) "Inv qty U/M Sales price" [539]=>string(6) "Amount" }

So the parsing is partially usable, but for the table data the $y positions of some parts need coarser rounding, because the calculated values split a single visual line across several rows.

In 2.8.0 (page 1 and page 2 respectively):

array(7) { [14]=>string(8) "Currency" [42]=>string(2) "Ln" [105]=>string(11) "Sales price" [347]=>string(11) "Item number" [400]=>string(7) "Inv qty" [432]=>string(3) "U/M" [539]=>string(11) "Description" }
array(4) { [14]=>string(6) "Broker" [105]=>string(23) "Inv qty U/M Sales price" [347]=>string(10) "Ln Item no" [539]=>string(11) "Description" }

Every table line is out of order as well.
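The coarser rounding of $y wished for in the 2.7.0 results could be sketched roughly as below. This is only an illustration: `groupRows()` and `$tolerance` are made-up names, and the sample entries are invented data shaped like getDataTm() output (a six-element text matrix followed by the text), not values from the real document. Rather than truncating with intval(), it treats entries whose y values differ by less than a tolerance as the same row. It does nothing about the 2.8.0 output, where the texts themselves appear at the wrong positions.

```php
<?php
// Sketch only: groups getDataTm()-shaped entries into rows using a y tolerance.
// groupRows() and $tolerance are hypothetical, not part of PdfParser.

function groupRows(array $items, float $tolerance = 3.0): array
{
    // PDF y grows upward, so sort descending to go top to bottom.
    usort($items, fn ($a, $b) => (float) $b[0][5] <=> (float) $a[0][5]);

    $rows = [];
    $rowY = null;
    foreach ($items as [$tm, $text]) {
        $y = (float) $tm[5];
        if ($rowY === null || $rowY - $y > $tolerance) {
            $rows[] = [];   // start a new row
            $rowY = $y;     // anchor the row at its first (highest) y
        }
        $rows[count($rows) - 1][] = [(float) $tm[4], trim($text)];
    }

    $lines = [];
    foreach ($rows as $cells) {
        usort($cells, fn ($a, $b) => $a[0] <=> $b[0]);  // left to right
        $lines[] = implode(' ', array_column($cells, 1));
    }
    return $lines;
}

// Invented sample: three header cells whose y values differ slightly.
$items = [
    [['1', '0', '0', '1', '105', '700.4'], 'Description'],
    [['1', '0', '0', '1', '14', '701.2'], 'Ln'],
    [['1', '0', '0', '1', '42', '699.8'], 'Item number'],
    [['1', '0', '0', '1', '14', '680.0'], '1'],
];
print_r(groupRows($items));
```

With intval() the three header cells would land in rows 700, 701 and 699; with a tolerance of 3 units they collapse into one line. The right tolerance depends on the font size and line spacing of the document, so it would need tuning per layout.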

DaLiV avatar Feb 02 '24 21:02 DaLiV