getDataTm() provides wrong coordinates for text blocks

Open parpalak opened this issue 1 year ago • 1 comments

I found an issue with the getDataTm() method in version 2.11. In some cases, the result contains text from a neighboring block instead of the block specified by the coordinates. The reason is that the PDFObject::getTextArray() method returns some text from a "Do" command at the location of certain xobjects: https://github.com/smalot/pdfparser/blob/ac8e6678b0940e4b2ccd5caadd3fb18e68093be6/src/Smalot/PdfParser/PDFObject.php#L785

Then, inside the getDataTm() method, strings from PDFObject::getTextArray() are matched with commands returned by the Page::getDataCommands() method: https://github.com/smalot/pdfparser/blob/ac8e6678b0940e4b2ccd5caadd3fb18e68093be6/src/Smalot/PdfParser/Page.php#L730 https://github.com/smalot/pdfparser/blob/ac8e6678b0940e4b2ccd5caadd3fb18e68093be6/src/Smalot/PdfParser/Page.php#L685

However, the latter does not return the "Do" command, so there are more elements in PDFObject::getTextArray() than in Page::getDataCommands(), leading to a mismatch.
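To make the mismatch concrete, here is a minimal sketch of it in plain PHP. It is a simplified model and not pdfparser's actual code: the array shapes and the pairByIndex() helper are hypothetical, and the assumption that an XObject painted by "Do" contributes a single ' ' string is inferred from the behaviour described in this thread.

```php
<?php
// Illustrative only: simplified stand-ins for the two lists,
// not pdfparser's real data structures.

// Text-showing commands, standing in for what Page::getDataCommands()
// yields. Note that there is no "Do" entry here.
$commands = [
    ['op' => 'Tj', 'pos' => [10, 700]],
    ['op' => 'Tj', 'pos' => [10, 680]],
    ['op' => 'Tj', 'pos' => [10, 660]],
];

// Strings, standing in for PDFObject::getTextArray(). The XObject painted
// by "Do" (assumed here to yield ' ') adds one extra entry.
$texts = ['Title', ' ', 'First line', 'Second line'];

// Hypothetical helper: pair the i-th command with the i-th string.
function pairByIndex(array $commands, array $texts): array
{
    $pairs = [];
    foreach ($commands as $i => $command) {
        $pairs[] = ['pos' => $command['pos'], 'text' => $texts[$i] ?? null];
    }

    return $pairs;
}

$pairs = pairByIndex($commands, $texts);
// $pairs[1] puts ' ' at [10, 680], where 'First line' belongs,
// and 'Second line' is never paired with any position.
```

In this model every string after the extra entry is shifted by one position, and the final string is dropped, which matches the symptom of text appearing at a neighboring block's coordinates.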

Unfortunately, I cannot provide a minimal PDF example. The files I have to parse are too large, and I don't know how they were generated. In my case, commenting out $text[] = $xobject->getText($page); helped. Since I'm not sure what the original intent of handling "Do" was, I cannot suggest a pull request that would fix this issue.

parpalak avatar Sep 02 '24 21:09 parpalak

I also had this problem, and made a workaround for myself in this if: https://github.com/smalot/pdfparser/blob/ac8e6678b0940e4b2ccd5caadd3fb18e68093be6/src/Smalot/PdfParser/PDFObject.php#L783

I changed it from

if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
    // Not a circular reference.
    $text[] = $xobject->getText($page);
}

to

if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
    // Not a circular reference.

    // Only add the text if the XObject actually produced any. Otherwise the number of texts and the number of TJ/Tj commands don't match, and the last texts will be ignored.
    $newText = $xobject->getText($page);
    if($newText === ' ') {
        break;
    }
    $text[] = $newText;
}

I didn't create a PR because I wasn't 100% sure whether this is the correct fix or just a dirty workaround. But maybe this can help someone with the same problem.

DominikDostal avatar Sep 03 '24 07:09 DominikDostal

I encountered the same issue. When using pdftohtml, I was able to obtain correct coordinates, which suggests the PDF file itself is fine. However, I noticed that the coordinates of two text elements were swapped in practice. This likely indicates a bug in some part of the code, though I don't have time to investigate and fix it thoroughly. I hope this observation can serve as a reference for others facing similar problems.

loveyu avatar May 16 '25 02:05 loveyu

@DominikDostal I think your solution isn't far off being the right thing to do.

From my own investigation, I can see exactly what both @DominikDostal and @parpalak found: including the Do commands when assembling the text array for a page inserts too many entries into the array. Everything after that point is then out of sync, and the last few entries may not be used at all.

From the PDF spec, the Do command does this:

Paint the specified XObject. The operand name must appear as a key in the XObject subdictionary of the current resource dictionary (see Section 3.7.2, “Resource Dictionaries”). The associated value must be a stream whose Type entry, if present, is XObject. The effect of Do depends on the value of the XObject’s Subtype entry, which may be Image (see Section 4.8.4, “Image Dictionaries”), Form (Section 4.9, “Form XObjects”), or PS (Section 4.7.1, “PostScript XObjects”).

If I set a breakpoint at the point where PDFObject::getTextArray() handles 'Do' commands, I only ever see XObjects of type Image. If we change the code to ignore Image XObjects entirely when assembling the text array, that fixes the issue in the same way as @DominikDostal's workaround did, but still allows the other XObject types to return text. I'm not sure whether it's appropriate to skip the Form and PS types here as well, but we can fix this bug where it's caused by images without altering the behaviour for other XObject types for the moment.
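As a rough sketch of that idea, the filtering could look something like the following. This is not the code from the linked change and it does not use pdfparser's API: plain arrays stand in for XObjects, the 'subtype' and 'text' keys and the collectXObjectTexts() helper are hypothetical, and the ' ' text for images is an assumption based on the workaround above.

```php
<?php
// Illustrative only: plain arrays stand in for XObjects;
// this is not pdfparser's API.

// Hypothetical helper: collect text from painted XObjects, skipping images.
function collectXObjectTexts(array $xobjects): array
{
    $texts = [];
    foreach ($xobjects as $xobject) {
        if ('Image' === $xobject['subtype']) {
            // Skip images so that they add no entry to the text array.
            continue;
        }
        // Form and PS XObjects keep their current behaviour.
        $texts[] = $xobject['text'];
    }

    return $texts;
}

$xobjects = [
    ['subtype' => 'Image', 'text' => ' '],
    ['subtype' => 'Form', 'text' => 'Form text'],
    ['subtype' => 'Image', 'text' => ' '],
];

$texts = collectXObjectTexts($xobjects);
```

Filtering on the subtype rather than on the returned string avoids dropping a Form XObject whose text happens to be a single space, which is the main difference from the `=== ' '` check in the earlier workaround.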

rupertj avatar Jul 16 '25 11:07 rupertj

The code in the linked MR works well for me. I now get all the expected text returned. I also had some work in progress to find the text that corresponds to link annotations, and that's producing much better results, which I think is due to the text positioning being better.

rupertj avatar Jul 16 '25 11:07 rupertj

@DominikDostal, @parpalak and the others: does #775 solve the problem?

k00ni avatar Jul 29 '25 14:07 k00ni

Sadly, after 10 months I don't remember which document I had this problem with, so I don't have any way to test it. Most documents worked just fine, after all. But looking at the code changes and comparing them to what I did back then as a workaround, this looks very promising.

DominikDostal avatar Jul 30 '25 06:07 DominikDostal