
PDFObject::getTextArray() shouldn't include XObject Forms.

Open rupertj opened this issue 2 months ago • 1 comments

This is a follow-up to #733, but it could also fix some other issues, such as #671.

In #733, I wrote: "I'm not sure if it's appropriate to skip Form and PS types here or not". I now think skipping Form XObjects is the right thing to do. Note that a Form XObject isn't a form in the sense of something that captures data. It's a Form as in a Shape: a self-contained content stream that can be painted onto a page. (Section 8.10.1 of https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf#page=217 helped my understanding here.)

I'd seen the consistent off-by-one error mentioned in #671 in the PDFs I'm working with. I have some code that's using the position of text to try to find the text inside a given annotation, and it was frequently finding the text to the left of the given annotation.
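As a rough illustration of that kind of position-based lookup, here is a minimal sketch. It assumes entries shaped like the output of getDataTm(), i.e. `[[a, b, c, d, x, y], text]`. The function name `textInsideRect`, the rectangle format and the sample data are all hypothetical and not taken from the code or PDFs in this issue.

```php
<?php
/**
 * Returns the text fragments whose text-matrix origin falls inside a rectangle.
 *
 * $dataTm: entries shaped like getDataTm() output: [[a, b, c, d, x, y], text].
 * $rect:   [llx, lly, urx, ury], like a PDF annotation /Rect (hypothetical input).
 */
function textInsideRect(array $dataTm, array $rect): array
{
    [$llx, $lly, $urx, $ury] = $rect;
    $found = [];
    foreach ($dataTm as [$tm, $text]) {
        // Indexes 4 and 5 of the text matrix hold the X and Y translation.
        $x = (float) $tm[4];
        $y = (float) $tm[5];
        if ($x >= $llx && $x <= $urx && $y >= $lly && $y <= $ury) {
            $found[] = $text;
        }
    }
    return $found;
}

// Made-up sample data for illustration only.
$sample = [
    [['1', '0', '0', '1', '72.0', '700.0'], 'left of the annotation'],
    [['1', '0', '0', '1', '150.0', '700.0'], 'inside the annotation'],
];
print_r(textInsideRect($sample, [140.0, 690.0, 300.0, 710.0]));
```

If each text fragment is paired with the coordinates belonging to its neighbour, a lookup like this returns the neighbouring text instead, which matches the symptom described above.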

Each of my PDFs has a Form early on in each page's data, which I believe is the use case of "a form XObject may serve as the template for an entire page" mentioned in the spec.

Skipping the Form when creating the text array, much like we did for Images in the fix for #733, removes this off-by-one error for me.
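One way to picture why an early Form could shift everything by one is the toy model below. It is not pdfparser's implementation: the content stream is reduced to plain arrays, and `buildTextArray`, the `op`/`name`/`text` keys and the sample data are hypothetical. The only detail taken from the PDF spec is that an XObject's Subtype may be Image, Form or PS.

```php
<?php
/**
 * Toy model: collects text from a simplified list of content stream commands.
 * 'Tj' commands carry text directly; 'Do' commands paint a named XObject.
 */
function buildTextArray(array $commands, array $xobjects, bool $skipForms): array
{
    $text = [];
    foreach ($commands as $command) {
        if ($command['op'] === 'Tj') {
            $text[] = $command['text'];
            continue;
        }
        if ($command['op'] === 'Do') {
            $xobject = $xobjects[$command['name']] ?? null;
            // Unknown XObjects and Images never contribute text.
            if ($xobject === null || $xobject['Subtype'] === 'Image') {
                continue;
            }
            // The change proposed in this issue: leave Form XObjects out too.
            if ($skipForms && $xobject['Subtype'] === 'Form') {
                continue;
            }
            $text[] = $xobject['text'];
        }
    }
    return $text;
}

// Made-up page: a Form painted first, then two lines of text.
$xobjects = ['Fm0' => ['Subtype' => 'Form', 'text' => 'page template']];
$commands = [
    ['op' => 'Do', 'name' => 'Fm0'],
    ['op' => 'Tj', 'text' => 'first line'],
    ['op' => 'Tj', 'text' => 'second line'],
];
print_r(buildTextArray($commands, $xobjects, false)); // three entries
print_r(buildTextArray($commands, $xobjects, true));  // two entries
```

In this model, the extra entry contributed by the Form sits at the start of the array, so every later entry ends up one index further along than a consumer that pairs text with positions might expect.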

MR and examples to follow.

rupertj, Oct 30 '25 10:10

To demonstrate the issue, use this PDF:

Corporate Complaints Policy 2024.pdf

And this code:

<?php
require 'vendor/autoload.php';

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('./Corporate Complaints Policy 2024.pdf');
$pages = $pdf->getPages();

// Page indexes are zero-based, so index 3 is the 4th page.
print_r($pages[3]->getDataTm());

We'll concentrate on the 4th page (hence the index of 3).

Without the change in the PR, we get this output: before.txt

With the change, we get this output: after.txt

I'd recommend putting both files into a tool that can do a visual diff, and comparing them with the PDF. In after.txt, the changes in the Y value (index 5 of the text matrix) line up much better with where the text wraps in the PDF than they do in before.txt.

E.g.: [Image]

In this screenshot, you can see that the line break between "so that we can ensure" and "confidentiality and privacy." coincides with the change in Y from 423.19 to 409.39 in after.txt. In before.txt, the change is one index out.
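To check this without a visual diff, a small helper can list the indexes at which Y changes. This is a minimal sketch, assuming entries shaped like getDataTm() output. The helper name `lineBreakIndexes` and the tolerance are hypothetical. The two Y values and text fragments come from the example above, while the X values are made up.

```php
<?php
/**
 * Returns the indexes at which the Y value (index 5 of the text matrix)
 * differs from the previous entry, i.e. where a new line appears to start.
 */
function lineBreakIndexes(array $dataTm, float $tolerance = 0.01): array
{
    $breaks = [];
    $previousY = null;
    foreach ($dataTm as $index => $entry) {
        $y = (float) $entry[0][5];
        if ($previousY !== null && abs($y - $previousY) > $tolerance) {
            $breaks[] = $index;
        }
        $previousY = $y;
    }
    return $breaks;
}

// Y values and text from the example above; X values are invented.
$sample = [
    [['1', '0', '0', '1', '72', '423.19'], 'so that we can ensure'],
    [['1', '0', '0', '1', '72', '409.39'], 'confidentiality and privacy.'],
];
print_r(lineBreakIndexes($sample));
```

Running the helper over the arrays behind before.txt and after.txt and comparing the text at each reported index with the wrapped lines in the PDF should show which output is shifted.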

rupertj, Oct 30 '25 12:10