Ignore Form as well as Image XObjects when assembling the text array for a PDFObject.

Open rupertj opened this issue 2 months ago • 4 comments

Fix for #782

rupertj avatar Oct 30 '25 11:10 rupertj

Thank you for your PR.

Is it still work in progress?

If not, there are a few tasks left to solve before I take a closer look. Please read https://github.com/smalot/pdfparser/blob/master/CONTRIBUTING.md for more information.

k00ni avatar Nov 03 '25 07:11 k00ni

Thanks for the reminder @k00ni. I've added test coverage for the change.

rupertj avatar Nov 03 '25 09:11 rupertj

That change from "Imo" to "Im0" was just correcting a typo in the existing test. I didn't spot that I got that wrong when I wrote it.

I could revert that line and submit it as a separate PR if you like? I think keeping the new test coverage in the same method as the existing coverage makes sense, as they're testing the same bit of code.

rupertj avatar Nov 07 '25 10:11 rupertj

Also, to clarify: when the command in the test data is "/Imo Do", the test passes, but for the wrong reason. We're checking for no result for that XObject, and we get no result because it can't find an object called Imo.

When the command is "/Im0 Do", we still get no result, but we're getting it for the right reason. The code finds the XObject, sees that it's an image and then decides not to include it in the text array.
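To illustrate the difference, here is a minimal toy sketch in Python. It is not pdfparser's actual code or API: the `XOBJECTS` table, the `text_for_do` function and the reason strings are all hypothetical, made up only to show why both commands yield an empty result for different reasons.

```python
from typing import List, Tuple

# Hypothetical resource table (an assumption for illustration only):
# XObject name -> (subtype, text the object would contribute).
XOBJECTS = {
    "Im0": ("Image", ""),
    "Fm0": ("Form", "text inside a form"),
}

# Subtypes that are left out of the text array in this toy model.
SKIPPED_SUBTYPES = {"Image", "Form"}


def text_for_do(name: str) -> Tuple[List[str], str]:
    """Return (text fragments, reason) for a '/<name> Do' command."""
    xobject = XOBJECTS.get(name)
    if xobject is None:
        # Lookup failed: nothing was ever inspected.
        return [], "not found"
    subtype, text = xobject
    if subtype in SKIPPED_SUBTYPES:
        # Object was found, recognised, and deliberately skipped.
        return [], "skipped " + subtype
    return [text], "included"


# "/Imo Do" (typo) and "/Im0 Do" both produce no text...
typo_result, typo_reason = text_for_do("Imo")
real_result, real_reason = text_for_do("Im0")
print(typo_result == real_result)  # same empty result
print(typo_reason, "|", real_reason)  # ...but for different reasons
```

In this sketch an assertion that only checks for an empty result cannot tell the typo apart from the intended case, which is why the test data needs to name an XObject that actually exists.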

rupertj avatar Nov 07 '25 10:11 rupertj

Sorry for the delayed response.

I follow your arguments, and the change looks good to me. The documentation provided in #782 was very helpful.

k00ni avatar Nov 24 '25 07:11 k00ni

Thank you!

rupertj avatar Nov 24 '25 09:11 rupertj