pypdf

Images contained in objects of type "/Pattern" are not retrieved

Open 0xNath opened this issue 2 months ago • 8 comments

Explanation

Hello. First of all, thanks for your work; it's a very helpful library.

I am not able to extract images from a PDF generated with OnlyOffice: B2.pdf

After looking into the PDF structure, it seems that the image on this PDF page is contained inside a tiling pattern object, which neither "_page._get_ids_image" nor "_page._get_image" can handle.

I took a look at the PDF standard, and it specifies that tiling patterns can contain images, so this is not an OnlyOffice issue.

I haven't read everything the standard says about patterns yet, but once I have, I'd like to propose a change so that images can at least be retrieved from them: when we get the images of a page, patterns would be considered too.

What do you think about it?

Have a nice day!

0xNath avatar May 01 '24 10:05 0xNath

Thanks for the report. To determine the images associated with a page, pypdf indeed does not consider nested XObjects for image extraction.

stefan6419846 avatar May 01 '24 10:05 stefan6419846

pypdf can look into sub-XObjects. However, here you are looking for an object that is part of a pattern, which in my view is not the right way to do things. Here is a proposal to extract your image:

import pypdf

r = pypdf.PdfReader("B2.pdf")
img = pypdf.filters._xobj_to_image(r.pages[0]["/Resources"]["/Pattern"]["/P1"]["/Resources"]["/XObject"]["/X1"])[2]
img.show()

I will also try to propose an easier way to extract an image. Edit: I've found a better way.

pubpub-zz avatar May 01 '24 11:05 pubpub-zz

With the new PR, extraction will be easier:

import pypdf
r = pypdf.PdfReader("B2.pdf")
img = r.pages[0]["/Resources"]["/Pattern"]["/P1"]["/Resources"]["/XObject"]["/X1"].decode_as_image()
img.show()

pubpub-zz avatar May 01 '24 12:05 pubpub-zz

Wouldn't it be better for the function that is supposed to extract all images of a page to actually extract all of the page's images?

The PDF standard says that images can be stored inside patterns, so we should expect to find images in them.

0xNath avatar May 01 '24 14:05 0xNath

I agree that images can be stored in patterns, but the solution used here is not common. A pattern is expected to be used in a context where it provides an image repeated across a surface. There are too many places where images could be (patterns, annotations, ...), so covering them all would be quite complex. Also, having the image out of context may not be very efficient.

pubpub-zz avatar May 01 '24 16:05 pubpub-zz

We could add a bool parameter (recurse, deepSearch, or whatever) to the _page.images method.

When set to False, the standard methods _page._get_ids_image and _page._get_image would be called, keeping image retrieval in its simplest form: the inline images and image dictionaries of the page.

When set to True, we could call the standard methods and, on top of their results, return images found in "special" cases like patterns.

This way we still keep it efficient for the current usage.
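
To make the idea concrete, here is a rough sketch of what such an opt-in search could look like. It is not pypdf code: iter_images and its deep flag are hypothetical names, the function works on plain nested mappings, and it assumes that pypdf dictionaries resolve indirect references on [] access. The shallow mode only reads the top-level /XObject dictionary and ignores inline images, which is a simplification of what the existing methods do.

```python
from __future__ import annotations

from collections.abc import Iterator, Mapping
from typing import Any


def iter_images(
    resources: Mapping[str, Any],
    deep: bool = False,
    _seen: set[int] | None = None,
) -> Iterator[tuple[str, Any]]:
    """Yield (path, xobject) pairs for image XObjects under a resources dictionary.

    Hypothetical helper, not part of pypdf.
    """
    seen = set() if _seen is None else _seen
    if id(resources) in seen:  # guard against resource dictionaries that reference each other
        return
    seen.add(id(resources))

    # Shallow part: images listed directly in /XObject.
    xobjects = resources["/XObject"] if "/XObject" in resources else {}
    for name in xobjects:
        xobj = xobjects[name]  # [] access on purpose (assumed to resolve indirect objects)
        subtype = xobj["/Subtype"] if "/Subtype" in xobj else None
        if subtype == "/Image":
            yield name, xobj
        elif deep and subtype == "/Form" and "/Resources" in xobj:
            for sub_path, image in iter_images(xobj["/Resources"], deep, seen):
                yield f"{name}{sub_path}", image

    if not deep:
        return

    # Deep part: also look into the resources of each pattern.
    patterns = resources["/Pattern"] if "/Pattern" in resources else {}
    for name in patterns:
        pattern = patterns[name]
        if "/Resources" in pattern:
            for sub_path, image in iter_images(pattern["/Resources"], deep, seen):
                yield f"/Pattern{name}{sub_path}", image


# Toy structure shaped like the one in B2.pdf (an image nested in a tiling pattern).
page_resources = {
    "/XObject": {"/Im0": {"/Subtype": "/Image"}},
    "/Pattern": {
        "/P1": {
            "/PatternType": 1,
            "/Resources": {"/XObject": {"/X1": {"/Subtype": "/Image"}}},
        }
    },
}

shallow = [path for path, _ in iter_images(page_resources)]
deep = [path for path, _ in iter_images(page_resources, deep=True)]
print(shallow)  # ['/Im0']
print(deep)     # ['/Im0', '/Pattern/P1/X1']
```

With a real file, the same function could presumably be fed r.pages[0]["/Resources"], but I have not checked that against every kind of PDF.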

0xNath avatar May 01 '24 17:05 0xNath

We can propose a PR

pubpub-zz avatar May 01 '24 19:05 pubpub-zz

Well, well, well, _page.images isn't a method but a property, so passing a parameter to it isn't an option...
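
A small illustration of the problem, using a made-up Page class rather than pypdf's real page object. The get_images method and its deep flag are hypothetical; they only show one possible way around it, which is to leave the property untouched and put the flag on a separate method.

```python
class Page:
    """Hypothetical stand-in for a page object; not pypdf's PageObject."""

    def __init__(self, top_level, nested):
        self._top_level = list(top_level)  # images found the usual way
        self._nested = list(nested)        # images found in "special" places like patterns

    @property
    def images(self):
        # Accessed as an attribute (page.images), so there is nowhere to pass a flag.
        return list(self._top_level)

    def get_images(self, deep=False):
        # Hypothetical method that carries the opt-in flag instead.
        found = list(self._top_level)
        if deep:
            found.extend(self._nested)
        return found


page = Page(["/Im0"], ["/Pattern/P1/X1"])

print(page.images)                 # ['/Im0']
print(page.get_images(deep=True))  # ['/Im0', '/Pattern/P1/X1']

try:
    page.images(deep=True)  # this calls the returned list, not the property
except TypeError as exc:
    print("TypeError:", exc)
```

Existing code that reads page.images would keep working unchanged with this approach.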

0xNath avatar May 01 '24 20:05 0xNath