pyvips pdfload for multiple pages doesn't work as documented

I was trying to load a PDF using pyvips.Image.pdfload("test.pdf", dpi=300, page=0, n=5, access="sequential"). The documentation (https://libvips.github.io/pyvips/vimage.html#pyvips.Image.pdfload) lists Image and List[Image] as return types, so I assumed that loading multiple pages would return a list of images.

It would be even better if pyvips.Image.thumbnail("test.pdf[page=0,n=5]", 2500) would also return a list of Images!

What actually happens is that the pages are concatenated, one below the other:

>>> pyvips.Image.pdfload("test.pdf", dpi=300, page=0, n=5, access="sequential")
<pyvips.Image 2550x14084 uchar, 4 bands, srgb>
>>> pyvips.Image.thumbnail("test.pdf[page=0,n=10]", 2500)
<pyvips.Image 231x2500 uchar, 4 bands, srgb>

Vips Version: 8.14.1-1 (Debian unstable, recompiled with pdfium support)
pyvips Version: 2.2.1

Jan 27 '23 14:01 JannKleen

Hi @JannKleen,

The pyvips return value documentation is autogenerated, and not very well. The list return type is there for operations which can return multiple results, for example:

$ python3
Python 3.10.7 (main, Nov 24 2022, 19:45:47) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyvips
>>> x = pyvips.Image.new_from_file("k2.jpg")
>>> x.max()
255.0
>>> x.max(x=True, y=True)
[255.0, {'x': 376, 'y': 2032}]

So max returns the maximum value by default, but you can ask for the position of the maximum too, in which case you'll get a list with the extra values, plus the default. This is documented here:

https://libvips.github.io/pyvips/intro.html#calling-libvips-operations
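If you want to handle both return shapes uniformly, something like this small helper might do. It is only a sketch based on the max() output shown above (a bare value, or a two-item list of value plus a dict of optional outputs); unpack_result is a made-up name, not part of pyvips, and it needs no pyvips import to run:

```python
def unpack_result(result):
    """Split an operation result into (value, optional_outputs).

    Assumes the shape shown for max(): either a bare value, or a
    two-item list of [value, {optional outputs}]. Anything else is
    passed through untouched with an empty dict.
    """
    if isinstance(result, list) and len(result) == 2 and isinstance(result[1], dict):
        return result[0], result[1]
    return result, {}


# The two results from the session above, written out as literals.
print(unpack_result(255.0))
value, extras = unpack_result([255.0, {"x": 376, "y": 2032}])
print(value, extras["x"], extras["y"])
```

With pyvips you would pass it the result of a call such as x.max(x=True, y=True) instead of the literals.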

libvips represents all multipage images as tall, vertical strips, with a page-height metadata item giving the size of each frame. This looks a bit odd when you first see it, but it is actually very good for performance, since you can usually avoid iterating over pages and therefore skip the pipeline setup/teardown cost.
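To make the strip layout concrete, here is a rough sketch of the arithmetic for cutting a strip back into pages. It is plain Python with no pyvips needed; page_boxes is a hypothetical helper, and it assumes every page has the same height, which is what a single page-height value implies:

```python
def page_boxes(width, height, page_height):
    """Return one (left, top, width, height) box per page of a tall strip.

    Assumes all pages share the same height, so the strip height must be
    a whole multiple of page_height.
    """
    if page_height <= 0 or height % page_height != 0:
        raise ValueError("strip height must be a whole multiple of page_height")
    return [(0, top, width, page_height) for top in range(0, height, page_height)]


# A strip of ten 1767x2500 pages is 1767 wide and 25000 tall.
boxes = page_boxes(1767, 25000, 2500)
print(len(boxes), boxes[1])
```

Each box is in the (left, top, width, height) order that Image.crop() takes, so with a real strip you could call strip.crop(*box) per page, using strip.width, strip.height and strip.get_page_height() as the inputs.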

The main pdfload docs have more detail:

https://www.libvips.org/API/current/VipsForeignSave.html#vips-pdfload

Jan 27 '23 14:01 jcupitt

There's a pagesplit convenience method which will cut a vertical image into pages for you:

https://libvips.github.io/pyvips/vimage.html?highlight=pagesp#pyvips.Image.pagesplit

So perhaps:

$ python3
Python 3.10.7 (main, Nov 24 2022, 19:45:47) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyvips
>>> pyvips.Image.thumbnail("nipguide.pdf[n=10]", 2500).pagesplit()
[<pyvips.Image 1767x2500 uchar, 4 bands, srgb>, <pyvips.Image 1767x2500 uchar, 4 bands, srgb>, <pyvips.Image 1767x2500 uchar, 4 bands, srgb>, <pyvips.Image 1767x2500 uchar, 4 bands, srgb>, <pyvips.Image 1767x2500 uchar, 4 bands, srgb>, <pyvips.Image 1767x2500 uchar, 4 bands, srgb>, <pyvips.Image 1767x2500 uchar, 4 bands, srgb>, <pyvips.Image 1767x2500 uchar, 4 bands, srgb>, <pyvips.Image 1767x2500 uchar, 4 bands, srgb>, <pyvips.Image 1767x2500 uchar, 4 bands, srgb>]

Jan 27 '23 14:01 jcupitt

Thank you for the quick reply! That makes a lot more sense now.

I had a look at get_page_height(), but I'm not sure it can deal with pages of different sizes. If differing page sizes are an issue, would the fastest way to generate thumbnails for a PDF be something like this?

pdf_content = pdf.read()
for page_num in range(num_pages):
    thumb = pyvips.Image.thumbnail_buffer(pdf_content, 2500, option_string=f"[page={page_num}]").pngsave_buffer()

I just tried this and it seems to consume much more memory than my previous hacky implementation using my own pdfium bindings and feeding the buffers into vips. Is there some caching that I might want to disable?

Jan 27 '23 14:01 JannKleen

You're right, if pages change size you need to loop over the pages yourself. The libvips multipage system is really for volumetric images and I suppose animations.

libvips keeps a cache of the last 100 operations and will reuse results if it can. It works like common sub-expression elimination. You can control the cache size policy with these methods:

https://libvips.github.io/pyvips/voperation.html?highlight=max#pyvips.voperation.cache_set_max
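As a rough mental model of why the cache size matters, here is a toy bounded operation cache in plain Python. To be clear, this is not how libvips implements it; OperationCache and render are invented for illustration, and the real cache has further limits that this ignores:

```python
from collections import OrderedDict


class OperationCache:
    """Toy model of a bounded operation cache (not libvips's real code)."""

    def __init__(self, max_size=100):
        self.max_size = max_size
        self._entries = OrderedDict()
        self.computed = 0  # how many times an operation actually ran

    def call(self, func, *args):
        key = (func.__name__, args)
        if key in self._entries:
            # Same operation with the same arguments: reuse the result.
            self._entries.move_to_end(key)
            return self._entries[key]
        result = func(*args)
        self.computed += 1
        self._entries[key] = result
        # Drop the oldest entries until we are back within the limit.
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)
        return result


def render(page):
    # Stand-in for an expensive operation such as rendering a page.
    return f"pixels for page {page}"


cached = OperationCache(max_size=100)
cached.call(render, 0)
cached.call(render, 0)  # reused, render() does not run again

uncached = OperationCache(max_size=0)
uncached.call(render, 0)
uncached.call(render, 0)  # nothing is kept, so render() runs again

print(cached.computed, uncached.computed)
```

In this model a limit of 0 means nothing is retained between calls, so results are never reused but no memory is held on to either.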

So maybe:

import pyvips

# disable the libvips operation cache
pyvips.cache_set_max(0)

# this will be quick -- it'll just read the header, since pixels are only generated on demand
image = pyvips.Image.new_from_buffer(pdf_content, "")
num_pages = image.get_n_pages()

for page_num in range(num_pages):
    thumb = pyvips.Image.thumbnail_buffer(pdf_content, 2500, option_string=f"[page={page_num}]")
    thumb_png = thumb.pngsave_buffer()

Jan 28 '23 16:01 jcupitt