Cannot create image from buffered IO stream

Open scossu opened this issue 5 years ago • 15 comments

This works:

import pyvips
with open('/tmp/test.tiff', 'rb') as fh: 
    img = pyvips.Image.new_from_buffer(fh.read(), "", access="sequential")

This:

import pyvips
with open('/tmp/test.tiff', 'rb') as fh: 
    img = pyvips.Image.new_from_buffer(fh, "", access="sequential")

raises:

TypeError: object of type '_io.BufferedReader' has no len()

The pyvips documentation says:

The memory object can be a string or buffer.

Am I using the wrong type of buffer? I also tried wrapping a byte string in a BytesIO object, with the same result. It seems like calling len() on any buffered stream will fail anyway.

The same happens with tiffload_buffer(), although there the documentation specifies only str for the "buffer" format.

Thanks for the help.

scossu avatar Mar 08 '19 21:03 scossu

Hi @scossu,

"buffer" is a very overloaded term in Python :( pyvips uses it to mean the buffer type:

https://docs.python.org/3/c-api/buffer.html

i.e. it's a chunk of memory with a start and a length. So: numpy arrays, Python strings, byte arrays, etc.
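
One way to check whether an object is a buffer in this sense is to try wrapping it in a memoryview, which only accepts objects that implement the buffer protocol. This is a minimal sketch; the helper name supports_buffer_protocol is made up for illustration and is not part of pyvips.

```python
import io


def supports_buffer_protocol(obj):
    """Return True if obj exposes its memory through the Python buffer protocol."""
    try:
        memoryview(obj)
    except TypeError:
        return False
    return True


print(supports_buffer_protocol(b"raw bytes"))              # True: bytes
print(supports_buffer_protocol(bytearray(b"raw bytes")))   # True: bytearray
print(supports_buffer_protocol(io.BytesIO(b"raw bytes")))  # False: file-like object
```

File objects and BytesIO fail this check, which is why fh.read() (or BytesIO.getvalue()) is needed before calling new_from_buffer. Note that under Python 3 a str fails the check as well.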

jcupitt avatar Mar 09 '19 11:03 jcupitt

I see. In that case:

  1. Shouldn't the documented data type be bytes, array, bytearray, etc. rather than str?
  2. Since I have to load the whole data set in memory to instantiate the Image object, doesn't that defeat the advantage of vips' internal streaming of the image data?

scossu avatar Mar 09 '19 16:03 scossu

Oh, and 3. Since most Python developers are more familiar with the Python (interpreter-level) terminology than the C API one, adjusting the documentation to the former would help with clarity. Thanks!

scossu avatar Mar 09 '19 16:03 scossu

Sure, it sounds as if it's confusing. What would you recommend as a clearer explanation?

At the moment, it's:

            data (str, buffer): The memory object to load the image from.

How about:

            data (bytes, bytearray, str, buffer): The memory object to load the image from.

On 2., the buffer object is the image in compressed form. So, for example, a 10k x 10k JPG image might be 15 MB as passed to new_from_buffer, but 300 MB as a huge, uncompressed image array. libvips can process the image without ever having the entire thing uncompressed at the same time.
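
The 300 MB figure can be reproduced with simple arithmetic. The sketch below assumes an 8-bit, three-band RGB image; the comment does not state the pixel format, so those defaults are an assumption, and the function name is illustrative only.

```python
def uncompressed_size_bytes(width, height, bands=3, bytes_per_sample=1):
    """Size of the fully decoded pixel array, in bytes."""
    return width * height * bands * bytes_per_sample


size = uncompressed_size_bytes(10_000, 10_000)
print(size / 1_000_000, "MB")  # 300.0 MB
```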

There has been quite a bit of talk of adding true streaming (eg. from a socket), and there's a branch that adds this, but it's never been merged for various reasons. We should probably try again.

https://github.com/lovell/sharp/issues/30#issuecomment-46960443

jcupitt avatar Mar 09 '19 17:03 jcupitt

Though the real answer is that data can be anything that implements the Python buffer protocol. I'm not sure how best to express that.

jcupitt avatar Mar 09 '19 17:03 jcupitt

How about:

            data (bytes, bytearray, str, buffer): The memory object to load the image from.

I am also not aware of a super-type that includes all the types that the function accepts. In any case, I am not sure str would be appropriate because it's character encoded, not binary data. If array is supported that would be a convenient data type to pass and mention in the docs.

scossu avatar Mar 09 '19 17:03 scossu

So, for example, a 10k x 10k JPG image might be 15 MB as passed to new_from_buffer, but 300 MB as a huge, uncompressed image array. libvips can process the image without ever having the entire thing uncompressed at the same time.

Makes a lot of sense. Thanks for the clarification.

There has been quite a bit of talk of adding true streaming (eg. from a socket), and there's a branch that adds this, but it's never been merged for various reasons. We should probably try again.

That would be really nice. In my application, for example, I am getting an image using requests.get which can be loaded by chunks via iter_content. If I could pass a similar object to pyvips, that would be much more efficient. I believe it's a pretty common use case.
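
Until something like that exists, the chunks have to be joined into a single buffer before pyvips sees them. A rough sketch of that workaround follows; collect_chunks and image_from_chunks are hypothetical helper names, and the list of byte strings stands in for what requests would yield from response.iter_content(chunk_size=...).

```python
def collect_chunks(chunks):
    """Join an iterable of bytes chunks into one bytes object."""
    data = bytearray()
    for chunk in chunks:
        data.extend(chunk)
    return bytes(data)


def image_from_chunks(chunks):
    """Build a pyvips image from chunked data (requires pyvips to be installed)."""
    import pyvips  # imported here so collect_chunks can be used without pyvips

    return pyvips.Image.new_from_buffer(collect_chunks(chunks), "")


# Stand-in for the chunks a streamed HTTP response would produce.
fake_chunks = [b"abc", b"def", b"gh"]
print(len(collect_chunks(fake_chunks)))  # 8
```

This still holds the whole compressed file in memory, so it only works around the API and does not give true streaming.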

scossu avatar Mar 09 '19 17:03 scossu

Yes, it should be possible to support something like that, for jpg and png images at least.

I tried clarifying the docs.

jcupitt avatar Mar 09 '19 17:03 jcupitt

Thank you!

scossu avatar Mar 16 '19 17:03 scossu

Yes, it should be possible to support something like that, for jpg and png images at least.

Not TIFF?

scossu avatar Mar 16 '19 17:03 scossu

TIFF needs random access to read, so you can only decode it if you have the whole of the tiff image file there. You can read from memory or a file, but you can't read tiff from a socket (for example).

JPG and PNG can only be read from a socket if you avoid the progressive (interlaced) modes.

GIF can't really be read from a socket either: you can only find out how many pages it has by scanning the whole file, unfortunately.

It might be possible to stream webp and heic, though I've not looked at them carefully enough.

jcupitt avatar Mar 17 '19 11:03 jcupitt

I just ran into this issue too. After reading a bunch of documentation I tried to be smart by passing an mmap object to .new_from_buffer. The mmap documentation says you can use mmap objects in most places where bytearray is expected.

What I'm trying to do is read an image from a stream into a tempfile and then create a pyvips image from it.

import mmap
import tempfile
import pyvips

stream = ...
fd = tempfile.SpooledTemporaryFile(max_size=1000)
while True:
  chunk = stream.read(10)
  if not chunk:
    break
  fd.write(chunk)
fd.seek(0)
buffer = mmap.mmap(fd.fileno(), 0, prot=mmap.PROT_READ)
image = pyvips.Image.new_from_buffer(buffer, options="")

Result is a TypeError: initializer for ctype 'void *' must be a cdata pointer, not mmap.mmap

Seeing that the conversation here has moved towards reading an image from a socket, I still wonder whether there is a way to create a pyvips image from a file-like object or a memory map.

stj avatar Apr 10 '19 18:04 stj

You can use read() to turn a mmap object into a buffer. This works for me:

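import sys
import mmap
import pyvips

# Open in binary mode: the file holds image data, not text.
fd = open(sys.argv[1], "rb")
buf = mmap.mmap(fd.fileno(), 0, prot=mmap.PROT_READ)
# read() returns the mapped contents as a bytes object, which new_from_buffer accepts.
image = pyvips.Image.new_from_buffer(buf.read(), "")
print("width = {}, height = {}".format(image.width, image.height))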

jcupitt avatar Apr 10 '19 19:04 jcupitt

My issue is that this gives me the full bytes of the image in memory, which isn't memory efficient. My goal is to avoid OOM issues if the stream sends a large amount of data. That's why I write the stream data into a SpooledTemporaryFile. I admit I'm making assumptions here about whether the data is handled in or out of memory, depending on how the pyvips image is created.

For now I'll try to write the data into a NamedTemporaryFile, if it is over a certain size, and then use .new_from_file. If it is under the threshold I keep the data in memory and use .new_from_buffer.
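
A rough sketch of that plan is below. The helper names spool_stream and open_image, the 10 MB default threshold and the chunk size are all made up for illustration; only new_from_buffer and new_from_file are real pyvips calls.

```python
import io
import os
import tempfile


def spool_stream(stream, threshold=10 * 1024 * 1024, chunk_size=64 * 1024):
    """Read stream to the end; return ("buffer", bytes) or ("file", path)."""
    data = bytearray()
    out = None
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        if out is None:
            data.extend(chunk)
            if len(data) > threshold:
                # Over the threshold: move what we have so far to a named file.
                out = tempfile.NamedTemporaryFile(delete=False)
                out.write(data)
                data = bytearray()
        else:
            out.write(chunk)
    if out is None:
        return "buffer", bytes(data)
    out.close()
    return "file", out.name


def open_image(stream):
    """Choose new_from_buffer or new_from_file by size (requires pyvips)."""
    import pyvips

    kind, payload = spool_stream(stream)
    if kind == "buffer":
        return pyvips.Image.new_from_buffer(payload, "")
    return pyvips.Image.new_from_file(payload, access="sequential")


kind, payload = spool_stream(io.BytesIO(b"x" * 100), threshold=10)
print(kind)  # file
os.remove(payload)
```

With delete=False the caller is responsible for removing the temporary file once the image has been processed.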

stj avatar Apr 10 '19 19:04 stj

I think it would be memory efficient: read() on mmap should just adapt the pointer, not do a malloc and copy. I am just guessing though -- you'd need to experiment, or perhaps read the source.
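
A quick experiment along those lines, using only the standard library: in CPython, mmap.read() returns a new bytes object, so the data is copied at that point, whereas a memoryview over the mapping refers to the mapped pages without copying. Whether pyvips accepts such a memoryview is not confirmed in this thread and would need testing.

```python
import mmap
import os
import tempfile

# Write a small file to map (a stand-in for real image data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    path = tmp.name

with open(path, "rb") as fd:
    mapped = mmap.mmap(fd.fileno(), 0, access=mmap.ACCESS_READ)
    copied = mapped.read()     # a new bytes object holding a copy of the data
    view = memoryview(mapped)  # a view onto the mapping, no copy made
    print(type(copied).__name__, view.nbytes)  # bytes 10
    view.release()             # the view must be released before the mmap can close
    mapped.close()

os.remove(path)
```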

jcupitt avatar Apr 10 '19 21:04 jcupitt