
Increase buffer size to speed up `Image.tobytes()`

Open · lgeiger opened this pull request 3 months ago · 15 comments

Image.tobytes() is used by __array_interface__ when images are passed to numpy via np.asarray(...). Converting PIL images to numpy arrays is very common; for example, ML libraries like vllm and many PyTorch dataloaders rely on this path.

For large images this can be quite slow and can become a bottleneck since .tobytes() encodes the data in fixed chunks which need to be joined afterwards: https://github.com/python-pillow/Pillow/blob/d42e537efeb1bd11cd9df1db1c7d7a6dc529d9e2/src/PIL/Image.py#L798-L808
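For context, the relevant part of Image.tobytes() looks roughly like the following. This is a simplified sketch paraphrased from the linked lines, not the verbatim source; tobytes_sketch and the raw-encoder arguments are illustrative and use Pillow's private _getencoder helper:

from PIL import Image

def tobytes_sketch(img):
    # Simplified sketch of Image.tobytes() with the default "raw" encoder.
    img.load()
    e = Image._getencoder(img.mode, "raw", img.mode)
    e.setimage(img.im)

    # Fixed buffer size: 64 KiB, or one row's worth of pixels if that is larger.
    bufsize = max(65536, img.size[0] * 4)

    output = []
    while True:
        bytes_consumed, errcode, data = e.encode(bufsize)  # encode the next chunk
        output.append(data)
        if errcode:
            break
    return b"".join(output)  # concatenate all chunks into one bytes object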

This PR increases the buffer size to match the image size when the default raw encoder is used, instead of relying on a fixed value. In most cases this allows the image to be encoded in a single chunk, which speeds up encoding of large images by more than 2x:

| image size | main | this PR | main / this PR |
| --- | --- | --- | --- |
| 128x128 | 4.54 μs | 4.65 μs | 0.98 |
| 256x256 | 18.2 μs | 13.2 μs | 1.38 |
| 512x512 | 60.2 μs | 46.1 μs | 1.31 |
| 1024x1024 | 382 μs | 245 μs | 1.56 |
| 2048x2048 | 1.97 ms | 1.16 ms | 1.70 |
| 4096x4096 | 10.6 ms | 5.49 ms | 1.93 |
| 8192x8192 | 54.3 ms | 22.8 ms | 2.38 |
| 16384x16384 | 230 ms | 92.3 ms | 2.49 |
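In code terms, the change described above amounts to something like the following. This is a hedged sketch of the idea rather than the actual diff; the variable names are illustrative and the per-pixel size assumes 8-bit bands:

from PIL import Image

img = Image.new("RGB", (4096, 4096))

width, height = img.size
bands = len(img.getbands())              # e.g. 3 for RGB, 4 for RGBA
image_bytes = width * height * bands     # illustrative; assumes 8-bit bands

# main: fixed chunk size, independent of the image
bufsize_main = max(65536, width * 4)

# this PR (sketched): buffer large enough for the whole image in one chunk
bufsize_pr = max(65536, width * 4, image_bytes)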

Benchmarked with the following IPython script:

import numpy as np
from PIL import Image

for size in (128, 256, 512, 1024, 2048, 4096, 8192, 16384):
    img = np.random.randint(0, 256, size=(size, size, 3), dtype=np.uint8)
    img = Image.fromarray(img)

    print(f"{size}x{size}")
    %timeit img.tobytes()

lgeiger avatar Sep 23 '25 11:09 lgeiger

Just to mention for anyone else reading this - individual users can already tune this behaviour themselves by setting MAXBLOCK:

from PIL import ImageFile
ImageFile.MAXBLOCK = 65536 * 4  # default is 65536 (64 KiB)

radarhere avatar Sep 23 '25 13:09 radarhere

Have you tried using the Arrow interface, which is zero-copy?

wiredfool avatar Sep 23 '25 16:09 wiredfool

> Have you tried using the Arrow interface, which is zero-copy?

My use case still requires a numpy array as output (or at least access to the raw bytes). How would I do this from a user perspective with the Arrow interface? Currently I'm just using np.asarray(img) which calls .tobytes() under the hood.

lgeiger avatar Sep 23 '25 18:09 lgeiger

I would guess

import numpy as np
import pyarrow as pa
from PIL import Image
img = Image.new("RGB", (12, 12))
np.array(pa.array(img))

but it doesn't seem faster to me.

radarhere avatar Sep 30 '25 13:09 radarhere

np.array always makes a copy, so we wouldn't really gain much. pa.array(img).to_numpy(zero_copy_only=True) doesn't seem to work in my case, so zero_copy_only=False or np.array() would be needed, which seems to be very slow.

Here's a quick benchmark with this PR:

import numpy as np
import pyarrow as pa
from PIL import Image

rng = np.random.default_rng(42)

for size in (128, 256, 512, 1024, 2048, 4096, 8192, 16384):
    img = rng.integers(0, 256, size=(size, size, 3), dtype=np.uint8)
    img = Image.fromarray(img)

    print(f"{size}x{size}")
    %timeit img.tobytes()
    %timeit np.asarray(img)
    %timeit pa.array(img)
    %timeit pa.array(img).to_numpy(zero_copy_only=False)
128x128
4.61 μs ± 39 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
5.53 μs ± 31.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
1.28 μs ± 4.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
1.4 ms ± 3.29 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
256x256
13.1 μs ± 86.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
13.9 μs ± 19.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
1.3 μs ± 22.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
5.86 ms ± 14.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
512x512
45.8 μs ± 77.9 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
46.8 μs ± 95.1 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
1.27 μs ± 9.67 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
24 ms ± 107 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1024x1024
246 μs ± 1.82 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
263 μs ± 6.77 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.28 μs ± 5.84 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
97.8 ms ± 215 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2048x2048
1.15 ms ± 13.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.16 ms ± 15 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.28 μs ± 6.15 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
395 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

At 4096x4096, pyarrow also seems to fail with ValueError: Image is in multiple array blocks, use imaging_new_block for zero copy (which is why the output above stops at 2048x2048).

So in summary: creating pyarrow arrays is much faster, but converting from pyarrow to numpy is very slow.

lgeiger avatar Sep 30 '25 14:09 lgeiger

> pa.array(img).to_numpy(zero_copy_only=True) doesn't seem to work in my case

Just in case it is something interesting that we should consider in the future, could you explain this slightly more?

radarhere avatar Oct 01 '25 13:10 radarhere

> pa.array(img).to_numpy(zero_copy_only=True) doesn't seem to work in my case

> Just in case it is something interesting that we should consider in the future, could you explain this slightly more?

The following code would raise ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True

import numpy as np
import pyarrow as pa
from PIL import Image

img = Image.fromarray(np.random.randint(0, 255, size=(128, 128, 3), dtype=np.uint8))
arr = pa.array(img)
np_arr = arr.to_numpy()  # zero_copy_only defaults to True, so this raises

I haven't used pyarrow before, but judging from the docs it raises because "the conversion to a numpy array would require copying the underlying data (e.g. in presence of nulls, or for non-primitive types)" and zero_copy_only was True.

lgeiger avatar Oct 01 '25 13:10 lgeiger

> Just to mention for anyone else reading this - individual users can already tune this behaviour themselves by setting MAXBLOCK.

The main benefit of this PR is that, since the buffer size depends on the image size, users still get low memory usage for small images but benefit from a larger buffer for large images.

Let me know if you have any concerns that would prevent merging of the PR.

lgeiger avatar Oct 03 '25 11:10 lgeiger

@radarhere Any updates on when/whether this PR could be merged? Or are there any additional benchmarks that you would like me to run?

lgeiger avatar Oct 09 '25 16:10 lgeiger

@lgeiger We'll probably need @wiredfool to look closer too.

aclark4life avatar Oct 10 '25 00:10 aclark4life

The original point of this particular bit of code was to have predictable memory usage when running tobytes: in this case, either Image.MAX_BLOCK or, at the very least, the size of one row (the shuffler is row-based, and if the buffer is smaller than a row, no progress can be made).

So, where previously we needed 2xImageMemory + 64k, now we need 3xImageMemory.

For smaller images, it's not a problem, but for larger images this may cause memory pressure where we didn't have it before. I'd consider that a regression.

One alternative here is to change the calculation so that it's the min of max(MAX_BLOCK, row_size) and the image size, at which point MAX_BLOCK can be boosted without allocating excessive memory in the small-image case.
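A minimal sketch of that calculation (hypothetical helper; row_size and image_size stand for the byte counts of one row and of the whole image):

from PIL import ImageFile

def tobytes_bufsize(row_size, image_size):
    # At least MAXBLOCK, at least one full row (the encoder works row by row),
    # but never more than the whole image.
    return min(max(ImageFile.MAXBLOCK, row_size), image_size)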

There may be other places where MAX_BLOCK shouldn't be large, and they'd need similar checks.

wiredfool avatar Oct 10 '25 10:10 wiredfool

@wiredfool Thanks for taking a look. I agree increasing MAXBLOCK globally would be a problem since it increases memory usage, but for tobytes() I'm not sure this is actually the case.

The way I understand the code is the following: all chunks are appended to a list, which is then joined; that join is what causes the 2x ImageMemory usage on the Python side that you mentioned above. https://github.com/python-pillow/Pillow/blob/6d6f0496d9bc9b044d2ce8542bcc1c0b31dbf845/src/PIL/Image.py#L798-L808

I don't think the max memory usage would include the additional 64k buffer, but I haven't looked at what the C code is actually doing so I might be wrong.

In any case, with this PR the output list consists of only a single item (assuming the buffer size estimate was large enough), which avoids allocating a new bytes object during the join. So the memory usage of the Python code is actually halved. I thought about changing the code to return data directly, but I wasn't sure whether I'm missing any edge cases where the actual size would be larger than the buffer size estimate added here.

I double-checked this with a memory profile and viewed the memory usage with memray summary <filename>:

import memray
import numpy as np
from PIL import Image

rng = np.random.default_rng(42)

def get_image(size):
    return Image.fromarray(rng.integers(0, 256, size=(size, size, 3), dtype=np.uint8))

for size in (512, 1024, 2048, 4096, 8192, 16384):
    img = get_image(size)
    with memray.Tracker(f"pr_{size}.bin"):
        img.tobytes()

And the results show that this PR halves the memory usage which matches my theory from above:

| image size | memory (main) | memory (this PR) | allocations (main) | allocations (this PR) |
| --- | --- | --- | --- | --- |
| 512x512 | 1.576MB | 788.001kB | 16 | 17 |
| 1024x1024 | 6.300MB | 3.149MB | 52 | 2 |
| 2048x2048 | 25.197MB | 12.589MB | 209 | 2 |
| 4096x4096 | 100.775MB | 50.344MB | 824 | 2 |
| 8192x8192 | 403.174MB | 201.351MB | 4100 | 2 |
| 16384x16384 | 1.613GB | 805.356MB | 16388 | 2 |

lgeiger avatar Oct 10 '25 13:10 lgeiger

Ok, I was wrong here: the return b"".join(output) looks like it's just stringing together all the buffers with pointer magic. So the advantage you've got here is that you're basically turning that into a no-op and making bigger allocations earlier. OTOH, with all the reallocs, I bet there's buffer trimming going on in the C layer to make the chunks joinable without doing a slice on the buffer in Python space.

What might make sense is an encode_tobytes, similar to the other encode_to* functions in encode.c, that essentially does all the looping in C space and returns a single PyBytes. Not sure it would help completely, but it would take all that logic out of Python and move it down.

Aside -- I don't think the memray is measuring what you think it is -- I get different results when I replace

rng = np.random.default_rng(42)

def get_image(size):
    return Image.fromarray(rng.integers(0, 256, size=(size, size, 3), dtype=np.uint8))

with

def get_image(size):
    return Image.new('RGBA', (size,size), (0,1,2,3))

But I don't think it changes the conclusion there.

wiredfool avatar Oct 11 '25 07:10 wiredfool

> What might make sense is an encode_tobytes, similar to the other encode_to* functions in encode.c, that essentially does all the looping in C space and returns a single PyBytes. Not sure it would help completely, but it would take all that logic out of Python and move it down.

What would the benefit of this be compared to using a bufsize equal to the image size in encode? The best-case memory usage of .tobytes() will always be 1x ImageMemory, which is already very close to the measurements above. Or are there benefits to doing the encoding in small chunks that I'm missing here? I actually tried something like this initially, but I didn't really see any performance benefit over this solution, which requires much less code.

> Aside -- I don't think the memray is measuring what you think it is -- I get different results when I replace

I'm not an expert with memray, could you elaborate? In your code example you're using a 4-channel RGBA image, whereas I tested a 3-channel RGB image from numpy. So if the memory usage in the Image.new('RGBA', (size,size), (0,1,2,3)) case is 4/3 higher, that is expected. Or are you talking about other differences?

lgeiger avatar Oct 11 '25 20:10 lgeiger

@wiredfool Friendly ping, do you mind having a look at my comments from above?

lgeiger avatar Nov 02 '25 16:11 lgeiger