Saving TIFF in chunks
What did you do?
I need to save huge image files (approx. 819200x460800 RGBA). This is too much for anyone's RAM, so I have to save it in chunks from disk. I start by saving the array to an HDF5 file. I then loop over the array in large steps and pass a slice of the array into .fromarray(). I then save this to a TIFF file. Each loop iteration appends more to the TIFF file, and so on.
What did you expect to happen?
It should create a TIFF image that is very large.
What actually happened?
It errored out while saving, giving me the error:
  File "c:\Users\Dexter\Desktop\workspace\test.py", line 68, in test
    a.save("ff.tiff", format="tiff", quality=1)
  File "C:\Users\Dexter\AppData\Local\Programs\Python\Python38\lib\site-packages\PIL\Image.py", line 2134, in save
    save_handler(self, fp, filename)
  File "C:\Users\Dexter\AppData\Local\Programs\Python\Python38\lib\site-packages\PIL\TiffImagePlugin.py", line 1629, in _save
    offset = ifd.save(fp)
  File "C:\Users\Dexter\AppData\Local\Programs\Python\Python38\lib\site-packages\PIL\TiffImagePlugin.py", line 865, in save
    result = self.tobytes(offset)
  File "C:\Users\Dexter\AppData\Local\Programs\Python\Python38\lib\site-packages\PIL\TiffImagePlugin.py", line 808, in tobytes
    data = self._write_dispatch[typ](self, *values)
  File "C:\Users\Dexter\AppData\Local\Programs\Python\Python38\lib\site-packages\PIL\TiffImagePlugin.py", line 644, in <lambda>
    b"".join(self._pack(fmt, value) for value in values)
  File "C:\Users\Dexter\AppData\Local\Programs\Python\Python38\lib\site-packages\PIL\TiffImagePlugin.py", line 644, in <genexpr>
    b"".join(self._pack(fmt, value) for value in values)
  File "C:\Users\Dexter\AppData\Local\Programs\Python\Python38\lib\site-packages\PIL\TiffImagePlugin.py", line 611, in _pack
    return struct.pack(self._endian + fmt, *values)
struct.error: argument out of range
What are your OS, Python and Pillow versions?
- OS: Windows 10 V. 1909
- Python: 3.8.1
- Pillow: Latest (7.1.2)
import h5py
import gc
from PIL import Image
import numpy as np

def test():
    f = h5py.File("test.hdf5", "w")
    dset = f.create_dataset("test", (100000,100000,4), dtype=np.uint8, compression='gzip')
    shp = dset.shape
    step = 25000
    for i in range(step, shp[0]+step, step):
        a = Image.fromarray(dset[:i])
        a.save("out.tiff", format="tiff", quality=80) #should error out here
        del a
        gc.collect()
    f.close()

test()
I mention the size 819200x460800 above - that's the maximum possible size. I also get the error at the size shown in the code (100000x100000). If the image is smaller (for instance, 10000x10000) with a step size of 1000, it does not error and produces an output image in about 4 seconds.
Trying to replicate this, on Windows, I get
ValueError: array is too big;
arr.size * arr.dtype.itemsize is larger than the maximum possible size.
on Ubuntu, I get
MemoryError: Unable to allocate array with shape (25000, 100000, 4) and data type uint8
and on macOS, it is just killed.
Is there some other code that you have run before the pasted code that prevents these errors?
Interesting. That isn't the full code, but I left out what I thought wouldn't be necessary. This is all of the code:
import h5py
import gc
from PIL import Image
import numpy as np

def factors(x):
    result = []
    i = 1
    while i*i <= x:
        if x % i == 0:
            result.append(i)
            if x//i != i:
                result.append(x//i)
        i += 1
    return result

def get_step(shp):
    fctrs = sorted(factors(shp[0]))[::-1]
    i = 0
    while True:
        try:
            a = np.zeros((fctrs[i], fctrs[i], 4))
            return fctrs[i]
        except MemoryError:
            pass
        i += 1

def test():
    f = h5py.File("test.hdf5", "w")
    dset = f.create_dataset("test", (100000,100000,4), dtype=np.uint8, compression='gzip')
    shp = dset.shape
    step = get_step(shp)
    for i in range(step, shp[0]+step, step):
        a = Image.fromarray(dset[:i])
        a.save("out.tiff", format="tiff", quality=80)
        del a
        gc.collect()
    f.close()

test()
What it does is get all the factors of the first number in the shape of the numpy array (i.e. if the shape is (100, 100, 4), it gets the factors of 100). It then loops through the factors from highest to lowest and finds the largest factor the numpy array can be split into without running out of memory. That way it (should) not exhaust memory, and the numpy array is split up evenly.
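As a quick illustration of that step-selection logic (a standalone sketch of the same factors function from the code above), for a shape like (100, 100, 4) the candidate step sizes, largest first, would be:

```python
def factors(x):
    # collect each divisor pair (i, x // i) by scanning up to sqrt(x)
    result = []
    i = 1
    while i * i <= x:
        if x % i == 0:
            result.append(i)
            if x // i != i:
                result.append(x // i)
        i += 1
    return result

# candidate step sizes for a (100, 100, 4) array, tried from largest down
print(sorted(factors(100))[::-1])  # [100, 50, 25, 20, 10, 5, 4, 2, 1]
```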
EDIT: On Ubuntu, your error mentions the shape of the array: it says the shape is (25000, 100000, 4). If I wanted it in chunks, then technically I would want it like (25000, 25000, 4).
I changed this line:
a = Image.fromarray(dset[:i])
to
a = Image.fromarray(dset[:i,:i])
Still got the same error sadly.
It occurred to me what the issue was while I was trying it out with CV2. In the for loop, if I print the values (step=25000), it goes:
25000
50000
75000
100000
When I'm slicing the array like arr[:i], that means slice from 0 to i. So once it gets to a value of, say, 75000, it's slicing between 0 and 75000, and that's too much for memory, so it errors.
With a smaller array you'd be fooled into thinking it works, because the whole thing fits in memory. The for loop is now this:
for i in range(0, shp[0], step):
    a = Image.fromarray(dset[i:i+step,i:i+step])
    a.save("out.tiff", format="tiff", quality=80)
    del a
    gc.collect()
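The difference between the two slicing patterns is easy to see on a small array (illustrative only, using a toy 100x100 array):

```python
import numpy as np

arr = np.zeros((100, 100, 4), dtype=np.uint8)
step = 25

# cumulative slice arr[:i]: the window grows with i, so memory use grows too
print(arr[:75].shape)  # (75, 100, 4)

# fixed-size chunk arr[i:i+step, i:i+step]: constant memory per iteration
print(arr[50:50 + step, 50:50 + step].shape)  # (25, 25, 4)
```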
What it does now, though, is overwrite the previous image data every time I save. In the source code of the save function, I see an if statement that checks for an "append" parameter. I tried including that, but it didn't work.
a.save("out.tiff", format="tiff", quality=80, params={"append", True})
Making it false also doesn't work.
The way to specify the "append" parameter that you have linked to is
a.save("out.tiff", format="tiff", quality=80, append=True)
However, when Pillow talks about appending, it's talking about adding another image. A second page of a PDF, for example. It may come as a surprise, but yes, TIFF can also contain multiple images.
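For instance, a minimal multi-page TIFF (each appended image is a whole separate page, not extra pixels for the first one) can be written with save_all and append_images; the solid-colour images here are just placeholders:

```python
from PIL import Image

# two separate 64x64 frames standing in for real pages
page1 = Image.new("RGB", (64, 64), "red")
page2 = Image.new("RGB", (64, 64), "blue")

# save_all + append_images writes a multi-page TIFF:
# each frame is an independent image inside the file
page1.save("pages.tiff", save_all=True, append_images=[page2])

with Image.open("pages.tiff") as im:
    print(im.n_frames)  # 2
```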
Pillow isn't currently set up to be able to help you batch process a single image and then combine the result without loading the complete image into memory. If you would like to be able to do that, this is a feature request.
Yeah, reading the documentation pointed that out to me - as did the fact that the output TIFF file is corrupted. I can only assume that every time it appends, it creates a new header (or something along those lines), so when anything tries to read it, it looks corrupted.
To have this feature as a feature request, would I need to open a new issue?
No, you don't need to create a new issue. I was just pointing that out.
When I run your initial code,
import h5py
import gc
from PIL import Image
import numpy as np

def test():
    f = h5py.File("test.hdf5", "w")
    dset = f.create_dataset("test", (100000,100000,4), dtype=np.uint8, compression='gzip')
    shp = dset.shape
    step = 25000
    for i in range(step, shp[0]+step, step):
        a = Image.fromarray(dset[:i])
        a.save("out.tiff", format="tiff", quality=80) #should error out here
        del a
        gc.collect()
    f.close()

test()
I get
Traceback (most recent call last):
  File "demo.py", line 20, in <module>
    test()
  File "demo.py", line 15, in test
    a.save("out.tiff", format="tiff", quality=80, strip_size=65536*65536) #should error out here
  File "PIL/Image.py", line 2440, in save
    save_handler(self, fp, filename)
  File "PIL/TiffImagePlugin.py", line 1857, in _save
    offset = ifd.save(fp)
  File "PIL/TiffImagePlugin.py", line 956, in save
    result = self.tobytes(offset)
  File "PIL/TiffImagePlugin.py", line 901, in tobytes
    data = self._write_dispatch[typ](self, *values)
  File "PIL/TiffImagePlugin.py", line 708, in <lambda>
    b"".join(self._pack(fmt, value) for value in values)
  File "PIL/TiffImagePlugin.py", line 708, in <genexpr>
    b"".join(self._pack(fmt, value) for value in values)
  File "PIL/TiffImagePlugin.py", line 675, in _pack
    return struct.pack(self._endian + fmt, *values)
struct.error: 'L' format requires 0 <= number <= 4294967295
Pillow is calculating StripByteCounts as 10000000000. The tag can be a SHORT or LONG, but the maximum for LONG looks like 4294967295, less than 10000000000. So the error is just because a limit of the TIFF specification has been hit.
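To put numbers on it, the very first slice, dset[:25000], already overflows a LONG (assuming 4 bytes per RGBA pixel):

```python
strip_byte_counts = 25000 * 100000 * 4  # rows x width x bytes per pixel
long_max = 2**32 - 1                    # largest value a TIFF LONG can hold

print(strip_byte_counts)             # 10000000000
print(strip_byte_counts > long_max)  # True
```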
You might be interested to know that because your image isn't being saved with any compression, the quality argument isn't having any effect.
However, because your image isn't using any compression, the saving process is simpler. I've created #7650 to allow saving TIFF images without compression in chunks.
With that PR, the following should work.
from PIL import Image, TiffImagePlugin

im = Image.open("Tests/images/hopper.png")
with open("out.tiff", "wb") as fp:
    for i, chunk in enumerate([
        im.crop((0, 0, 128, 32)),
        im.crop((0, 32, 128, 64)),
        im.crop((0, 64, 128, 96)),
        im.crop((0, 96, 128, 128)),
    ]):
        if i == 0:
            chunk.save(fp, "TIFF", tiffinfo={
                TiffImagePlugin.IMAGEWIDTH: 128,
                TiffImagePlugin.IMAGELENGTH: 128
            })
        else:
            fp.write(chunk.tobytes())
Pillow 10.2.0 has now been released with #7650.
@DexterHill0 is this working now?
Following through the comments above, here is your last version.
import h5py
from PIL import Image
import numpy as np
import gc

def factors(x):
    result = []
    i = 1
    while i*i <= x:
        if x % i == 0:
            result.append(i)
            if x//i != i:
                result.append(x//i)
        i += 1
    return result

def get_step(shp):
    fctrs = sorted(factors(shp[0]))[::-1]
    i = 0
    while True:
        try:
            a = np.zeros((fctrs[i], fctrs[i], 4))
            return fctrs[i]
        except MemoryError:
            pass
        i += 1

def test():
    f = h5py.File("test.hdf5", "w")
    dset = f.create_dataset("test", (100000,100000,4), dtype=np.uint8, compression='gzip')
    shp = dset.shape
    step = get_step(shp)
    for i in range(0, shp[0], step):
        a = Image.fromarray(dset[i:i+step,i:i+step])
        a.save("out.tiff", format="tiff", quality=80)
        del a
        gc.collect()
    f.close()

test()
If you run the following with Pillow 10.2.0 with a reduced final size, it runs successfully, saving TIFF in chunks.
import h5py
from PIL import Image, TiffImagePlugin
import numpy as np
import gc

def factors(x):
    result = []
    i = 1
    while i*i <= x:
        if x % i == 0:
            result.append(i)
            if x//i != i:
                result.append(x//i)
        i += 1
    return result

def get_step(shp):
    return 2500

def test():
    f = h5py.File("test.hdf5", "w")
    dset = f.create_dataset("test", (10000,1000,4), dtype=np.uint8, compression='gzip')
    shp = dset.shape
    step = get_step(shp)
    with open("out.tiff", "wb") as fp:
        for i in range(0, shp[0], step):
            print(i)
            a = Image.fromarray(dset[i:i+step,i:i+step])
            if i == 0:
                a.save(fp, format="tiff", quality=80, tiffinfo={
                    TiffImagePlugin.IMAGEWIDTH: 10000,
                    TiffImagePlugin.IMAGELENGTH: 1000
                })
            else:
                fp.write(a.tobytes())
            del a
            gc.collect()
    f.close()

test()
https://www.itu.int/itudoc/itu-t/com16/tiff-fx/docs/tiff6.pdf
The largest possible TIFF file is 2**32 bytes in length.
This means that (100000, 100000, 4) cannot be saved as an uncompressed TIFF file.
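A back-of-the-envelope check of the uncompressed size against that limit:

```python
width = height = 100000
bytes_per_pixel = 4  # RGBA, one uint8 byte per channel

uncompressed = width * height * bytes_per_pixel
print(uncompressed)          # 40000000000, i.e. ~40 GB
print(uncompressed > 2**32)  # True: far beyond the ~4.3 GB TIFF limit
```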