pyvips
Creating enormous images (BIGTIFF), use intermediate images or write straight to output?
What am I trying to do
Make a huge image from Gaia's Milky Way point cloud data.
tl;dr
Simplified: I'm able to create x by y sized tiles in memory quickly. I want to join as many of these as possible into the largest possible resulting image. Should I write to intermediary files and, if so, should they be small or large, and in what format? If I can keep the tiles in memory and keep writing to a single large resulting file, how would I go about that?
Issue I'm running into
I'm trying to optimize the intermediate file format and size, the reads/writes to disc, and the general process, so I can build the largest result image as fast and efficiently as possible. Sounds like a job for libvips :)
My best attempt
Right now I've been making very large intermediate images that are 65500x4096. I convert a numpy array to a vips image and then call tiffsave with LZW compression. It takes ~2.5s and uses 2.5GB of memory. The pro here is the tiny output file size, only 2MB; the cons are high memory usage and a relatively slow export time (maybe that's down to libtiff).
The problem is that I'm trying to do this for thousands of intermediate images :)
I then use arrayjoin which has crazy low memory usage (thank you) and write to a new bigtiff which takes ~30s for 5 files.
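For reference, the save step described above might look roughly like the sketch below. It is not the code from this thread: the `strip_path` and `save_strip` helpers, the zero-padded `strip-NNNNN.tif` naming, and the dtype table are assumptions made for illustration. It only uses `pyvips.Image.new_from_memory` and `tiffsave`.

```python
# Map NumPy dtype names to libvips band formats (only the ones this sketch handles).
DTYPE_TO_VIPS = {
    "uint8": "uchar",
    "uint16": "ushort",
    "float32": "float",
}


def strip_path(index, prefix="strip-", digits=5):
    """Hypothetical naming scheme: zero-padded so a lexical sort gives strip order."""
    return f"{prefix}{index:0{digits}d}.tif"


def save_strip(array, index):
    """Write one rendered strip (a 2D or 3D NumPy array) as an LZW-compressed TIFF."""
    import pyvips  # imported here so the helpers above work without libvips installed

    height, width = array.shape[:2]
    bands = array.shape[2] if array.ndim == 3 else 1
    image = pyvips.Image.new_from_memory(
        array.tobytes(),  # C-order copy of the pixel data
        width,
        height,
        bands,
        DTYPE_TO_VIPS[str(array.dtype)],
    )
    path = strip_path(index)
    image.tiffsave(path, compression="lzw")
    return path
```

Calling `save_strip(strip, 7)` on a `uint8` array would write `strip-00007.tif` in the current directory.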
My random set of questions
- If I compress my intermediary files and then arrayjoin them later, will they be decoded and re-encoded? Can I join tiff files that are already compressed without reprocessing (re-compressing) them? Ex: the resulting tiff is a set of tiles of my already-compressed intermediary images
- I've tried using vips format for intermediary images, which speeds up the joins later, but they are too big (2GB vs 2MB tiff) and the write time is pretty slow. What size and what format intermediary files do you think would be best?
- Do I even need intermediate images? Could I just keep writing to my final resulting image since I can generate portions (tiles) of the image quickly in memory? How would I do this?
Hi @cookmscott, this sounds very cool.
We keep meaning to add a thing to make it possible to generate tiles on demand with eg. numpy, but it hasn't happened yet :( So for now, you need to make the tiles on demand with pyvips, or keep rendered tiles in memory, or keep rendered tiles on disc.
It sounds like rendering the Gaia dataset in pyvips isn't a great idea, so I think saving the rendered tiles to intermediates is a good plan. I guess these things are lots of zeros with a few set pixels (is that right?) so LZW is a good choice. You could experiment with zstd as well I suppose.
I would render a set of regular non-overlapping tiles to a few thousand compressed TIFFs, then assemble with arrayjoin and save as a pyramidal tiled tiff or deepzoom image.
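As a rough sketch of that plan in pyvips, something like the following could assemble a regular grid of tile files. The `tile_paths` and `assemble` helpers and the `tile_RRRR_CCCC.tif` naming are hypothetical, and `bigtiff=True` is an assumption based on the BIGTIFF goal in the title. `arrayjoin` lays images out left to right, top to bottom, so the file list has to be in row-major order.

```python
def tile_paths(columns, rows, pattern="tile_{row:04d}_{col:04d}.tif"):
    """List tile filenames in row-major order, the order arrayjoin expects."""
    return [
        pattern.format(row=row, col=col)
        for row in range(rows)
        for col in range(columns)
    ]


def assemble(columns, rows, output="final.tif"):
    """Join a columns x rows grid of tiles and save a pyramidal tiled TIFF."""
    import pyvips  # imported here so tile_paths() works without libvips installed

    # Sequential access lets libvips stream each tile instead of loading it whole.
    tiles = [
        pyvips.Image.new_from_file(path, access="sequential")
        for path in tile_paths(columns, rows)
    ]
    mosaic = pyvips.Image.arrayjoin(tiles, across=columns)
    mosaic.tiffsave(
        output,
        tile=True,
        pyramid=True,
        compression="lzw",
        bigtiff=True,  # assumption: the result will exceed the 4GB classic TIFF limit
    )
```

For a deepzoom pyramid instead, the last call could be swapped for `mosaic.dzsave("final")`.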
Hi @jcupitt, thanks for the quick response. Yes, it's a lot of zeros with a few pixels (stars / celestial objects) interspersed. The images are pretty homogeneous, so LZW does a great job compressing.
For your last comment, if I create my own tiles and save to disc as compressed tiffs, would arrayjoin have to uncompress + decode + join + re-write to disc or is it smart in the sense that if I specify my tile size to be the same as the input images, it will just plop them in (for lack of a better term)? In other words, can I tell arrayjoin to "plop contents of this file as this tile and move on" so it doesn't have to run through what I previously mentioned?
I'm still catching up on all the lingo btw!
It'll decompress and recompress, but it's smart enough to do this on several threads, so as long as you have a few cores it should be fast enough.
libtiff doesn't (in general) let you copy compressed tiles directly, since they are often split into common parts (eg. Huffman tables) and unique parts (lists of coefficients).
You can do the arrayjoin on the command-line, eg. perhaps:
vips arrayjoin "$(echo strip-*.tif)" "final.tif[tile,compression=lzw,pyramid]" --across 1
That would join a huge set of strips vertically, for example. You might need to add a sort in there, depending on your strip naming scheme.
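If the strip names are not zero-padded, a plain glob puts `strip-10.tif` before `strip-9.tif`. A minimal pyvips sketch of the same vertical join with a numeric-aware sort might look like this. The `natural_key` and `join_strips` helpers are hypothetical names, and the `strip-*.tif` pattern is taken from the command above.

```python
import re


def natural_key(name):
    """Split digit runs from text so 'strip-10.tif' sorts after 'strip-9.tif'."""
    return [
        int(part) if part.isdigit() else part
        for part in re.split(r"(\d+)", name)
    ]


def join_strips(paths, output="final.tif"):
    """Stack strips top to bottom in numeric order and save a pyramidal tiled TIFF."""
    import pyvips  # imported here so natural_key() works without libvips installed

    strips = [
        pyvips.Image.new_from_file(path, access="sequential")
        for path in sorted(paths, key=natural_key)
    ]
    # across=1 puts one image per row, i.e. a vertical join.
    joined = pyvips.Image.arrayjoin(strips, across=1)
    joined.tiffsave(output, tile=True, pyramid=True, compression="lzw")
```

It could be called as `join_strips(glob.glob("strip-*.tif"))` after `import glob`.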