pyvips
pyvips copied to clipboard
Most efficient region reading from tiled multi-resolution tiff images
Hi!
This is not really an issue, but more of advice seeking. The context is that I am working with histopathological images (known as whole slide images) that have a typical size of 100k x 100k pixels, and have therefore dedicated libraries to perform reading and writing. The most popular example is openslide, and then there is also asap. However, following a recent acquisition of an NVME disk with >3gb/s read speeds, I've noticed zero to negligible improvements in reading regions from such images, using these libraries. Therefore, I had to delve deeper to figure out the reasons. And this has led me to lower-level libraries, i.e. libvips.
Reading regions from these images is the general bottleneck of my methodology (as it happens many times per second) and as such, I need to speed it up as much as possible, up until another (probably CPU-bound) bottleneck appears. What i've done so far is that I benchmarked the "Image.new_from_file" method by reading the same large tiff image from three different disks, an HDD, an SSD, and then the two new NVMes.
So here are the questions: - Is this the most efficient way to do the reading, i.e. do we really need to read the whole image before reading the region? - Since these are pyramidal tiffs, is there a way to access a specific level/resolution? - Generally, do you think these times are appropriate for disks with the following read speeds and given that the image has a size of 2.4 GB (2,402,766,863 bytes)?
HDD: ~100MB/s SSD: ~500MB/s NVMe: ~2.8 GB/s
Hello @nd26,
WSI files are usually compressed with jpeg2000, so almost all your runtime will be spent inside your jp2k decoder. The speed of the disc won't really have any effect.
You'd need to share your benchmark for me to comment on the speed. Is this for machine learning? pyvips 2.1.6 / libvips 8.8 have a new region feature which can give a nice speedup for this type of application (generating 1000s of eg. 32x32 pixel areas). If it's for image display, libvips has a fast path for this already, have a look at vipsdisp. Yes, you can read out levels from the pyramid, have a look at the vips_openslideload()
docs.
https://github.com/jcupitt/vipsdisp
https://libvips.github.io/libvips/API/current/VipsForeignSave.html#vips-openslideload
OpenSlide uses OpenJPEG for decode, so you could investigate ways of speeding that up. There's been some talk of GPU acceleration for decode, and there are patches for AVX as well. They should help.
In terms of hardware, a huge CPU with many cores will help too.
For a simple benchmark, I'd try:
$ time vips avg CMU-1.svs
233.716228
real 0m38.827s
user 1m22.160s
sys 0m2.017s
ie. find the average pixel value. That should max your CPU. If it doesn't, there's some more speed to squeeze from your IO system.
Hi @jcupitt,
I appreciate your help. Thank you!
You'd need to share your benchmark for me to comment on the speed.
SSD
(pytorch_env) nd26@volta:~/Downloads$ time vips avg CMU-1.svs
233.716228
real 0m56.508s
user 0m57.908s
sys 0m3.824s
NVME
(pytorch_env) nd26@volta:~/Downloads$ time vips avg /nvme1/CMU-1.svs
233.716228
real 0m56.817s
user 0m57.760s
sys 0m4.100s
HDD
(pytorch_env) nd26@volta:~/Downloads$ time vips avg /media/nd26/data/CMU-1.svs
233.716228
real 0m56.622s
user 0m57.984s
sys 0m4.076s
Is this for machine learning?
Yes. Just to clarify, I do not need to simply preprocess the whole slide images a priori into tiles, but instead read patches on the go while the network is training, which is why I am trying to make the reading part as efficient as possible.
pyvips 2.1.6 / libvips 8.8 have a new region feature which can give a nice speedup for this type of application (generating 1000s of eg. 32x32 pixel areas).
Is this what it's been used when I employ the "crop" method?
OpenSlide uses OpenJPEG for decode, so you could investigate ways of speeding that up. There's been some talk of GPU acceleration for decode, and there are patches for AVX as well. They should help.
Could you elaborate on this? I am more than interested.
In terms of hardware, a huge CPU with many cores will help too.
Any recommendations?
Thanks again!
You wrote:
real 0m56.622s user 0m57.984s
That's curious, you should see user about 2x real on a two-core machine. What processor are you using?
libvips has quite a high set-up cost for pipelines, but when set up, runs quickly. If you are generating images of over 512x512 or so, it's fine, but for 32x32 patches or whatever, pipeline cost will dominate.
Here's a benchmark:
#!/usr/bin/python3
import sys
import pyvips
image = pyvips.Image.new_from_file(sys.argv[1])
patch_size = 64
n_across = image.width // patch_size
n_down = image.height // patch_size
x_max = n_across - 1
y_max = n_down - 1
n_patches = 0
for y in range(0, n_down):
print("row {} ...".format(y))
for x in range(0, n_across):
patch = image.crop(x * patch_size, y * patch_size,
patch_size, patch_size)
patch_f = patch.fliphor()
patch0 = patch.write_to_memory()
patch1 = patch.rot90().write_to_memory()
patch2 = patch.rot180().write_to_memory()
patch3 = patch.rot270().write_to_memory()
patch4 = patch_f.write_to_memory()
patch5 = patch_f.rot90().write_to_memory()
patch6 = patch_f.rot180().write_to_memory()
patch7 = patch_f.rot270().write_to_memory()
n_patches += 8
print("{} patches generated".format(n_patches))
So this is building a new pipeline for every patch, executing it, and shutting it down. On this machine, it generates 12,512 patches in 41s from CMU-1-Small-region.svs
.
Here's a version using the new libvips 8.8 fetch
feature:
#!/usr/bin/python3
import sys
import pyvips
def fetch(region, patch_size, x, y):
return region.fetch(patch_size * x, patch_size * y, patch_size, patch_size)
image[0] = pyvips.Image.new_from_file(sys.argv[1])
image[1] = image0.rot90()
image[2] = image0.rot180()
image[3] = image0.rot270()
image[4] = image[0].fliphor()
image[5] = image[4].rot90()
image[6] = image[4].rot180()
image[7] = image[4].rot270()
reg = [pyvips.Region.new(x) for x in image]
patch_size = 64
n_across = image0.width // patch_size
n_down = image0.height // patch_size
x_max = n_across - 1
y_max = n_down - 1
n_patches = 0
for y in range(0, n_down):
print("row {} ...".format(y))
for x in range(0, n_across):
patch0 = fetch(reg[0], patch_size, x, y)
patch1 = fetch(reg[1], patch_size, y_max - y, x)
patch2 = fetch(reg[2], patch_size, x_max - x, y_max - y)
patch3 = fetch(reg[3], patch_size, y, x_max - x)
patch4 = fetch(reg[4], patch_size, x_max - x, y)
patch5 = fetch(reg[5], patch_size, y_max - y, x_max - x)
patch6 = fetch(reg[6], patch_size, x, y_max - y)
patch7 = fetch(reg[7], patch_size, y, x)
n_patches += 8
print("{} patches generated".format(n_patches))
Now this is making 8 pipelines and fetching patches from them. It's not rebuilding the pipeline each time, so it's much quicker. This version makes 12,512 patches in 0.5s from CMU-1-Small-region.svs
.
re. OpenJPEG, have a look at at the mailing list, there has been some discussion of AVX and OpenCL acceleration, plus some branches with code.
That's curious, you should see user about 2x real on a two-core machine. What processor are you using?
description: CPU
product: Intel(R) Core(TM) i3-8350K CPU @ 4.00GHz
libvips has quite a high set-up cost for pipelines, but when set up, runs quickly. If you are generating images of over 512x512 or so, it's fine, but for 32x32 patches or whatever, pipeline cost will dominate.
I generate patches with a resolution that is a multiple of 224x224. 448x448 and 896x896 pixels are the most common. It's a hyperparameter.
Here's a version using the new libvips 8.8 fetch feature:
I've ran into the "libvips too old" error while trying to test this. Installation seems to have went well, but not integrated into anaconda. I've downloaded "vips-8.8.0-rc3.tar.gz" from https://github.com/libvips/libvips/releases/tag/v8.8.0-rc3 and followed the typical configure-make-install procedure with no errors. Do you know why this might be the case?
re. OpenJPEG, have a look at at the mailing list, there has been some discussion of AVX and OpenCL acceleration, plus some branches with code.
Where can I find their mailing list? Sorry, I couldn't find it.
I'd guess you have several libvipses installed and it's picking up the wrong one. Try:
import logging
logging.basicConfig(level=logging.DEBUG)
import pyvips
And you'll see some diagnostics.
ojg mailing list:
https://groups.google.com/forum/#!forum/openjpeg/join
Hi @jcupitt,
I just had the chance to get back to this. I've managed to run your code, and indeed, the latter (with fetch) is impressively faster. Correct me if I am wrong, but I think the following way is the only way I can take advantage of the fetch method for my problem.
If I am not mistaken, the speedup is due to having the pipelines set up once, instead of every time a patch is needed. However, in my application, a single patch is retrieved from a single whole slide image. This happens around 10 (.. the batch size) times a second from different whole slide images. I cannot know a priori which patches will need to be extracted, and since patches are extracted sequentially from all whole slide images, the only way I can see the above working is to have a pipeline setup for each whole slide image before the training commences, and then when a patch needs to be extracted, to simply retrieve the pipeline and use the fetch function instead of "read_region". Is this correct?
Empirically, do you think this will speedup the reading operation? Note that the patches have 896 x 896 pixels.
Those are big patches!
How many slide images will you be working from?
Those are big patches!
Indeed!
How many slide images will you be working from?
Around 650!
650 isn't so many. I would open them all (ie. make 650 pipelines) and then use fetch to pull out tiles.
You could try using crop as well, but fetch would probably be quicker.
Hopefully you have loads of memory!
650 isn't so many. I would open them all (ie. make 650 pipelines) and then use fetch to pull out tiles.
You could try using crop as well, but fetch would probably be quicker.
Hopefully you have loads of memory!
Alright, I will do that once the current experiment ends. I have 32GB RAM, and 24GB GPU VRAM. Hopefully, it will be enough.
By the way, you said:
That's curious, you should see user about 2x real on a two-core machine. What processor are you using?
Is this an issue/suboptimality given the following CPU?
description: CPU product: Intel(R) Core(TM) i3-8350K CPU @ 4.00GHz
Try --vips-progress
, it gives some more info about computation:
$ time vips avg CMU-2.svs --vips-progress
vips temp-1: 78000 x 30462 pixels, 8 threads, 128 x 128 tiles, 256 lines in buffer
vips temp-1: done in 26.8s
230.618127
real 0m26.848s
user 2m3.476s
sys 0m1.818s
So I have 4 cores, 8 threads and see ~4x speedup.
@nd26 did you manage to get any boost in reading performance using pyvips instead of openslide in the batch generator? I'm studying the same thing right now, but I'm not finding openslide to be that slow tbh. Of course storing patches in a hdf5-file and reading from that directly is faster, but not more than about 2x. And isn't openslide already using libvips?
You wrote:
real 0m56.622s user 0m57.984s
That's curious, you should see user about 2x real on a two-core machine. What processor are you using?
libvips has quite a high set-up cost for pipelines, but when set up, runs quickly. If you are generating images of over 512x512 or so, it's fine, but for 32x32 patches or whatever, pipeline cost will dominate.
Here's a benchmark:
#!/usr/bin/python3 import sys import pyvips image = pyvips.Image.new_from_file(sys.argv[1]) patch_size = 64 n_across = image.width // patch_size n_down = image.height // patch_size x_max = n_across - 1 y_max = n_down - 1 n_patches = 0 for y in range(0, n_down): print("row {} ...".format(y)) for x in range(0, n_across): patch = image.crop(x * patch_size, y * patch_size, patch_size, patch_size) patch_f = patch.fliphor() patch0 = patch.write_to_memory() patch1 = patch.rot90().write_to_memory() patch2 = patch.rot180().write_to_memory() patch3 = patch.rot270().write_to_memory() patch4 = patch_f.write_to_memory() patch5 = patch_f.rot90().write_to_memory() patch6 = patch_f.rot180().write_to_memory() patch7 = patch_f.rot270().write_to_memory() n_patches += 8 print("{} patches generated".format(n_patches))
So this is building a new pipeline for every patch, executing it, and shutting it down. On this machine, it generates 12,512 patches in 41s from
CMU-1-Small-region.svs
.Here's a version using the new libvips 8.8
fetch
feature:#!/usr/bin/python3 import sys import pyvips def fetch(region, patch_size, x, y): return region.fetch(patch_size * x, patch_size * y, patch_size, patch_size) image[0] = pyvips.Image.new_from_file(sys.argv[1]) image[1] = image0.rot90() image[2] = image0.rot180() image[3] = image0.rot270() image[4] = image[0].fliphor() image[5] = image[4].rot90() image[6] = image[4].rot180() image[7] = image[4].rot270() reg = [pyvips.Region.new(x) for x in image] patch_size = 64 n_across = image0.width // patch_size n_down = image0.height // patch_size x_max = n_across - 1 y_max = n_down - 1 n_patches = 0 for y in range(0, n_down): print("row {} ...".format(y)) for x in range(0, n_across): patch0 = fetch(reg[0], patch_size, x, y) patch1 = fetch(reg[1], patch_size, y_max - y, x) patch2 = fetch(reg[2], patch_size, x_max - x, y_max - y) patch3 = fetch(reg[3], patch_size, y, x_max - x) patch4 = fetch(reg[4], patch_size, x_max - x, y) patch5 = fetch(reg[5], patch_size, y_max - y, x_max - x) patch6 = fetch(reg[6], patch_size, x, y_max - y) patch7 = fetch(reg[7], patch_size, y, x) n_patches += 8 print("{} patches generated".format(n_patches))
Now this is making 8 pipelines and fetching patches from them. It's not rebuilding the pipeline each time, so it's much quicker. This version makes 12,512 patches in 0.5s from
CMU-1-Small-region.svs
.
@jcupitt can you write this same code in C language for generating patches using libvips functions: vips_image_new_from_file(),vips_crop() ,vips_image_write_to_file()?
It's pretty easy, just build the pipeline, then use vips_region_fetch()
to pull pixels out of it:
https://libvips.github.io/libvips/API/current/VipsRegion.html#vips-region-fetch
It's pretty easy, just build the pipeline, then use
vips_region_fetch()
to pull pixels out of it:https://libvips.github.io/libvips/API/current/VipsRegion.html#vips-region-fetch
@jcupitt Thanks for helping me out. I have used libvips to generate patches. But,when I included the function vips_region_fetch, it says implicit declaration of vips_region_fetch function. Here is the error:
test.c:16:18: warning: implicit declaration of function ‘vips_region_fetch’; did you mean ‘vips_region_black’? [-Wimplicit-function-declaration]
VipsPel *patch=vips_region_fetch(region,0,0,width,height,width*height);
^~~~~~~~~~~~~~~~~
vips_region_black
test.c:16:18: warning: initialization makes pointer from integer without a cast [-Wint-conversion]
/tmp/cc6OF4Sh.o: In function `main':
/content/test.c:16: undefined reference to `vips_region_fetch'
collect2: error: ld returned 1 exit status
In which header file vips_region_fetch() function is defined?
It's part of the standard API from libvips 8.8, I think. Perhaps you have an old version installed?
I have installed libvips 8.4.5.
You'll need to update -- your libvips is five years old.
Yeah!! I have updated it now to libvips 8.8 and its working fine. Thank you.
@jcupitt I applied fetch
method to 4 types of WSI for 2048*2048 patch, mrxs, tiff, ndpi, svs, they both use JPEG compression schema. their fetch time for one single patch as follows:
mrxs: average = 0.271734s
tiff: average = 0.852996s
ndpi: average = 0.318353s
svs: average = 0.329577s
Why is tiff slower than the others?
Hi @wenqf11,
fetch
is quick for small tiles and random reads, especially if you have several pipelines you want to run at once (the example above generates the 8 rotates and flips for each 32x32 patch). This is often what you'd use to train a classifier. If you want large tiles (eg. your 2048x2048), then crop
is much quicker.
This is because of how they work. fetch
skips all the libvips pipeline setup and teardown, so it can help if only a small number of pixels are going to be computed. But because there's no pipeline, it can't do much parallelization. crop
builds a full pipeline for each tile, so start and stop are slower, but it can compute pixels in parallel.
Here's a benchmark:
#!/usr/bin/python3
import random
import time
import sys
import pyvips
if len(sys.argv) != 4:
print("usage: ./fetch-vs-crop.py IMAGE SIZE N-TILES")
size = int(sys.argv[2])
n_tiles = int(sys.argv[3])
print(f"{n_tiles} tiles, each {size}x{size} pixels")
image = pyvips.Image.new_from_file(sys.argv[1])
print(f"random read")
start = time.time()
for _ in range(n_tiles):
x = random.randint(0, image.width - size)
y = random.randint(0, image.height - size)
tile = image.crop(x, y, size, size)
# necessary to force evaluation
avg = tile.avg()
end = time.time()
print(f"crop: time per tile = {1000 * (end - start) / n_tiles:.2f}ms")
start = time.time()
region = pyvips.Region.new(image)
for _ in range(n_tiles):
x = random.randint(0, image.width - size)
y = random.randint(0, image.height - size)
p = region.fetch(x, y, size, size)
end = time.time()
print(f"fetch: time per tile = {1000 * (end - start) / n_tiles:.2f}ms")
print(f"sequential read")
image = pyvips.Image.new_from_file(sys.argv[1])
start = time.time()
tile_number = 0
# most tile formats are column-major internally
for y in range(0, image.height - size, size):
for x in range(0, image.width - size, size):
tile = image.crop(x, y, size, size)
# necessary to force evaluation
avg = tile.avg()
tile_number += 1
if tile_number > n_tiles:
break
if tile_number > n_tiles:
break
end = time.time()
print(f"crop: time per tile = {1000 * (end - start) / n_tiles:.2f}ms")
start = time.time()
tile_number = 0
region = pyvips.Region.new(image)
for y in range(0, image.height - size, size):
for x in range(0, image.width - size, size):
x = random.randint(0, image.width - size)
y = random.randint(0, image.height - size)
p = region.fetch(x, y, size, size)
tile_number += 1
if tile_number > n_tiles:
break
if tile_number > n_tiles:
break
end = time.time()
print(f"fetch: time per tile = {1000 * (end - start) / n_tiles:.2f}ms")
I see:
$ ./fetch-vs-crop.py ~/pics/openslide/CMU-1.svs 32 1000
1000 tiles, each 32x32 pixels
random read
crop: time per tile = 2.70ms
fetch: time per tile = 2.01ms
sequential read
crop: time per tile = 1.94ms
fetch: time per tile = 1.92ms
So for one small tile (no flips and rotates), fetch
is quicker for random reads. They are the same speed for sequential reads, since the cache helps both equally. fetch
would be faster if there were flips as well.
The situation changes with large tiles:
$ ./fetch-vs-crop.py ~/pics/openslide/CMU-1.svs 2048 100
100 tiles, each 2048x2048 pixels
random read
crop: time per tile = 21.40ms
fetch: time per tile = 130.32ms
sequential read
crop: time per tile = 19.11ms
fetch: time per tile = 129.85ms
Now fetch is very slow because it can't parallelize. There's little difference between sequential and random, since the libvips pixel caches won't help.
tiff is slower for parallel JPEG read since the libtiff JPEG decoder is single-threaded.
@jcupitt ok, thanks a lot!
patch0 = patch.write_to_memory() patch1 = patch.rot90().write_to_memory() patch2 = patch.rot180().write_to_memory() patch3 = patch.rot270().write_to_memory() patch4 = patch_f.write_to_memory() patch5 = patch_f.rot90().write_to_memory() patch6 = patch_f.rot180().write_to_memory() patch7 = patch_f.rot270().write_to_memory()
Hi, I am trying to use this code for Deep Learning Training. How to convert these patches, say patch0, to a numpy array?
Hi @lafith,
Please open a new issue for new questions -- it'll make it easier for other people to find answers.