dask-image Example image data

We want to use a histology slide image from the 2016 Camelyon dataset (CC0): https://camelyon17.grand-challenge.org/Data/

This thread contains details specific to this dataset.

Related to the larger discussion here: https://github.com/dask/dask-image/issues/107

Aug 05 '19 22:08 GenevieveBuckley

Comparing the different Blosc compression algorithms and compression levels, zstd at compression level 9 (the maximum) was the most effective: it reduced a 381.5 MB array to 507 KB.

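import numpy as np
import zarr
from numcodecs import Blosc

# One codec/level combination per run; cname and clevel were edited between the runs reported below.
compressor = Blosc(cname='zlib', clevel=9, shuffle=Blosc.BITSHUFFLE)
# 10000 x 10000 int32 test array (400,000,000 bytes uncompressed)
data = np.arange(100000000, dtype='i4').reshape(10000, 10000)
# Store as 100 chunks of 1000 x 1000, each compressed independently
z = zarr.array(data, chunks=(1000, 1000), compressor=compressor)
print(z.compressor)
print(z.info)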

Code adapted from the zarr tutorial: https://zarr.readthedocs.io/en/stable/tutorial.html

Summary of Results: detailed compression results.pdf

(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 3379344 (3.2M)
Storage ratio      : 118.4
Chunks initialized : 100/100
Note: 2 seconds

(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='zstd', clevel=9, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='zstd', clevel=9, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 519332 (507.2K)
Storage ratio      : 770.2
Chunks initialized : 100/100
Note: 45 seconds


(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='blosclz', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='blosclz', clevel=3, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 13543704 (12.9M)
Storage ratio      : 29.5
Chunks initialized : 100/100

(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='lz4', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=3, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 13788015 (13.1M)
Storage ratio      : 29.0
Chunks initialized : 100/100

(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='lz4hc', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4hc', clevel=3, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 5137515 (4.9M)
Storage ratio      : 77.9
Chunks initialized : 100/100

(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='zlib', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='zlib', clevel=3, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 5129740 (4.9M)
Storage ratio      : 78.0
Chunks initialized : 100/100

(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='snappy', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='snappy', clevel=3, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 25851986 (24.7M)
Storage ratio      : 15.5
Chunks initialized : 100/100
Note: 2 seconds

(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='snappy', clevel=9, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='snappy', clevel=9, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 25851986 (24.7M)
Storage ratio      : 15.5
Chunks initialized : 100/100
Note: 2 seconds
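
As a sanity check on the reports above, the short sketch below recomputes the array size, chunk count, and storage ratios from the printed byte counts. The numbers are copied from the output above; the variable names are only illustrative.

```python
# Figures copied from the zarr info reports above.
UNCOMPRESSED_BYTES = 400_000_000

# (codec, clevel) -> (bytes stored, storage ratio as printed by zarr)
reported = {
    ("zstd", 3): (3_379_344, 118.4),
    ("zstd", 9): (519_332, 770.2),
    ("blosclz", 3): (13_543_704, 29.5),
    ("lz4", 3): (13_788_015, 29.0),
    ("lz4hc", 3): (5_137_515, 77.9),
    ("zlib", 3): (5_129_740, 78.0),
    ("snappy", 3): (25_851_986, 15.5),
    ("snappy", 9): (25_851_986, 15.5),
}

# Array geometry from the script: 10000 x 10000 int32, chunks of 1000 x 1000.
shape, chunks, itemsize = (10_000, 10_000), (1_000, 1_000), 4
n_bytes = shape[0] * shape[1] * itemsize
n_chunks = (shape[0] // chunks[0]) * (shape[1] // chunks[1])

# Storage ratio = uncompressed bytes / stored bytes, rounded as in the reports.
ratios = {
    key: round(UNCOMPRESSED_BYTES / stored, 1)
    for key, (stored, _printed) in reported.items()
}

# Print the runs from best to worst ratio.
for (codec, level), ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{codec:8s} clevel={level}  ratio={ratio}")
```

Sorted this way, zstd at level 9 comes first and snappy last, and the snappy result is identical at levels 3 and 9.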

Aug 06 '19 00:08 timbo8

Thank you @timbo8! This is very helpful for us :)

Aug 06 '19 00:08 GenevieveBuckley

@timothywallaby has done a bunch of work at the PyCon AU sprints working out how to save openslide images as compressed zarr files. This gist shows how to do that if you can fit the whole array into memory. Thanks @timothywallaby!

We're working on how to append to zarr arrays now, for cases where you cannot fit the entire image into memory.

Aug 06 '19 04:08 GenevieveBuckley

The code from @sofroniewn is here: https://github.com/sofroniewn/image-demos/blob/master/helpers/make_2D_zarr_pathology.py

The instructions were not to use it as is until we can work out why the saved file is bigger than the original tiff. Personally I also feel that for this purpose we don't really need the multilevel hierarchy, so that might make things a bit simpler.

Aug 07 '19 08:08 GenevieveBuckley

Yeah I think @thewtex has similar success with zstd. He may also have some good thoughts on histology datasets that we could look at.

Aug 15 '19 10:08 jakirkham

@dzenanz surveyed a wide variety of codecs and compression levels over a diverse set of image datasets. In general, he also found that zstd and lz4 with Blosc bitshuffle enabled performed best. However, a compression level of 9 did not justify the increased compute time over a compression level of 3. @dzenanz could you please share your results?
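
For anyone who wants to measure this kind of trade-off themselves, here is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not the benchmark referred to above: it uses the built-in `zlib` module rather than Blosc, a plain byte shuffle rather than Blosc's bitshuffle, and a small synthetic array of sequential integers, so the absolute numbers are not comparable to the results in this thread. The helper names `byte_shuffle` and `byte_unshuffle` are hypothetical.

```python
import time
import zlib
from array import array

# Small synthetic stand-in for image data: sequential C ints.
values = array("i", range(250_000))
itemsize = values.itemsize
raw = values.tobytes()


def byte_shuffle(buf: bytes, itemsize: int) -> bytes:
    """Group the first byte of every item, then the second bytes, and so on."""
    return b"".join(buf[i::itemsize] for i in range(itemsize))


def byte_unshuffle(buf: bytes, itemsize: int) -> bytes:
    """Invert byte_shuffle, restoring the original item layout."""
    n = len(buf) // itemsize
    out = bytearray(len(buf))
    for i in range(itemsize):
        out[i::itemsize] = buf[i * n:(i + 1) * n]
    return bytes(out)


# (shuffled, level) -> (storage ratio, seconds spent compressing)
results = {}
for shuffled in (False, True):
    payload = byte_shuffle(raw, itemsize) if shuffled else raw
    for level in (3, 9):
        start = time.perf_counter()
        compressed = zlib.compress(payload, level)
        elapsed = time.perf_counter() - start
        ratio = len(raw) / len(compressed)
        results[(shuffled, level)] = (ratio, elapsed)
        print(f"shuffle={shuffled!s:5} level={level} "
              f"ratio={ratio:8.1f} time={elapsed * 1000:.1f} ms")

# Round trip: shuffle + compress must be fully reversible.
packed = zlib.compress(byte_shuffle(raw, itemsize), 3)
restored = byte_unshuffle(zlib.decompress(packed), itemsize)
print("round trip ok:", restored == raw)
```

Shuffling before compression places bytes of equal significance next to each other, which is why it tends to help on slowly varying integer data. Whether a higher level is worth the extra time depends on the data and the machine, so it is worth timing on a representative image.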

Aug 26 '19 03:08 thewtex

More histopathology images can be found here:

https://digitalpathologyassociation.org/whole-slide-imaging-repository

Aug 26 '19 03:08 thewtex

In this thread, the first post is my explanation of the benchmark.

Aug 26 '19 14:08 dzenanz
