
Example image data - histology slide

Open GenevieveBuckley opened this issue 6 years ago • 8 comments

We want to use a histology slide image from the 2016 Camelyon dataset (CC0): https://camelyon17.grand-challenge.org/Data/

This thread contains details specific to this dataset.

Related to the larger discussion here: https://github.com/dask/dask-image/issues/107

GenevieveBuckley avatar Aug 05 '19 22:08 GenevieveBuckley

Comparing the different compression algorithms and compression levels, zstd at compression level 9 (the maximum) was the most effective: it reduced a 381.5MB array to 507KB.

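import numpy as np
import zarr
from numcodecs import Blosc

# cname and clevel vary between the runs shown in the output below.
compressor = Blosc(cname='zlib', clevel=9, shuffle=Blosc.BITSHUFFLE)
# Synthetic test array: 10000 x 10000 int32 values (400000000 bytes uncompressed).
data = np.arange(100000000, dtype='i4').reshape(10000, 10000)
z = zarr.array(data, chunks=(1000, 1000), compressor=compressor)
print(z.compressor)
print(z.info)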

Code used from https://zarr.readthedocs.io/en/stable/tutorial.html

Summary of Results: detailed compression results.pdf

(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 3379344 (3.2M)
Storage ratio      : 118.4
Chunks initialized : 100/100
Note: 2 seconds

(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='zstd', clevel=9, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='zstd', clevel=9, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 519332 (507.2K)
Storage ratio      : 770.2
Chunks initialized : 100/100
Note: 45 seconds


(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='blosclz', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='blosclz', clevel=3, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 13543704 (12.9M)
Storage ratio      : 29.5
Chunks initialized : 100/100

(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='lz4', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=3, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 13788015 (13.1M)
Storage ratio      : 29.0
Chunks initialized : 100/100

(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='lz4hc', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4hc', clevel=3, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 5137515 (4.9M)
Storage ratio      : 77.9
Chunks initialized : 100/100

(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='zlib', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='zlib', clevel=3, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 5129740 (4.9M)
Storage ratio      : 78.0
Chunks initialized : 100/100

(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='snappy', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='snappy', clevel=3, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 25851986 (24.7M)
Storage ratio      : 15.5
Chunks initialized : 100/100
Note: 2 seconds

(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='snappy', clevel=9, shuffle=BITSHUFFLE, blocksize=0)
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='snappy', clevel=9, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 25851986 (24.7M)
Storage ratio      : 15.5
Chunks initialized : 100/100
Note: 2 seconds

timbo8 avatar Aug 06 '19 00:08 timbo8

Thank you @timbo8! This is very helpful for us :)

GenevieveBuckley avatar Aug 06 '19 00:08 GenevieveBuckley

@timothywallaby has done a bunch of work at the PyCon AU sprints working out how to save openslide images as compressed zarr files. This gist shows how to do that if you can fit the whole array in memory. Thanks @timothywallaby!

We're working on how to append to zarr arrays now, for cases where you cannot fit the entire image into memory.

GenevieveBuckley avatar Aug 06 '19 04:08 GenevieveBuckley

The code from @sofroniewn is here: https://github.com/sofroniewn/image-demos/blob/master/helpers/make_2D_zarr_pathology.py

The instructions were not to use it as is until we can work out why the saved file is bigger than the original tiff. Personally I also feel that for this purpose we don't really need the multilevel hierarchy, so that might make things a bit simpler.

GenevieveBuckley avatar Aug 07 '19 08:08 GenevieveBuckley

Yeah, I think @thewtex has had similar success with zstd. He may also have some good thoughts on histology datasets that we could look at.

jakirkham avatar Aug 15 '19 10:08 jakirkham

@dzenanz surveyed a wide variety of codecs and compression levels over a diverse set of image datasets. In general, he also found that zstd and lz4 with Blosc bitshuffle enabled performed best. However, a compression level of 9 did not justify the increased compute time over a compression level of 3. @dzenanz could you please share your results?
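
As a rough illustration of that trade-off, here is the arithmetic on the two zstd runs reported earlier in this thread. Those runs used a synthetic np.arange array, not image data, and the sketch assumes the "Note: N seconds" lines are wall-clock run times, so treat this as a worked example, not a benchmark.

```python
# Figures copied from the zstd runs reported earlier in this thread.
raw_bytes = 400_000_000
runs = {
    3: {"stored_bytes": 3_379_344, "seconds": 2},
    9: {"stored_bytes": 519_332, "seconds": 45},
}

for clevel, run in runs.items():
    ratio = raw_bytes / run["stored_bytes"]
    print(f"clevel={clevel}: storage ratio {ratio:.1f}, {run['seconds']} s")

# How much extra space level 9 saves, relative to the uncompressed size.
extra_saved = runs[3]["stored_bytes"] - runs[9]["stored_bytes"]
extra_saved_pct = 100 * extra_saved / raw_bytes

# How much longer level 9 took.
slowdown = runs[9]["seconds"] / runs[3]["seconds"]

print(f"extra space saved by clevel=9: {extra_saved} bytes "
      f"({extra_saved_pct:.2f}% of the raw size)")
print(f"slowdown of clevel=9 over clevel=3: {slowdown:.1f}x")
```

On these numbers, level 9 saves a further 2860012 bytes (under 1% of the raw size) at 22.5 times the run time.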

thewtex avatar Aug 26 '19 03:08 thewtex

More histopathology images can be found here:

https://digitalpathologyassociation.org/whole-slide-imaging-repository

thewtex avatar Aug 26 '19 03:08 thewtex

In this thread, the first post is my explanation of the benchmark.

dzenanz avatar Aug 26 '19 14:08 dzenanz