ome-zarr-py
ome-zarr-py copied to clipboard
Error when writing to disk a dask-backed xarray
The following code example produces an error on the last line.
The code does the following:
- creates a 3D array (cyx axes) and writes it to zarr
- loads it from zarr
- writes to another file what has been loaded
##
import os.path
import numpy as np
import zarr
import shutil
from ome_zarr.writer import write_image
from ome_zarr.io import parse_url
from ome_zarr.io import ZarrLocation
from ome_zarr.reader import Multiscales, Reader
from ome_zarr.format import CurrentFormat
import dask.array.core
im = np.random.normal(size=(3, 100, 100))
fmt = CurrentFormat()
## write
def write_to_zarr(im, f):
if os.path.isdir(f):
shutil.rmtree(f)
store = parse_url(f, mode="w").store
group = zarr.group(store=store)
write_image(im, group, axes=["c", "x", "y"], fmt=fmt)
write_to_zarr(im, "debug0.zarr")
## read
loc = ZarrLocation("debug0.zarr")
reader = Reader(loc)()
nodes = list(reader)
assert len(nodes) == 1
node = nodes[0]
im_read = node.load(Multiscales).array(resolution="0", version=fmt.version)
## write again (error)
write_to_zarr(im_read, "debug1.zarr")
##
import xarray as xr
xr.DataArray(im).chunks
The error is the following:
Traceback (most recent call last):
File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3335, in run_ast_nodes
code = compiler(mod, cell_name, mode)
File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/codeop.py", line 143, in __call__
codeob = compile(source, filename, symbol, self.flags, True)
File "<ipython-input-3-61b5ce34eaa1>", line 1
SyntaxError: 'return' outside function
Traceback (most recent call last):
File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/hierarchy.py", line 800, in _write_op
return f(*args, **kwargs)
File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/hierarchy.py", line 965, in _create_dataset_nosync
a = array(data, store=self._store, path=path, chunk_store=self._chunk_store,
**File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/creation.py", line 380, in array
z[...] = data**
File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/core.py", line 1353, in __setitem__
self.set_basic_selection(pure_selection, value, fields=fields)
File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/core.py", line 1448, in set_basic_selection
return self._set_basic_selection_nd(selection, value, fields=fields)
File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/core.py", line 1746, in _set_basic_selection_nd
indexer = BasicIndexer(selection, self)
File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/indexing.py", line 342, in __init__
dim_indexer = SliceDimIndexer(dim_sel, dim_len, dim_chunk_len)
File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/indexing.py", line 176, in __init__
self.nchunks = ceildiv(self.dim_len, self.dim_chunk_len)
File "/Users/macbook/miniconda3/envs/ome/lib/python3.9/site-packages/zarr/indexing.py", line 160, in ceildiv
return math.ceil(a / b)
TypeError: unsupported operand type(s) for /: 'int' and 'list'
python-BaseException
Investigating the cause of the problem I found what follows. In the first write_image()
call, the im
array, which is a np.ndarray
, is converted to a zarr.core.Array
in creation.py
(see the bold line in the stack trace). More precisely, first an empty object is initialized and then it is filled with data.
# creation.py
# instantiate array
z = create(**kwargs)
# fill with data
z[...] = data
After the instantiation line, the object z
has an attribute chunks
which is (3, 100, 100)
. So far so good.
When instead write_image()
is called for the second time, the im
array is a dask.array.core.Array
, and after the intantiation line above the attribute z.chunks
is ([3], [100], [100])
. This triggers the error that I am reporting at the line that follows (filling z
with data).
From a performance point of view, what my code is asking is to take data that is on disk and write it somewhere else on disk, without the need to load the content in memory. So a workaround would be that I modify the write_to_zarr()
function so that if the im
data that I want to copy is a dask.array.core.Array
, I instead manually do a copy at the file level, without using the ome-zarr-py
APIs.
Any advice in how to proceed?
Hi, apologies for the delay...
I wonder if you could try this installing ome-zarr-py
from the branch at https://github.com/ome/ome-zarr-py/pull/192
since I'm hoping that will do what you want.
I have tried that branch and the code now works, thank you!
Sorry to reopen already, I actually noticed a problem. There is no error in writing the files and the output looks the same, but what is written is slightly different in the .zarray
.
Here is the first chunk of the image, first written from an in-memory array and then from the delayed representation. As you can see the "compressor"
key is different.
macbook@MBP-di-macbook 0 % cat tmp.zarr/image_0/0/.zarray
{
"chunks": [
3,
100,
100
],
"compressor": {
"blocksize": 0,
"clevel": 5,
"cname": "lz4",
"id": "blosc",
"shuffle": 1
},
"dimension_separator": "/",
"dtype": "<f8",
"fill_value": 0.0,
"filters": null,
"order": "C",
"shape": [
3,
100,
100
],
"zarr_format": 2
}%
macbook@MBP-di-macbook 0 % cat tmp2.zarr/image_0/0/.zarray
{
"chunks": [
3,
100,
100
],
"compressor": null,
"dimension_separator": "/",
"dtype": "<f8",
"fill_value": 0.0,
"filters": null,
"order": "C",
"shape": [
3,
100,
100
],
"zarr_format": 2
}%
I found a more serious bug. I disabled the compression passing storage_options={"compressor": None}
to write_image
, and now I noticed that the coordinate transformations stored in .zattrs
don't match. See the output below:
Thanks for the feedback @LucaMarconato, I wonder whether you are encountering the same issue that was attempted to be fixed in https://github.com/ome/ome-zarr-py/pull/197. As this branch is currently conflicting with the inclusion of the dask writing code, could you try updating the corresponding location in the writer code to use storage_options.copy()
?
I changed the write_image()
line to
write_image(im, group, axes=["c", "x", "y"], fmt=fmt, storage_options={"compressor": None}.copy())
but the coordinate transformations remain the same as in the screenshot above.
@LucaMarconato could you try https://github.com/ome/ome-zarr-py/pull/220 which should fix the coordinateTransforms
issue above?
I don't know the reason for the compressor being different. I can only assume different default behaviour between group.create_dataset()
for numpy data and dask.to_zarr()
.
I tried the fix in https://github.com/ome/ome-zarr-py/pull/197/ but it doesn't change the default behaviour in the example code above (when no compressor is specified)
I tried #220 and it fixes the coordinateTransformations
entries, which now coincide. But some of the binary data is not identical!
(ome) macbook@MBP-di-macbook temp % diff debug0.zarr/0/0/0/0 debug1.zarr/0/0/0/0
(ome) macbook@MBP-di-macbook temp % diff debug0.zarr/1/0/0/0 debug1.zarr/1/0/0/0
Binary files debug0.zarr/1/0/0/0 and debug1.zarr/1/0/0/0 differ
I have tried both passing {'compressor': compressor}
with compressor=None
and compressor=Blosc(cname="zstd", clevel=5, shuffle=Blosc.NOSHUFFLE)
but the bug remains. So I think this is not due to the compressor.
Since the full resolution image (0
) is identical maybe this is due to small numerical unavoidable errors when downscaling into the smaller levels of the image pyramid. I will now check if this is the case.
No, that's not the case. Those numbers are very different, but they have mostly the same sign, so I suspect it is something related to computing the downscaled versions with different chunks but that are mostly overlapping, maybe a few pixels off.
I think this is now failing due to the bug fixed in #197.
The storage_options
are not passed down to subsequent levels after the first level is written.
With that fix, and updating your script to use:
write_image(im, group, axes=["c", "x", "y"], fmt=fmt, storage_options={'compressor': None})
I now get no difference between downscaled chunks. Hope you see the same?
Instead of testing with PR #197 (which has merge conflicts), can you try with https://github.com/ome/ome-zarr-py/pull/221 where I've fixed the merge conflict on top of master branch.
Thanks, I have tried with #221 but I still get different chunks. Here is the code that I use, updated from above. I also check for numerical approximation errors (it's not the cause).
##
import os.path
import shutil
import dask.array.core
import numpy as np
import zarr
from ome_zarr.format import CurrentFormat
from ome_zarr.io import ZarrLocation, parse_url
from ome_zarr.reader import Multiscales, Reader
from ome_zarr.writer import write_image
import filecmp
im = np.random.normal(size=(3, 100, 100))
fmt = CurrentFormat()
# write
def write_to_zarr(im, f):
if os.path.isdir(f):
shutil.rmtree(f)
store = parse_url(f, mode="w").store
group = zarr.group(store=store)
if isinstance(im, np.ndarray) or isinstance(im, dask.array.core.Array):
write_image(im, group, axes=["c", "x", "y"], fmt=fmt, storage_options={"compressor": None})
else:
raise ValueError("the array to write must be a numpy array or a dask array")
write_to_zarr(im, "debug0.zarr")
# read
loc = ZarrLocation("debug0.zarr")
reader = Reader(loc)()
nodes = list(reader)
assert len(nodes) == 1
node = nodes[0]
im_read = node.load(Multiscales).array(resolution="0", version=fmt.version)
# write again (error)
write_to_zarr(im_read, "debug1.zarr")
# from stackoverflow
class dircmp(filecmp.dircmp): # type: ignore[type-arg]
"""
Compare the content of dir1 and dir2. In contrast with filecmp.dircmp, this
subclass compares the content of files with the same path.
"""
def phase3(self) -> None:
"""
Find out differences between common files.
Ensure we are using content comparison with shallow=False.
"""
fcomp = filecmp.cmpfiles(self.left, self.right, self.common_files, shallow=False)
self.same_files, self.diff_files, self.funny_files = fcomp
def are_directories_identical(dir1, dir2):
compared = dircmp(dir1, dir2)
if compared.left_only or compared.right_only or compared.diff_files or compared.funny_files:
return False
for subdir in compared.common_dirs:
if not are_directories_identical(
os.path.join(dir1, subdir),
os.path.join(dir2, subdir),
):
return False
return True
# inspect the content of the files to manually check for numerical approximation errors
# if the output matrices are not close up to the machine precision, then the difference is not due to numerical errors
loc = ZarrLocation('debug0.zarr')
reader = Reader(loc)()
nodes = list(reader)
assert len(nodes) == 1
node = nodes[0]
im_read = node.load(Multiscales).array(resolution="1", version=CurrentFormat().version)
loc = ZarrLocation('debug1.zarr')
reader = Reader(loc)()
nodes = list(reader)
assert len(nodes) == 1
node = nodes[0]
im_read2 = node.load(Multiscales).array(resolution="1", version=CurrentFormat().version)
print(im_read[:5, :5, 0].compute())
print(im_read2[:5, :5, 0].compute())
##
assert are_directories_identical("debug0.zarr", "debug1.zarr")
Thanks for that...
Sorry, I realised that in asking you to test with #221 above, this omitted #220, which fixes another resize issue (so the .zattrs
is different).
With both of those branches merged, the assertion fails because the resized chunks are different. e.g.
debug0.zarr/1/0/0 debug1.zarr/1/0/0
It's hard to "see" what's going on with the random pixel data, so I tried using a sample generated with...
ome_zarr create debug0.zarr --method=astronaut
Then I tweaked your script to be able to use this (not overwrite if debug0.zarr
exists. Also added some printing to show which files fail the assert are_directories_identical()
.
And I commented out assert len(nodes) == 1
where needed, since the sample image has 'labels' nodes etc.
full script
import os.path
import shutil
import dask.array.core
import numpy as np
import zarr
from ome_zarr.format import CurrentFormat
from ome_zarr.io import ZarrLocation, parse_url
from ome_zarr.reader import Multiscales, Reader
from ome_zarr.writer import write_image
import filecmp
im = np.random.normal(size=(3, 100, 100))
fmt = CurrentFormat()
# write
input_img = "debug0.zarr"
output_img = "debug1.zarr"
def write_to_zarr(im, f):
if os.path.isdir(f):
shutil.rmtree(f)
store = parse_url(f, mode="w").store
group = zarr.group(store=store)
if isinstance(im, np.ndarray) or isinstance(im, dask.array.core.Array):
write_image(im, group, axes=["c", "x", "y"], fmt=fmt, storage_options={"compressor": None})
else:
raise ValueError("the array to write must be a numpy array or a dask array")
# allow use of existing data if it exists...
if not os.path.isdir(input_img):
write_to_zarr(im, input_img)
# read
loc = ZarrLocation(input_img)
reader = Reader(loc)()
nodes = list(reader)
# assert len(nodes) == 1
node = nodes[0]
im_read = node.load(Multiscales).array(resolution="0", version=fmt.version)
# write again (error)
write_to_zarr(im_read, output_img)
# from stackoverflow
class dircmp(filecmp.dircmp): # type: ignore[type-arg]
"""
Compare the content of dir1 and dir2. In contrast with filecmp.dircmp, this
subclass compares the content of files with the same path.
"""
def phase3(self) -> None:
"""
Find out differences between common files.
Ensure we are using content comparison with shallow=False.
"""
fcomp = filecmp.cmpfiles(self.left, self.right, self.common_files, shallow=False)
self.same_files, self.diff_files, self.funny_files = fcomp
def are_directories_identical(dir1, dir2):
compared = dircmp(dir1, dir2)
if compared.left_only or compared.right_only or compared.diff_files or compared.funny_files:
print('compared.left_only', compared.left_only)
print('compared.right_only', compared.right_only)
print('compared.diff_files', compared.diff_files)
print('compared.funny_files', compared.funny_files)
return False
for subdir in compared.common_dirs:
if not are_directories_identical(
os.path.join(dir1, subdir),
os.path.join(dir2, subdir),
):
print("DIFF", os.path.join(dir1, subdir), os.path.join(dir2, subdir))
return False
return True
# inspect the content of the files to manually check for numerical approximation errors
# if the output matrices are not close up to the machine precision, then the difference is not due to numerical errors
loc = ZarrLocation(input_img)
reader = Reader(loc)()
nodes = list(reader)
# assert len(nodes) == 1
node = nodes[0]
im_read = node.load(Multiscales).array(resolution="1", version=CurrentFormat().version)
loc = ZarrLocation(output_img)
reader = Reader(loc)()
nodes = list(reader)
# assert len(nodes) == 1
node = nodes[0]
im_read2 = node.load(Multiscales).array(resolution="1", version=CurrentFormat().version)
print(im_read[:5, :5, 0].compute())
print(im_read2[:5, :5, 0].compute())
##
assert are_directories_identical(input_img, output_img)
Running the script.... the assert are_directories_identical()
fails due to the presence of labels
, and the .zattrs
is different because the original contains extra omero
channel information. But this is easier to compare the mis-match pixel values...
$ python write_dask_test.py
[[144 214 235 235 228]
[141 206 228 226 222]
[143 206 224 226 217]]
[[146 204 232 234 228]
[140 197 225 227 222]
[147 196 222 224 217]]
compared.left_only ['labels']
compared.right_only []
compared.diff_files ['.zattrs']
compared.funny_files []
Traceback (most recent call last):
File "/Users/wmoore/Desktop/python-scripts/zarr_scripts/write_dask_test2.py", line 99, in <module>
assert are_directories_identical(input_img, output_img)
AssertionError
To visually compare them I opened the lower resolution array in napari...
Choosing NOT to use ome-zarr-py
to read. Either npe2
or builtins
works fine.
$ napari debug0.zarr/1
Then I dragged the dir debug1.zarr/1
onto the napri window to open that in the same way.
By colouring one of these layers red
and one green
and using additive
blending of the top layer, you get a yellow layer where the pixel values are equal.
But if you zoom in you can see where there is a difference, where some pixels are more red or green than yellow.
I'm not entirely sure why this is happening.
As far as I can see, both the dask and non-dask resizing are using skimage.transform.resize
to do the underlying resizing.
The non-dask implementation is doing this with a 2D plane at a time, whereas the dask resize is using da.map_blocks
to iterate through the chunks, applying the resize.
In order to dig deeper, we'd need a test that directly compares the resize methods themselves, comparing resize of different shaped arrays etc. It might be possible to get the same pixel values in resized data via the 2 different methods, but it might not. I'm not familiar enough with zarr/dask to know without more work.
Hopefully that's possible, but if they don't get any closer in their output, how much of a problem is that? I'm assuming that downsampling is mostly (exclusively?) used for visualisation rather than analysis, so the slight differences we're seeing here might not be a deal-breaker?
Anyway, thanks for highlighting these differences. It's good to know, and I'll see if I can understand what's going on a bit better...
Perfect thanks a lot for the detailed explanation! I have modified our tests so that we compare only the full resolution image, since those small differences should not be a problem for visualization of downstream tasks based on the lower resolution images.
Dask writing support now released in ome-zarr-py 0.6.0
.
If there are any new issues with that, please open them in another issue. Closing this one for now...