Writing dask arrays to a ZipStore causes a corrupt zip file
Zarr version
3.1.3
Numcodecs version
0.16.1
Python Version
3.11.3
Operating System
Linux
Installation
Via binder, with xarray's blank_template.ipynb documentation example
Description
When dask provides the data for a zarr array backed by a ZipStore, the zip file on disk ends up corrupt and cannot be read back in; opening it raises a BadZipFile exception. This corruption does not happen with a LocalStore, and nontrivial sharding/chunking is not required.
Steps to reproduce
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
#     "dask[array]",  # added so the inline script can run the reproducer below
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues
# Reproducer:
import dask.array as da
import numpy as np
import zarr
dask_array = da.zeros((1,), dtype=np.float32)
store = zarr.storage.ZipStore("bug.zarr.zip", mode='w', read_only=False)
# store = zarr.storage.LocalStore("bug.zarr", read_only=False)  # Does not error
group = zarr.open_group(store=store, mode="w")
zarr_array = group.create_array(
    name="data",
    shape=dask_array.shape,
    dtype=dask_array.dtype,
    overwrite=True,
)
da.to_zarr(dask_array, zarr_array)
store.close()
store_read = zarr.storage.ZipStore("bug.zarr.zip", mode='r', read_only=True)
# store_read = zarr.storage.LocalStore("bug.zarr", read_only=True)  # Does not error
group_read = zarr.open_group(store=store_read, mode="r")
# Error:
## ---------------------------------------------------------------------------
## BadZipFile Traceback (most recent call last)
## [...]
## BadZipFile: Bad magic number for file header
array_read = group_read["data"]
array_read[0] # Should be array(0., dtype=float32)
# zarr.print_debug_info()
Additional output
No response
I am not an expert in the zip archive format, but my guess is that it is not safe to write concurrently. I think if you want to create zipped zarr data via dask, the best approach is probably to use dask to concurrently write data to local storage, and then zip it in a separate step.
I'm not sure how we can guard against this corruption at runtime without implementing a distributed lock. Open to ideas here.
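For concreteness, the two-step approach could look roughly like this (an untested sketch; file names and layout are just illustrative): let dask write through a LocalStore, then zip the directory contents so the entries sit at the archive root where ZipStore expects them.

import os
import zipfile

import dask.array as da
import zarr

dask_array = da.zeros((1,), dtype="float32")

# Step 1: let dask write concurrently to a directory store, which is safe.
local_store = zarr.storage.LocalStore("bug.zarr")
group = zarr.open_group(store=local_store, mode="w")
zarr_array = group.create_array(
    name="data",
    shape=dask_array.shape,
    dtype=dask_array.dtype,
)
da.to_zarr(dask_array, zarr_array)

# Step 2: zip the store contents in a single, serial pass.
# Entries are archived relative to the store root so zarr can read the zip directly.
with zipfile.ZipFile("bug.zarr.zip", mode="w") as zf:
    for root, _dirs, files in os.walk("bug.zarr"):
        for fname in files:
            path = os.path.join(root, fname)
            zf.write(path, arcname=os.path.relpath(path, "bug.zarr"))

# The archive can now be opened read-only as a ZipStore.
store_read = zarr.storage.ZipStore("bug.zarr.zip", mode="r", read_only=True)
group_read = zarr.open_group(store=store_read, mode="r")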
If it were a concurrency bug, then I'd expect it to also happen with non-dask writes to the ZipStore. In the full code where I originally hit the bug (by way of xarray), a large-ish but in-memory dataset would write successfully to a ZipStore if the chunking was left at default values (delegating it to zarr), but specifying manual chunks via xarray/dask caused this corruption.
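For concreteness, the xarray pattern looked roughly like this (a sketch; the toy dataset and chunk size are made up for illustration):

import numpy as np
import xarray as xr
import zarr

ds = xr.Dataset({"data": (("x",), np.zeros(1000, dtype="float32"))})

# In-memory data with chunking left to zarr: wrote and read back fine.
store_ok = zarr.storage.ZipStore("ok.zarr.zip", mode="w")
ds.to_zarr(store_ok)
store_ok.close()

# Same data with manual dask chunks: produced a corrupt zip.
store_bad = zarr.storage.ZipStore("bad.zarr.zip", mode="w")
ds.chunk({"x": 100}).to_zarr(store_bad)
store_bad.close()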
Additionally, this did work correctly in earlier versions of zarr, so I think there's a regression somewhere. In a colab notebook with zarr 2.16.1:
import dask.array as da
import numpy as np
import zarr
print(zarr.__version__)
# 2.16.1
import numcodecs
print(numcodecs.__version__)
# 0.12.1
dask_array = da.zeros((1,), dtype=np.float32)
store = zarr.storage.ZipStore("bug.zarr.zip", mode='w')
group = zarr.open_group(store=store, mode="w")
zarr_array = group.create(
    name="data",
    shape=dask_array.shape,
    dtype=dask_array.dtype,
    overwrite=True,
)
da.to_zarr(dask_array, zarr_array)
store.close()
store_read = zarr.storage.ZipStore("bug.zarr.zip", mode='r')
group_read = zarr.open_group(store=store_read, mode="r")
array_read = group_read["data"]
array_read[0]
# np.float32(0.0)
> Additionally, this did work correctly in earlier versions of zarr, so I think there's a regression somewhere. In a colab notebook with zarr 2.16.1:
That's good to know — I hope we can restore this functionality. Now I have no intuition for where the data corruption could be coming from, so hopefully someone who does can figure this out!
I was hoping that the parallelism hint would work out, but the current version's ZipStore.set() takes a lock before calling _set(), which performs the write, and 2.16.1's ZipStore.__setitem__ does essentially the same thing.
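Roughly, the pattern in both versions is the one below (paraphrased from memory, not a copy of the actual source), so plain thread-level races on the underlying ZipFile should already be excluded:

import threading
import zipfile

class ZipStoreSketch:
    """Illustrative sketch of the write path, not the real zarr class."""

    def __init__(self, path):
        self._lock = threading.Lock()
        self._zf = zipfile.ZipFile(path, mode="w")

    def set(self, key, value: bytes):
        # every write acquires the same lock before touching the underlying ZipFile
        with self._lock:
            self._zf.writestr(key, value)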
Unfortunately, I'm getting lost inside dask more deeply than I have time for right now. I can confirm that this still happens with the 'synchronous' scheduler, which should eliminate parallelism, and I can also track it down into how dask.array.blockwise maps load_store_chunk onto the zarr array. However, I cannot get the bug to trigger by calling load_store_chunk myself (even via dask.delayed); easy replication seems to require calling .compute() on a generated (even if trivial) dask graph.
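For reference, the synchronous-scheduler test was along these lines (a sketch; dask.config.set is just one way to force the single-threaded scheduler):

import dask
import dask.array as da
import numpy as np
import zarr

dask_array = da.zeros((1,), dtype=np.float32)
store = zarr.storage.ZipStore("bug.zarr.zip", mode="w")
group = zarr.open_group(store=store, mode="w")
zarr_array = group.create_array(
    name="data",
    shape=dask_array.shape,
    dtype=dask_array.dtype,
)

# no threads or processes involved, yet the resulting zip is still unreadable
with dask.config.set(scheduler="synchronous"):
    da.to_zarr(dask_array, zarr_array)
store.close()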
Edit to add: I'm definitely flummoxed. Even if I modify ZipStore to use a monkeypatched ZipFile that logs every call to _zf.writestr to the console, I see an identical sequence of writes between da.to_zarr (which corrupts) and a direct slice assignment (which succeeds).
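The instrumentation was roughly the following (a sketch that patches ZipFile.writestr globally rather than swapping the class inside ZipStore; the wrapper name is made up):

import zipfile

_original_writestr = zipfile.ZipFile.writestr

def _logging_writestr(self, zinfo_or_arcname, data, *args, **kwargs):
    # log the key and payload length of every entry written through any ZipFile
    print(f"writestr: {zinfo_or_arcname!r}, len={len(data)}")
    return _original_writestr(self, zinfo_or_arcname, data, *args, **kwargs)

zipfile.ZipFile.writestr = _logging_writestr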
Under the hood, dask.array.to_zarr() in the latest stable dask release (2025.11.0) uses the deprecated zarr.create() function.
Yesterday @melonora merged a fix into the dask codebase that switches to the new zarr.create_array() function, which supports zarr sharding: Support zarr sharding through create_array #12153
My workaround for now is to add dask arrays to a ZipStore by using zarr.create_array() in my code:
import zarr
import dask.array as da
arr = da.arange(20)
store = zarr.storage.ZipStore('data.zip', mode='w')
array = zarr.create_array(store=store, data=arr)
store.close() # otherwise you will get a corrupted file
print(array)
# output:
# <Array zip://data.zip shape=(20,) dtype=int64>
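For completeness, a read-back check along these lines should confirm the archive opens cleanly (sketch):

store_read = zarr.storage.ZipStore('data.zip', mode='r', read_only=True)
array_read = zarr.open(store=store_read, mode='r')
print(array_read[:])  # values 0..19
store_read.close()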
FYI, the PR was merged in time for the monthly dask release, which should happen either at the end of this week or the start of next week.
also FYI https://github.com/zarr-developers/zarr-python/pull/3612