
Writing dask arrays to a ZipStore causes a corrupt zip file

Open • csubich opened this issue 3 months ago • 7 comments

Zarr version

3.1.3

Numcodecs version

0.16.1

Python Version

3.11.3

Operating System

Linux

Installation

Via binder, with xarray's blank_template.ipynb documentation example

Description

When dask provides the data for a zarr array that is backed by a ZipStore, the ZipStore on disk becomes corrupt and cannot be read back in, throwing a BadZipFile exception. This corruption does not happen with a LocalStore. Nontrivial sharding/chunking is not required.

Steps to reproduce

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
#   "dask[array]",
#   "numpy",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues

# your reproducer code
import dask.array as da
import numpy as np
import zarr

dask_array = da.zeros((1,),dtype=np.float32)

store = zarr.storage.ZipStore("bug.zarr.zip",mode='w',read_only=False)
# store = zarr.storage.LocalStore("bug.zarr",read_only=False) # Does not error
group = zarr.open_group(store=store, mode="w")
zarr_array = group.create_array(
    name="data",
    shape=dask_array.shape,
    dtype=dask_array.dtype,
    overwrite=True,
)

da.to_zarr(dask_array, zarr_array)
store.close()

store_read = zarr.storage.ZipStore("bug.zarr.zip",mode='r',read_only=True)
# store_read = zarr.storage.LocalStore("bug.zarr",read_only=True) # Does not error
group_read = zarr.open_group(store=store_read, mode="r")
# Error:
## ---------------------------------------------------------------------------
## BadZipFile                                Traceback (most recent call last)
## [...]
## BadZipFile: Bad magic number for file header

array_read = group_read["data"]
array_read[0] # Should be array(0., dtype=float32)

# zarr.print_debug_info()

Additional output

No response

csubich • Oct 10 '25

I am not an expert in the zip archive format, but my guess is that it is not safe to write concurrently. I think if you want to create zipped zarr data via dask, the best approach is probably to use dask to concurrently write data to local storage, and then zip it in a separate step.
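
For illustration, a rough sketch of that two-step approach on the toy array from the reproducer above (the file names are placeholders, and this is untested against this exact setup): dask writes in parallel to a directory store, and the directory is zipped serially afterwards.

# Sketch only: write with dask to a LocalStore, then zip the directory.
import os
import zipfile

import dask.array as da
import zarr

dask_array = da.zeros((1,), dtype="float32")

# Step 1: let dask write concurrently to a directory on disk.
local_store = zarr.storage.LocalStore("bug.zarr")
group = zarr.open_group(store=local_store, mode="w")
zarr_array = group.create_array(
    name="data",
    shape=dask_array.shape,
    dtype=dask_array.dtype,
)
da.to_zarr(dask_array, zarr_array)

# Step 2: zip the directory contents in a single, serial pass.
with zipfile.ZipFile("bug.zarr.zip", mode="w") as zf:
    for root, _dirs, files in os.walk("bug.zarr"):
        for fname in files:
            path = os.path.join(root, fname)
            zf.write(path, arcname=os.path.relpath(path, "bug.zarr"))

# The resulting archive opens cleanly as a ZipStore.
readback = zarr.open_group(
    store=zarr.storage.ZipStore("bug.zarr.zip", mode="r", read_only=True),
    mode="r",
)
print(readback["data"][0])  # 0.0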

I'm not sure how we can guard against this corruption at runtime without implementing a distributed lock. Open to ideas here.

d-v-b • Oct 10 '25

If it were a concurrency bug, then I'd expect it to also happen with non-dask writes to the ZipStore. In the original, larger code where I discovered the bug (by way of xarray), a largish but in-memory dataset would write successfully to a ZipStore if the chunking was left at its default values (delegated to zarr), but specifying manual chunks via xarray/dask caused this corruption.

Additionally, this did work correctly in earlier versions of zarr, so I think there's a regression somewhere. In a colab notebook with zarr 2.16.1:

import dask.array as da
import numpy as np
import zarr

print(zarr.__version__)
# 2.16.1

import numcodecs
print(numcodecs.__version__)
# 0.12.1

dask_array = da.zeros((1,),dtype=np.float32)

store = zarr.storage.ZipStore("bug.zarr.zip",mode='w')
group = zarr.open_group(store=store, mode="w")
zarr_array = group.create(
    name="data",
    shape=dask_array.shape,
    dtype=dask_array.dtype,
    overwrite=True,
)

da.to_zarr(dask_array, zarr_array)
store.close()

store_read = zarr.storage.ZipStore("bug.zarr.zip",mode='r')
group_read = zarr.open_group(store=store_read, mode="r")

array_read = group_read["data"]
array_read[0] 
# np.float32(0.0)

csubich • Oct 10 '25

> Additionally, this did work correctly in earlier versions of zarr, so I think there's a regression somewhere. In a colab notebook with zarr 2.16.1:

That's good to know, and I hope we can restore this functionality. I myself have no intuition for where the data corruption could be coming from, so hopefully someone who does can figure this out!

d-v-b • Oct 10 '25

I was hoping that the parallelism hint would pan out, but the current version's ZipStore.set() takes a lock before calling _set(), which performs the write, and 2.16.1's ZipStore.__setitem__ does essentially the same thing.
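
(A minimal standalone toy with invented names, not taken from the zarr-python source, just to illustrate the locking pattern being described:)

# Toy illustration only: every write into the backing ZipFile is funneled
# through a single lock, so writes are fully serialized.
import io
import threading
import zipfile

class LockedZipWriter:
    def __init__(self, buffer):
        self._lock = threading.Lock()
        self._zf = zipfile.ZipFile(buffer, mode="w")

    def set(self, key, value):
        with self._lock:  # one writer at a time, as in both zarr versions
            self._set(key, value)

    def _set(self, key, value):
        self._zf.writestr(key, value)

writer = LockedZipWriter(io.BytesIO())
writer.set("data/zarr.json", b"{}")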

Unfortunately, I'm getting lost inside dask more deeply than I have time for right now. I can confirm that this still happens with the 'synchronous' scheduler, which should eliminate parallelism, and I can also trace it down to where dask.array.blockwise maps load_store_chunk onto the zarr array. However, I cannot get the bug to trigger by calling load_store_chunk myself (even via dask.delayed); easy replication seems to require calling .compute() on a generated (even if trivial) dask graph.
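
For concreteness, one way to force that single-threaded run (the exact invocation here is an assumption, and the store name is a placeholder):

import dask
import dask.array as da
import zarr

# Same setup as the reproducer at the top of the issue, but with the
# single-threaded "synchronous" scheduler forced so no parallelism remains;
# the corruption reportedly still occurs in this configuration.
dask_array = da.zeros((1,), dtype="float32")
store = zarr.storage.ZipStore("bug_sync.zarr.zip", mode="w", read_only=False)
group = zarr.open_group(store=store, mode="w")
zarr_array = group.create_array(
    name="data", shape=dask_array.shape, dtype=dask_array.dtype
)

with dask.config.set(scheduler="synchronous"):
    da.to_zarr(dask_array, zarr_array)
store.close()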

Edit to add: I'm definitely flummoxed. Even if I modify ZipStore to use a monkeypatched ZipFile that logs every call to _zf.writestr to the console, I see an identical sequence of writes between da.to_zarr (which corrupts) and a direct slice assignment (which succeeds).
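
A sketch of that kind of instrumentation, written here as a global patch on zipfile.ZipFile.writestr rather than a modification of ZipStore itself (the original hook point is not reproduced):

# Logs every ZipFile.writestr call so the write sequences produced by
# da.to_zarr and by a direct slice assignment can be compared.
import zipfile

_original_writestr = zipfile.ZipFile.writestr

def _logged_writestr(self, zinfo_or_arcname, data, *args, **kwargs):
    print(f"writestr: {zinfo_or_arcname!r} ({len(data)} bytes)")
    return _original_writestr(self, zinfo_or_arcname, data, *args, **kwargs)

zipfile.ZipFile.writestr = _logged_writestr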

csubich • Oct 10 '25

Under the hood, the dask.array.to_zarr() function in the latest stable dask release (2025.11.0) uses the deprecated zarr.create() function.

Yesterday @melonora merged a fix into the dask codebase that switches it to the new zarr.create_array() function, which supports zarr sharding: Support zarr sharding through create_array #12153

My workaround for now is to add dask arrays to a ZipStore by using zarr.create_array() in my code:

import zarr 
import dask.array as da

arr = da.arange(20)
store = zarr.storage.ZipStore('data.zip', mode='w')
array = zarr.create_array(store=store, data=arr)
store.close() # otherwise you will get a corrupted file 

print(array)

# output: 
# <Array zip://data.zip shape=(20,) dtype=int64>

fligt • Dec 11 '25

FYI, the PR got merged in time for the monthly dask release, which should happen either at the end of this week or the start of next week.

melonora • Dec 11 '25

also FYI https://github.com/zarr-developers/zarr-python/pull/3612

melonora • Dec 11 '25