Resizing an array in a ZipStore leads to many duplicate entries for zarr.json in the zip file.
Zarr version
3.1.3
Numcodecs version
0.16.3
Python Version
3.14.0
Operating System
Linux
Installation
Pip inside a conda environment
Description
The behavior I observe: when I append items to an array that is stored in a ZipStore, I get `UserWarning: Duplicate name: 'zarr.json'`, and the *.zip file on disk ends up containing duplicate entries for `zarr.json`.
This behavior is problematic for the following reasons:
- The *.zip viewer of my desktop environment (KDE's Ark), and possibly others, does not show the duplicate entries for `zarr.json`, but only the earliest version. This made debugging the issue quite a mystery adventure, because I did not even know that this was possible before.
- Only when I copied the *.zip file to a Windows machine did I even see that there are multiple entries.
- Storing all those duplicates in the *.zip file is not only epically confusing, but also a (small) waste of space.
The expected behavior would be that appending to the array does not trigger this warning, and that the old version of `zarr.json` is properly deleted/overwritten, so that there really is only a single entry for that file name.
This expected behavior is better because it makes the aforementioned confusion impossible, saves space, and properly resolves the justified warning.
This issue seems to be closely related to, but not equivalent to, https://github.com/zarr-developers/zarr-python/issues/129, https://github.com/Deltares/imod-python/pull/1706 and https://github.com/Deltares/imod-python/issues/1707. I apologize if the problem I describe here is just a subset of those, but from skimming those issues I could not reliably tell whether the people there had realized that specifically `zarr.json` is affected.
Steps to reproduce
# /// script
# requires-python = ">=3.14"
# dependencies = [
# "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# "numpy"
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues
import zipfile
import numpy
import zarr.storage
from zarr.storage import ZipStore
out_path = "test.zip"
n = 13
with ZipStore(out_path, mode='w', read_only=False, compression=zipfile.ZIP_STORED, allowZip64=False) as store:
    shape = (42, 42)
    array = zarr.create_array(store=store, shape=(7, *shape), dtype='uint8')
    # Now we want to append elements to the array one by one, which is a very common usage pattern
    # if you simply cannot know the length of the input beforehand.
    for idx in range(n):
        # This is just some dummy data. In actual use cases, this might be image data coming from a webcam.
        data = numpy.zeros(shape, dtype=numpy.uint8)
        # Maybe we were able to pre-estimate the length of the input.
        # But estimates cannot be guaranteed to be correct.
        if idx < array.shape[0]:
            array[idx] = data
        else:
            # I think the following line is the one that raises warnings.
            array.append(data.reshape(1, *shape), axis=0)
    # This assertion always passes.
    assert array.shape == (n, *shape)
with zipfile.ZipFile(out_path, 'r') as zf:
    with zf.open('zarr.json') as metadata_file:
        # The following assertion checks if the expected array length is found in the metadata file.
        # The assertion may pass, if nondeterministically the latest of the many duplicate versions of
        # zarr.json ends up being read:
        assert f"{n}," in metadata_file.read().decode("utf-8")
    zarr_info_versions = [info for info in zf.infolist() if info.filename == 'zarr.json']
    # The following assertion is going to fail:
    assert len(zarr_info_versions) == 1
# If you manually inspect the zip file, though, you will either see only one single entry for `zarr.json`,
# which is your archive viewer lying to you, or you will see multiple different versions of `zarr.json`.
# zarr.print_debug_info()
Additional output
No response
thanks for this report! @mkitti do you have any ideas for what zarr-python is doing wrong here?
BTW, a workaround that I am likely going to use is: The ZipInfo objects provided by Python's zipfile package allow me to retroactively eliminate the duplicates of zarr.json, except for the latest one. This will neutralize the bug in my use case until it is properly resolved.
zarr-python uses the Python standard library `zipfile`, which apparently does not remove files from the central directory.
https://github.com/python/cpython/issues/47073
If you call `infolist()` you get a direct reference to the `ZipFile`'s `filelist` attribute. You could filter that list in place by removing the previous `ZipInfo` entry for the file you are replacing before closing the file.
The old zarr.json and chunks will still take up space within the file when they are overwritten, but at least the canonical versions will be listed in the central directory.
That said, I do not recommend writing to Zarr within a zipfile.
Thanks for looking into this, @mkitti :-)
I do not quite understand that last sentence: "Writing to Zarr within a zipfile". Is this a typo?
In the meantime, I found out that Python's zipfile might actually soon support removing entries from zip files: http://github.com/python/cpython/pull/134627 . Until that is the case, though, the workaround I suggested above will not (easily) work.
I have thus come up with a different workaround: I can subclass zarr.storage.ZipStore and intercept any updates to the zarr.json key, such that it is only actually written when the store is closed.
In the following I will show the code I use for that. Take it with a grain of salt, though: While it does demonstrate how the problem can be worked around, and even hints at how it might be fixed, I am fairly new to zarr and intend to make only my narrow use cases work with it. I do not know if simply checking for zarr.json is enough, or if storing multiple arrays in one zip store might require the code to become a little more sophisticated. Also, the code I am showing here might easily stop working if the implementation of zarr.storage.ZipStore changes in the future. So this is not a very robust solution and at the very least should be protected by some test cases. Another problem is that ZipStore.__getstate__ and ZipStore.__setstate__ will automatically pick up the new _deferred field that I am introducing here. Maybe that will work just fine, but I have confidently omitted testing this:
from zarr.storage import ZipStore
class AppendableZipStore(ZipStore):
    """
    This class is supposed to be a drop-in replacement for zarr.storage.ZipStore.
    It works around the problem described in https://github.com/zarr-developers/zarr-python/issues/3580 .
    """

    # If one calls Array.append on arrays stored in a ZipStore, the entry "zarr.json" needs to be overwritten
    # to update the information on the array shape.
    # However, the ZipStore does not actually overwrite entries in the zip file
    # (likely because Python's zipfile does not support this yet, see http://github.com/python/cpython/pull/134627)
    # and will create duplicates instead.
    # This is very problematic, because it raises warnings from Python's zipfile module and
    # because the duplicate entries make the zip file confusing to interpret, especially if one has no knowledge of Zarr.
    # In addition, some archive viewers, like KDE's Ark, do not actually resolve the duplicates and simply pretend
    # that only one of them exists, which can make it impossible for the user to even notice that they are looking
    # at just one of many duplicates.
    # This is why zarr.json is on the list of entries that will be deferred until closure of the store.
    __to_defer = ["zarr.json"]

    def __init__(self, *largs, **kwargs):
        super().__init__(*largs, **kwargs)
        # Some keys will only be written out to disk when the store is finally closed.
        self._deferred = {}

    def _get(self, key, prototype, byte_range):
        # ASSUMPTION: This is called under self._lock.
        try:
            return self._deferred[key]
        except KeyError:
            return super()._get(key, prototype, byte_range)

    def _set(self, key, value):
        # ASSUMPTION: This is called under self._lock.
        if key in AppendableZipStore.__to_defer:
            self._deferred[key] = value
        else:
            super()._set(key, value)

    async def clear(self):
        with self._lock:
            self._deferred.clear()
        # The superclass acquires self._lock itself, so it is called outside the lock here.
        return await super().clear()

    def close(self):
        # Flush the deferred entries to the underlying zip file exactly once, then close.
        with self._lock:
            for key, buffer in self._deferred.items():
                super()._set(key, buffer)
            self._deferred.clear()
        # The superclass acquires self._lock itself, so it is called outside the lock here.
        super().close()
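For reference, a minimal usage sketch (assuming the subclass above behaves as described, it can be used as a drop-in for ZipStore in the reproduction script):

import zipfile
import numpy
import zarr

with AppendableZipStore("test.zip", mode='w', compression=zipfile.ZIP_STORED) as store:
    array = zarr.create_array(store=store, shape=(1, 42, 42), dtype='uint8')
    for idx in range(13):
        frame = numpy.zeros((42, 42), dtype=numpy.uint8)
        if idx < array.shape[0]:
            array[idx] = frame
        else:
            array.append(frame.reshape(1, 42, 42), axis=0)
# zarr.json is now written exactly once, when the store is closed.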
> That said, I do not recommend writing to Zarr within a zipfile.
Trying to update a zarr hierarchy or data in a zarr array while it is in a zip file is a bad idea. Zip files are not really designed for this.
As you can see, there is a duplication issue. Looking for a prior zarr.json in the same place depends on the number of files in the zip archive.
The pull request you cited adds a remove method, which only removes the entry from the central directory but not from the file itself. A repack is required to do that.
In everything but the most trivial cases, you would probably be better off extracting the contents, manipulating the zarr hierarchy on the file system, and then rezipping it up.
I see, makes sense. What about my second workaround: Is deferring the write to zarr.json until store closure a viable option? It does function as a workaround in my use case, but I'd imagine that it could also solve the problem for good if it were integrated into zarr.storage.ZipStore itself.
> Is deferring the write to zarr.json until store closure a viable option?
I think the problem you are having with zarr.json also holds for chunk files. We could cache everything until store closure, but this introduces a fair number of complications.
Bigger picture: why do you need to mutate the array metadata document?
My recommendation here would be to load Zarr array into memory or unzip the Zarr depending on the size, manipulate it in that form (in memory or unzipped), then resave as a zip file if desired.
Even if we improve the implementation by correctly manipulating the central directory, you will need to "repack" the zip file anyways to reduce storage space.
Deleting, replacing, or updating a file in a zip file does not free the space that the old file took up. It just appends the new file to the archive.
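As a rough sketch of that workflow (paths are illustrative; standard library plus zarr only):

import shutil
import zipfile
import numpy
import zarr

# 1. Extract the existing zip archive to a working directory.
with zipfile.ZipFile("recording.zip") as zf:
    zf.extractall("recording_unzipped")

# 2. Manipulate the Zarr hierarchy on the file system (append, resize, rewrite metadata, ...).
array = zarr.open_array("recording_unzipped", mode="r+")
array.append(numpy.zeros((1, 42, 42), dtype=numpy.uint8), axis=0)

# 3. Rezip the result; the archive is rewritten once, without duplicate entries or wasted space.
shutil.make_archive("recording_repacked", "zip", "recording_unzipped")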
Regarding the bigger picture: I work in a visual computing lab. A single multi-view recording with one of our various lab setups easily produces terabytes of data and hundreds of thousands of frames. It has been decided that for some of our use cases we are going to store image files in zip files. (Please take for granted that we know video compression inside-out and that large-scale research projects can produce legit use cases for zipping image files nevertheless.)
Being able to stream frame sequences (without knowing their lengths beforehand) into a file, without having to rewrite that file a second time, is highly desirable for many kinds of processing that we do. We are not interested in modifying a frame sequence after it has been written. We want to write it once, in order, and be done. So we do not even intend to mutate the array metadata document. But since Zarr writes zarr.json out to the zip file immediately upon array creation, it must update it every time I extend the array by another frame. Just delaying the write of the .json file to the very end of the storage process (after the last frame has been written) solves this. The chunks, on the other hand, are written exactly once and never changed again, so those cannot cause any duplicate zip file entries in this case.
I completely agree with @mkitti: it is wrong to delete entries from zip files, and/or to repack them. This ceased to be my intention 3 days ago already (see https://github.com/zarr-developers/zarr-python/issues/3580#issuecomment-3517470864), precisely because I do understand how zip files work.
What I am suggesting, though, is to make the Zarr store delay writing the zarr.json file, as implemented in https://github.com/zarr-developers/zarr-python/issues/3580#issuecomment-3517470864. I have already verified that this is a good solution for my use case, and I am asking whether it might make sense to implement this behavior in Zarr properly (maybe with an optional parameter to the ZipStore constructor?), to open it up to more use cases where one wants to write data in a streaming fashion.
thanks for that context, I can see how streaming data introduces the requirement to write zarr.json at the end.
I wonder if the current design of the Array class is actually the root of the problem here. When you create an array, the first thing we do is write an array metadata document to storage. This means you write the metadata document before writing any chunks, but the upside is that the stored representation is always a valid zarr array. But there are situations like streaming data where the order of the IO matters, and it would be better to write the chunks before writing metadata.
I would rather not introduce new changes to the IO patterns of the Array class, but we could introduce new lower-level APIs. What would work for the use case here would be two functions like this:
write_array(array_value, region, metadata, store) # stores an in-memory array to a particular region of a zarr array, given zarr array metadata
write_metadata(metadata, store) # writes the zarr array metadata to storage
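To make that concrete, a rough usage sketch for the streaming case; write_array and write_metadata are the hypothetical functions proposed above (they do not exist in zarr-python today), and acquire_frames and the exact argument shapes are illustrative only:

from zarr.storage import ZipStore

with ZipStore("stream.zip", mode='w') as store:
    metadata = ...  # zarr array metadata (dtype, chunk shape, ...) prepared up front
    n_frames = 0
    for frame in acquire_frames():  # frames arrive one by one, total count unknown
        # hypothetical: write one frame as a chunk into the region starting at index n_frames
        write_array(frame, region=n_frames, metadata=metadata, store=store)
        n_frames += 1
    # The metadata document is written exactly once, after the last chunk:
    write_metadata(metadata, store)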
Would an API like this be helpful @gfhcs?
As for changing the behavior of the built-in ZipStore, I'm not sure special-casing zarr.json and only writing it at the end is going to work for everyone, because there might be people who rely on the store being readable while chunks are being written (an effect of the current "metadata, then chunks" behavior). Not to mention the complication of baking zarr v3 logic into the store API, which has so far been zarr-format-agnostic.
> Regarding the bigger picture: I work in a visual computing lab. A single multi-view recording with one of our various lab setups easily produces terabytes of data and hundreds of thousands of frames.
I also work in such an environment. Shards are a better answer to this than zip files in my opinion and much easier to create at acquisition.
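For example, a rough sketch of a sharded layout (keyword names follow zarr-python 3.x create_array and may need adjusting): one frame per chunk, with many frames grouped into a single shard file, so you still get few large files on disk without zipping anything:

import numpy
import zarr

array = zarr.create_array(
    store="recording.zarr",
    shape=(1000, 1080, 1920),   # frame count and resolution are illustrative
    chunks=(1, 1080, 1920),     # one frame per chunk
    shards=(100, 1080, 1920),   # 100 frames per shard file on disk
    dtype="uint8",
)
# Writing a full shard's worth of frames at a time keeps each shard file write-once.
array[0:100] = numpy.zeros((100, 1080, 1920), dtype=numpy.uint8)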
> I have thus come up with a different workaround: I can subclass zarr.storage.ZipStore and intercept any updates to the zarr.json key, such that it is only actually written when the store is closed.
In general, I do not think this is a good idea. I think it is fine having multiple zarr.json files in the zip archive. We want to write this eagerly. A software failure where the zarr.json never gets written seems like a more problematic scenario to me.
The core bug here is that there are multiple entries in the central directory. I would focus on editing the central directory on close. As long as the central directory does not have duplicate zarr.json entries, then everything else should be fine. Zip readers should only be reading the central directory as the canonical source of the contents of the zip file.
Rather than caching the zarr.json until close, I would just edit Python's ZipFile.filelist in place by obtaining it from ZipFile.infolist() just before closing. Specifically, I would find the last entry of every file in the list and retain just those. Also, while editing the central directory I would probably take the opportunity to put the zarr.json metadata files first, followed by a sorted list of the rest of the files.
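A rough sketch of that idea, as a function one could call on the underlying zipfile.ZipFile just before it is closed (the function name is illustrative):

import zipfile

def dedupe_central_directory(zf: zipfile.ZipFile) -> None:
    # infolist() returns the ZipFile's filelist, i.e. the ZipInfo entries that will be
    # written to the central directory when the archive is closed.
    last_entry = {}
    for info in zf.infolist():
        last_entry[info.filename] = info  # later duplicates overwrite earlier ones
    metadata = sorted(
        (info for info in last_entry.values() if info.filename.endswith("zarr.json")),
        key=lambda info: info.filename,
    )
    others = sorted(
        (info for info in last_entry.values() if not info.filename.endswith("zarr.json")),
        key=lambda info: info.filename,
    )
    # Keep only the last entry per name: metadata documents first, then the rest, sorted.
    zf.filelist[:] = metadata + others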
See also https://ngff.openmicroscopy.org/rfc/9/index.html#ome-zarr-zip-files