zarr-python
Creating ZipStore with file-like object
The current ZipStore implementation assumes the input is always a path, which becomes limiting when dealing with an in-memory file or a virtual file object (e.g. in Pyodide in the browser). Virtual file object support is crucial for making the zarr library useful in the browser environment, since the Emscripten file system itself is rather limited at the moment.
It would be nice if we could allow passing a file-like object. I did a test with a small modification to the __init__ function and it seems to work nicely in Pyodide/JupyterLite:
import zipfile
from threading import RLock

import zarr


class InMemoryZipStore(zarr.ZipStore):

    def __init__(self, path, compression=zipfile.ZIP_STORED, allowZip64=True, mode='a',
                 dimension_separator=None):
        # store properties
        self.path = None  # TODO: this needs to be handled properly for os.PathLike or file-like objects
        self.compression = compression
        self.allowZip64 = allowZip64
        self.mode = mode
        self._dimension_separator = dimension_separator

        # Current understanding is that the zipfile module in the stdlib is not
        # thread-safe, and so locking is required for both read and write. However,
        # this has not been investigated in detail; perhaps no lock is needed if mode='r'.
        self.mutex = RLock()

        # open the zip file
        self.zf = zipfile.ZipFile(path, mode=mode, compression=compression,
                                  allowZip64=allowZip64)
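A minimal usage sketch for the class above (assumes a small array written into an io.BytesIO; the names are just for illustration):

import io
import zarr

buf = io.BytesIO()
store = InMemoryZipStore(buf, mode='w')
root = zarr.group(store=store)
z = root.zeros('foo/bar', shape=(100,), chunks=(10,), dtype='i4')
z[:] = 42
store.close()  # finalizes the zip central directory; `buf` now holds a complete archive
print(len(buf.getvalue()))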
I will maintain this piece of code somewhere for now, but it would be great if we could support this upstream and eventually have it in Pyodide.
Hey @oeway. Thanks for the suggestion! In your mind, is what we need the addition of an optional argument alongside path? Seems like the only possible downside of that would be if someone was expecting the exception from os.path.abspath(path) to detect None.
How about we rename the first argument path to file_or_path? Then we try to infer the path based on its type:
if isinstance(file_or_path, (str, os.PathLike)):
    self.path = file_or_path
else:
    # file-like object; a normal file object has a `name` attribute containing its path
    if hasattr(file_or_path, 'name'):
        self.path = file_or_path.name
    else:
        self.path = ""  # should the default be an empty string, or None?

self.zf = zipfile.ZipFile(file_or_path, mode=mode, compression=compression,
                          allowZip64=allowZip64)
I wouldn't suggest renaming since existing code could be using ZipStore(path=). If permitting a FileLike for something named "path" is too confusing, then perhaps we add a static initializer ZipStore.from_file(file, *args, **kwargs) (alternative names: wrap(), ...)
Good point! ZipStore.from_file sounds good to me!
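For the record, a rough sketch of what such an alternate constructor could look like (illustrative only, not an actual zarr API; methods that re-open the archive from self.path, as noted later in this thread, would still need further changes):

import zipfile
from threading import RLock

import zarr


class PatchedZipStore(zarr.ZipStore):

    @classmethod
    def from_file(cls, file, compression=zipfile.ZIP_STORED, allowZip64=True,
                  mode='a', dimension_separator=None):
        # build the instance without going through __init__, which expects a filesystem path
        store = cls.__new__(cls)
        store.path = getattr(file, 'name', None)
        store.compression = compression
        store.allowZip64 = allowZip64
        store.mode = mode
        store._dimension_separator = dimension_separator
        store.mutex = RLock()
        store.zf = zipfile.ZipFile(file, mode=mode, compression=compression,
                                   allowZip64=allowZip64)
        return store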
Think it would be ok to just accept ZipFile instances in path. We can just clarify the docs on this point.
We have similar flexibility in Dask.
As a side note, we might consider using positional-only and keyword-only arguments in Zarr 3 to avoid these issues around renaming arguments.
Issue created, @jakirkham.
Think it would be ok to just accept ZipFile instances in path.
So the logic would be:
- is_zip_file()
- is_file_like()
- is_path_or_string_like()
Or would you just pass a ZipFile, @oeway ?
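For illustration, those checks could be as simple as the following (the helper names are hypothetical, not existing zarr functions):

import io
import os
import zipfile


def is_zip_file(obj):
    return isinstance(obj, zipfile.ZipFile)


def is_path_or_string_like(obj):
    return isinstance(obj, (str, bytes, os.PathLike))


def is_file_like(obj):
    return isinstance(obj, io.IOBase) or hasattr(obj, 'read')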
Hi, in my case it's a fake file object that makes HTTP requests to a file with a Range header, so not a ZipFile.
Thanks Josh! 😄
Intriguing! Would be curious to look at the object if you don't mind sharing 🙂
Does it support one of the file ABCs? Or could it? If so, then it would be pretty easy to check if it is an instance of one of those and use it.
Sure, here it is: https://github.com/imjoy-team/imjoy-rpc/blob/af739ec829d984da35bc5b87b93aa1a553944fe3/python/imjoy_rpc/utils.py#L672-L842
It's a class that inherits from io.IOBase; I guess fsspec also has something similar.
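For context, a stripped-down illustration of that kind of object (not the imjoy-rpc implementation; it assumes the server reports Content-Length and honors Range requests):

import io
import urllib.request


class HTTPRangeFile(io.RawIOBase):
    """Minimal read-only, seekable file over HTTP Range requests (illustrative sketch)."""

    def __init__(self, url):
        self.url = url
        self.pos = 0
        # HEAD request to learn the total size
        req = urllib.request.Request(url, method='HEAD')
        with urllib.request.urlopen(req) as resp:
            self.size = int(resp.headers['Content-Length'])

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        elif whence == io.SEEK_END:
            self.pos = self.size + offset
        return self.pos

    def read(self, size=-1):
        if size is None or size < 0:
            size = self.size - self.pos
        if size <= 0 or self.pos >= self.size:
            return b''
        end = min(self.pos + size, self.size) - 1
        req = urllib.request.Request(self.url, headers={'Range': f'bytes={self.pos}-{end}'})
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        self.pos += len(data)
        return data

Since such an object inherits from io.IOBase, an isinstance check as discussed above would cover it, and zipfile.ZipFile can read from it directly.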
cc @martindurant (in case this is of interest especially given our ZIP discussion recently)
I would point out that fsspec supports a zarr-compatible key-value store over any fs it can instantiate, including ZIP and in-memory. You can pass these directly to zarr.
# this bit just for example
data = fsspec.filesystem("http").cat("https://github.com/fsspec/filesystem_spec/archive/refs/heads/master.zip")
fsspec.filesystem("memory").pipe("mem.zip", data)
# make storage mapper like this
store = fsspec.get_mapper("zip::memory://mem.zip")
Obviously the thing above won't work in Pyodide, since the HTTP part uses aiohttp - I should have said that. But you can replace this part or wait until I write HTTP-for-pyodide (maybe next week?!).
The separate issue I was discussing with @jakirkham is that accessing it this way goes through a single file, so access to zarr chunks would always be serial. With kerchunk, we can index the ZIP (i.e., translate the existing index embedded in the file) and attain concurrent access to zarr chunks. This is pretty simple but not yet done; it should work in pyodide/pyscript too.
I was just looking at the same problem. Creating a store from another ZipFile, or from a file object (e.g. io.BytesIO), would be super useful.
I was actually looking at potentially contributing this, but there are a number of places in ZipStore where .zf is recreated from self.path again. That may be an issue if an external file (instead of a path) is provided, no?
FSStore already allows for this. In fsspec, you can pass a URL like "zip::s3://bucket/file.zip", or you can pass explicit arguments to the ZIP backend if you like. You don't need to rewrite ZipStore.
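For example (a sketch reusing the get_mapper approach shown earlier in this thread; the bucket path is a placeholder, and the s3 protocol assumes s3fs is installed):

import fsspec
import zarr

store = fsspec.get_mapper("zip::s3://bucket/file.zip")
root = zarr.open_group(store, mode="r")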
Hi @martindurant, thanks so much for your response!
I have been looking at the documentation for fsspec, and the "zip" filesystem is documented as being read-only: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.zip.ZipFileSystem. I was able to successfully read from an fsspec-based zip file store, but according to my reading of the docs, I would not be able to create a new in-memory zip file with the combination of fsspec and zarr.
But perhaps I'm misunderstanding something, in which case your feedback is much appreciated!
Thank you.
Good point. There's no principled reason that ZipFileSystem should be read-only, except that writing would be a terrible idea on remote key-value storage (every update would need to rewrite the file). However, it'd work fine for local, in-memory, or cached-to-remote files.
You can try!
--- a/fsspec/implementations/zip.py
+++ b/fsspec/implementations/zip.py
@@ -44,10 +44,8 @@ class ZipFileSystem(AbstractArchiveFileSystem):
a string.
"""
super().__init__(self, **kwargs)
- if mode != "r":
- raise ValueError("Only read from zip files accepted")
if isinstance(fo, str):
- files = open_files(fo, protocol=target_protocol, **(target_options or {}))
+ files = open_files(fo, mode=mode+"b", protocol=target_protocol, **(target_options or {}))
if len(files) != 1:
raise ValueError(
'Path "{}" did not resolve to exactly'
@@ -55,7 +53,7 @@ class ZipFileSystem(AbstractArchiveFileSystem):
)
fo = files[0]
self.fo = fo.__enter__() # the whole instance is a context
- self.zip = zipfile.ZipFile(self.fo)
+ self.zip = zipfile.ZipFile(self.fo, mode=mode)
self.block_size = block_size
self.dir_cache = None
Should fail on any file object that doesn't allow seek while writing.
https://github.com/fsspec/filesystem_spec/pull/1017
Just a note, @oeway: I think we could just test for io.IOBase then. Am guessing this is a 4-5 line change here. Namely: check the type, use it, and protect against closing user-provided files (as presumably users would want to handle closing themselves).
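The relevant lines inside ZipStore.__init__ might then look something like this (a sketch building on the snippets earlier in this thread; the _owns_file flag is illustrative and not part of zarr):

# requires `import io` at module level; `path` may now be a path or an already-open file
if isinstance(path, io.IOBase):
    self.path = getattr(path, 'name', None)
    self._owns_file = False   # the caller opened the file, so the caller closes it
    file_or_path = path
else:
    self.path = file_or_path = os.path.abspath(path)
    self._owns_file = True
self.zf = zipfile.ZipFile(file_or_path, mode=mode, compression=compression,
                          allowZip64=allowZip64)

Note that zipfile.ZipFile only closes files it opened from a path itself, so a user-provided file already survives ZipStore.close(); the flag just makes the ownership explicit.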
Cross posting a related SO post and my answer there. https://stackoverflow.com/questions/74127357/how-to-create-and-return-a-zarr-file-from-xarray-dataset/74148410?noredirect=1#comment132344162_74148410
I wish I had seen this issue first 🤷. But in the end, I came up with a similar solution to @oeway.
Hey @jhamman thanks for pointing me to this issue, here from SO 👋
In the meantime, I tried the following and it seems to do the trick:
import zarr
import tempfile
import os
mem_bytes = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')

with tempfile.SpooledTemporaryFile(max_size=int(mem_bytes + 1), dir=TMP_PLACEHOLDER, suffix='zip') as f:
    with tempfile.NamedTemporaryFile() as f2:
        # where `data` is an xarray dataset
        data.to_zarr(zarr.ZipStore(f2.name), mode='w')
        return f2.read()
Although I'm not 100% certain that this never writes to disk somewhere (I've had instances in the past where tempfile behaved in a weird manner and wrote to disk in a place I did not expect), from my observations it does not.
Note that max_size is set to be greater than all of the memory on the machine, so it should never spill to disk. The data sets I'm working with currently aren't large enough to test the bounds, but I'd assume a try/except block would cover that scenario.