zarr-python
ZipStore fails to handle scalar string arrays
Minimal, reproducible code sample, a copy-pastable example if possible
import zarr
import numpy as np
name = 'hello'
data = np.array('world', dtype='<U5')
store = zarr.ZipStore('test_store.zip', mode='w')
root = zarr.open(store, mode='w')
zarr_array = root.create_dataset(name, data=data, shape=data.shape, dtype=data.dtype)
zarr_array[...]
# zarr_array = root.create_dataset(name, shape=data.shape, dtype=data.dtype)
# root[name][...] = data
# zarr_array[...]
Problem description
Scalar coordinates are useful in xarray, and likely in other situations. Being able to serialize them with zarr in a ZipStore would be cool!
xref: https://github.com/pydata/xarray/issues/3815
I think this works in the typical directory store.
Version and installation information
Please provide the following:
- Value of zarr.__version__: 2.4.0
- Value of numcodecs.__version__: 0.6.4
- Version of Python interpreter: 3.7
- Operating system (Linux/Windows/Mac): linux
- How Zarr was installed (e.g., "using pip into virtual environment", or "using conda"): conda, conda-forge
Also, if you think it might be relevant, please provide the output from pip freeze or
conda env export depending on which was used to install Zarr.
Ah missed this was string related. Sorry about that. On the bright side this may be an easy resolution.
Basically we need an object_codec specified for things that are not bytes-like, which includes strings. There's a good example in this string section.
Thoughts @hmaarrfk? 🙂
I may be able to work on this stuff after October.
Thanks for looking into this with me.
Honestly, I legitimately might have to revisit this now.
For this, why is it not a problem with the standard store?
Shouldn't this be defined higher up, and not specifically related to the ZipStore?
I guess the correct location to put this is in normalize_dtype
diff --git a/zarr/util.py b/zarr/util.py
index 241009c..c432ed3 100644
--- a/zarr/util.py
+++ b/zarr/util.py
@@ -135,6 +135,9 @@ def normalize_chunks(chunks, shape, typesize):
 def normalize_dtype(dtype, object_codec):
+    # Ensure that all types of numpy unicode strings are treated as strings
+    if np.issubdtype(np.unicode_, dtype):
+        dtype = str
     # convenience API for object arrays
     if inspect.isclass(dtype):
         dtype = dtype.__name__
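For anyone trying out a check like the one in the patch, here is a small NumPy-only sketch of detecting fixed-width unicode dtypes. The helper name is_numpy_unicode is hypothetical and not part of zarr. It uses np.str_ because the np.unicode_ alias was removed in NumPy 2.0, and it passes the dtype under test as the first argument to np.issubdtype, since the two arguments are not interchangeable.

```python
import numpy as np


def is_numpy_unicode(dtype):
    """Hypothetical helper: True for NumPy fixed-width unicode dtypes such as '<U5'."""
    # The dtype under test goes first; the type to compare against goes second.
    return np.issubdtype(np.dtype(dtype), np.str_)


scalar = np.array('world', dtype='<U5')   # 0-d array, as in the report above
checks = {
    '<U5': is_numpy_unicode(scalar.dtype),
    'object': is_numpy_unicode(object),
    'S5': is_numpy_unicode('S5'),
    'float64': is_numpy_unicode(np.float64),
}
print(scalar.shape, checks)

# Argument order matters: np.str_ is a subtype of np.generic,
# but np.generic is not a subtype of np.str_.
order_matters = (
    np.issubdtype(np.str_, np.generic),
    np.issubdtype(np.generic, np.str_),
)
print(order_matters)
```

With the order used in the patch, the check can also match abstract types that merely sit above the unicode type in NumPy's hierarchy, which is worth keeping in mind when adapting it.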
Did you try using an object codec as noted here ( https://github.com/zarr-developers/zarr-python/issues/551#issuecomment-604231507 )? That's typically how we recommend handling Python objects (like str).
unfortunately, it ignores it because dtype != object
Recent work by @abergou may have improved the situation with object codecs.
Guessing that is referring to PR ( https://github.com/zarr-developers/zarr-python/pull/813 ) in Zarr 2.9.4+