ZipStore fails to handle scalar string arrays

Open hmaarrfk opened this issue 5 years ago • 9 comments

Minimal, reproducible code sample, a copy-pastable example if possible

import zarr
import numpy as np
name = 'hello'
data = np.array('world', dtype='<U5')
store = zarr.ZipStore('test_store.zip', mode='w')
root = zarr.open(store, mode='w')
zarr_array = root.create_dataset(name, data=data, shape=data.shape, dtype=data.dtype)
zarr_array[...]

# zarr_array = root.create_dataset(name, shape=data.shape, dtype=data.dtype)
# root[name][...] = data
# zarr_array[...]

Problem description

Scalar values are useful as coordinates in xarray, and likely in other situations too. Being able to serialize them to zarr in a ZipStore would be cool!

xref: https://github.com/pydata/xarray/issues/3815

I think this works in the typical directory store.

Version and installation information

Please provide the following:

  • Value of zarr.__version__: 2.4.0
  • Value of numcodecs.__version__: 0.6.4
  • Version of Python interpreter: 3.7
  • Operating system (Linux/Windows/Mac): linux
  • How Zarr was installed (e.g., "using pip into virtual environment", or "using conda"): conda, conda-forge

Also, if you think it might be relevant, please provide the output from pip freeze or conda env export depending on which was used to install Zarr.

hmaarrfk avatar Mar 26 '20 04:03 hmaarrfk

Ah missed this was string related. Sorry about that. On the bright side this may be an easy resolution.

Basically we need an object_codec specified for things that are not bytes-like, which includes strings. There's a good example in this string section.

jakirkham avatar Mar 26 '20 05:03 jakirkham

Thoughts @hmaarrfk? 🙂

jakirkham avatar Aug 28 '20 23:08 jakirkham

I may be able to work on this stuff after October.

Thanks for looking into this with me.

hmaarrfk avatar Aug 29 '20 03:08 hmaarrfk

Honestly, I legitimately might have to revisit this now.

For this, why is it not a problem with the standard store?

Shouldn't this be defined higher up, rather than being specific to the ZipStore?

hmaarrfk avatar Aug 31 '20 18:08 hmaarrfk

I guess the correct location to put this is in normalize_dtype

diff --git a/zarr/util.py b/zarr/util.py
index 241009c..c432ed3 100644
--- a/zarr/util.py
+++ b/zarr/util.py
@@ -135,6 +135,9 @@ def normalize_chunks(chunks, shape, typesize):
 
 def normalize_dtype(dtype, object_codec):
 
+    # Ensure that all types of numpy unicode strings are treated as strings
+    if np.issubdtype(np.unicode_, dtype):
+        dtype = str
     # convenience API for object arrays
     if inspect.isclass(dtype):
         dtype = dtype.__name__
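
As a rough illustration of the dtype check this patch relies on, here is a NumPy-only sketch. The helper name `is_numpy_unicode` is hypothetical and is not part of zarr. It uses `np.str_` because `np.unicode_` was an alias for it that NumPy 2.0 removed, and it passes the dtype under test as the first argument, since `np.issubdtype(arg1, arg2)` asks whether `arg1` sits at or below `arg2` in the type hierarchy.

```python
import numpy as np


def is_numpy_unicode(dtype):
    """Return True if `dtype` describes a NumPy unicode string dtype.

    Hypothetical helper for illustration only; it is not part of zarr.
    """
    # np.str_ is the scalar type behind '<U...' dtypes.
    return np.issubdtype(np.dtype(dtype), np.str_)


# Zero-dimensional unicode array, as in the report at the top of the thread.
scalar = np.array('world', dtype='<U5')

print(scalar.shape)                      # ()
print(is_numpy_unicode(scalar.dtype))    # True
print(is_numpy_unicode('<U12'))          # True
print(is_numpy_unicode(np.float64))      # False
print(is_numpy_unicode(object))          # False
# A '<U5' dtype is not the object dtype.
print(scalar.dtype == np.dtype(object))  # False
```

The last line shows that a fixed-width unicode dtype is distinct from `object`, which is the distinction the rest of the thread turns on.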

hmaarrfk avatar Aug 31 '20 18:08 hmaarrfk

Did you try using an object codec as noted here ( https://github.com/zarr-developers/zarr-python/issues/551#issuecomment-604231507 )? That's typically how we recommend handling Python objects (like str).

jakirkham avatar Aug 31 '20 19:08 jakirkham

Unfortunately, it ignores the object_codec because dtype != object

hmaarrfk avatar Aug 31 '20 19:08 hmaarrfk

Recent work by @abergou may have improved the situation with object codecs.

joshmoore avatar Sep 22 '21 14:09 joshmoore

Guessing that is referring to PR ( https://github.com/zarr-developers/zarr-python/pull/813 ) in Zarr 2.9.4+

jakirkham avatar Sep 22 '21 20:09 jakirkham