Bytes array + ellipsis slicing -> IndexError: too many indices for array; expected 0, got 1
Zarr version
v3.1.2
Numcodecs version
v0.16.2
Python Version
3.12
Operating System
Mac
Installation
uv/pip
Description
Related to https://github.com/zarr-developers/zarr-python/issues/2436 but not exactly the same - trying to save a bytes array in a MemoryStore causes IndexError.
It does warn me that I should use bytes at my own risk 🙃 so perhaps it's not technically a bug? But a little more explanation on why bytes are risky would really be useful - the provided link wasn't very enlightening to me. We are implementing a standard file format ([geff](http://liveimagetrackingtools.org/geff/latest/)) that we hope to be readable across Java and Python, and storing string arrays as byte arrays seemed to be the consensus for how to ensure readability across languages. Is this a bad idea?
Steps to reproduce
# /// script
# requires-python = "==3.12"
# dependencies = ["zarr==3.1.2", "numpy"]
# ///
import zarr
import numpy as np
import numcodecs


def main() -> None:
    store = zarr.storage.MemoryStore()
    # 0-dimensional (scalar) array holding a single fixed-length bytes value
    arr = np.array("teststr", dtype=np.bytes_)
    root = zarr.open(store)
    # Assigning the array to a group key raises the IndexError shown below
    root["test"] = arr


if __name__ == "__main__":
    print(numcodecs.__version__)
    main()
malinmayorc@malinmayorc-lm1 scratch % uv run test_zarr.py
Installed 8 packages in 21ms
0.16.2
/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/dtype/npy/bytes.py:383: UnstableSpecificationWarning: The data type (NullTerminatedBytes(length=7)) does not have a Zarr V3 specification. That means that the representation of arrays saved with this data type may change without warning in a future version of Zarr Python. Arrays stored with this data type may be unreadable by other Zarr libraries. Use this data type at your own risk! Check https://github.com/zarr-developers/zarr-extensions/tree/main/data-types for the status of data type specifications for Zarr V3.
  v3_unstable_dtype_warning(self)
Traceback (most recent call last):
  File "/Users/malinmayorc/code/scratch/test_zarr.py", line 18, in <module>
    main()
  File "/Users/malinmayorc/code/scratch/test_zarr.py", line 13, in main
    root["test"] = arr
    ~~~~^^^^^^^^
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/group.py", line 2009, in __setitem__
    self._sync(self._async_group.setitem(key, value))
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/sync.py", line 208, in _sync
    return sync(
           ^^^^^
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/sync.py", line 163, in sync
    raise return_result
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/sync.py", line 119, in _runner
    return await coro
           ^^^^^^^^^^
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/group.py", line 692, in setitem
    await async_api.save_array(
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/api/asynchronous.py", line 477, in save_array
    await new.setitem(slice(None), arr)
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/array.py", line 1753, in setitem
    indexer = BasicIndexer(
              ^^^^^^^^^^^^^
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/indexing.py", line 602, in __init__
    selection_normalized = replace_ellipsis(selection, shape)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/indexing.py", line 523, in replace_ellipsis
    check_selection_length(selection, shape)
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/indexing.py", line 488, in check_selection_length
    err_too_many_indices(selection, shape)
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/indexing.py", line 78, in err_too_many_indices
    raise IndexError(f"too many indices for array; expected {len(shape)}, got {len(selection)}")
IndexError: too many indices for array; expected 0, got 1
Additional output
No response
looks like a bug in save_array. this works if you create a zarr array explicitly:
# /// script
# requires-python = "==3.12"
# dependencies = ["zarr==3.1.2", "numpy"]
# ///
import zarr
import numpy as np
import numcodecs


def main() -> None:
    store = zarr.storage.MemoryStore()
    arr = np.array("teststr", dtype=np.bytes_)
    root = zarr.open(store)
    # Creating the array explicitly avoids the failing save_array path
    root.create_array("test", data=arr)


if __name__ == "__main__":
    print(numcodecs.__version__)
    main()
Personally I don't like the group["name"] = numpy_array pattern but we should definitely fix this bug
if you don't mind me asking, why are you using np.bytes_? the variable length string dtype is generally a better choice
Thanks for the workaround! I can use that for now.
> Personally I don't like the group["name"] = numpy_array pattern
Agreed, but we are trying to stay zarr v2 and v3 compatible for now, so this pattern avoids any API differences between the versions. Can't remember if it's needed in the create_array case or just open_group/create_group
> if you don't mind me asking, why are you using np.bytes_? the variable length string dtype is generally a better choice
I think because java zarr/n5 doesn't support reading the variable length string dtype? But I would be very happy to be wrong about that
this will work for zarr v2 or v3:
# /// script
# requires-python = "==3.12"
# dependencies = ["zarr==3.1.0", "numcodecs<=0.15"]
# ///
import zarr
import numpy as np
import numcodecs


def main() -> None:
    store = {}
    arr = np.array(["foo"], dtype=np.bytes_)
    root = zarr.open(store)
    # Create an empty array first, then write the data into it
    root.create("test", shape=arr.shape, dtype=arr.dtype, compressor=None)
    root["test"][:] = arr
    print(root["test"][:])


if __name__ == "__main__":
    print(numcodecs.__version__)
    main()
It's not great that there are so many ways to create arrays 🫠
> I think because java zarr/n5 doesn't support reading the variable length string dtype? But I would be very happy to be wrong about that
That's a question for the n5 team I guess. The variable length strings here are just UTF-8, which is much more commonly supported in general than fixed-length strings. Here's the specification for the data type, and here's the specification for the codec required for using it.
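For readers weighing the two encodings, here is a rough plain-Python sketch of what a fixed-length bytes layout looks like. It assumes each value is stored as a null-padded record of a fixed width, which is what the `NullTerminatedBytes(length=7)` name in the warning suggests; `pack_fixed` and `unpack_fixed` are hypothetical helpers for illustration, not part of zarr or NumPy.

```python
def pack_fixed(strings: list[str], width: int) -> bytes:
    """Encode each string as UTF-8 and right-pad it with null bytes to `width`."""
    out = bytearray()
    for s in strings:
        raw = s.encode("utf-8")
        if len(raw) > width:
            raise ValueError(f"{s!r} needs {len(raw)} bytes, width is {width}")
        out += raw.ljust(width, b"\x00")
    return bytes(out)


def unpack_fixed(buf: bytes, width: int) -> list[str]:
    """Split into `width`-byte records and strip the trailing null padding."""
    records = (buf[i : i + width] for i in range(0, len(buf), width))
    return [r.rstrip(b"\x00").decode("utf-8") for r in records]


# Every record occupies exactly `width` bytes, however short the string is.
packed = pack_fixed(["teststr", "foo"], width=7)
unpacked = unpack_fixed(packed, width=7)
```

Note that the width is counted in bytes, not characters, so the longest encoded value dictates the storage used by every element, and non-ASCII characters take more than one byte each.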
> zarr v2 and v3 compatible
I'm assuming you're talking about zarr-python versions here... why do you need to be compatible with zarr-python 2.x? As far as I know, Zarr python 3 supports the same zarr features as 2.x. N5 support is missing, but I'm guessing you don't need that here.
the underlying bug is not specific to the bytes data type, but rather the fact that the array is 0-dimensional:
# /// script
# requires-python = "==3.12"
# dependencies = ["zarr==3.1.0"]
# ///
import zarr
import numpy as np
store = {}
arr = np.array(1, dtype=int)
root = zarr.open(store)
root["test"] = arr
print(root["test"][:])
"""
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/d-v-b/.cache/uv/environments-v2/test-3f720821ed2e4bcf/lib/python3.12/site-packages/zarr/core/indexing.py", line 467, in replace_ellipsis
    check_selection_length(selection, shape)
  File "/home/d-v-b/.cache/uv/environments-v2/test-3f720821ed2e4bcf/lib/python3.12/site-packages/zarr/core/indexing.py", line 432, in check_selection_length
    err_too_many_indices(selection, shape)
  File "/home/d-v-b/.cache/uv/environments-v2/test-3f720821ed2e4bcf/lib/python3.12/site-packages/zarr/core/indexing.py", line 76, in err_too_many_indices
    raise IndexError(f"too many indices for array; expected {len(shape)}, got {len(selection)}")
IndexError: too many indices for array; expected 0, got 1
"""
I think it's caused by some routine failing to handle the fact that this is a scalar array: a 0-dimensional array accepts no indices, so a whole-array selection expressed as a single slice doesn't work. We need to add a special case for scalars.
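To make the scalar case concrete, here is a minimal pure-Python model of the length check seen in the traceback. It is not zarr's actual code: `check_selection_length` only mirrors the name and error message from the traceback, and `whole_array_selection` is a hypothetical helper. It shows that a hard-coded `slice(None)` (as passed by `save_array` in the traceback) always counts as one index, while a selection built from the array's rank is an empty tuple for a scalar.

```python
def check_selection_length(selection: tuple, shape: tuple[int, ...]) -> None:
    """Simplified stand-in for the check in the traceback (not zarr's code)."""
    if len(selection) > len(shape):
        raise IndexError(
            f"too many indices for array; expected {len(shape)}, got {len(selection)}"
        )


def whole_array_selection(shape: tuple[int, ...]) -> tuple[slice, ...]:
    """Hypothetical helper: one full slice per dimension, so () for a scalar."""
    return (slice(None),) * len(shape)


# A hard-coded slice(None) is one index too many for a 0-dimensional shape.
try:
    check_selection_length((slice(None),), ())
except IndexError as exc:
    message = str(exc)

# A rank-aware selection passes the check for scalars and for n-d arrays.
check_selection_length(whole_array_selection(()), ())
check_selection_length(whole_array_selection((3, 4)), (3, 4))
```

Whether the real fix belongs in `save_array` or in the indexing code is a separate question; the sketch only illustrates why the 0-dimensional case needs its own handling.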
> I'm assuming you're talking about zarr-python versions here... why do you need to be compatible with zarr-python 2.x? As far as I know, Zarr python 3 supports the same zarr features as 2.x. N5 support is missing, but I'm guessing you don't need that here.
Yes, zarr-python versions, because we want the library to be included in/depended on by as many different applications as possible. If research code hasn't been upgraded yet, we don't want to force people to upgrade their zarr python version just to export a GEFF.