
Bytes array + ellipsis slicing -> IndexError: too many indices for array; expected 0, got 1

Open cmalinmayor opened this issue 3 months ago • 7 comments

Zarr version

v3.1.2

Numcodecs version

v0.16.2

Python Version

3.12

Operating System

Mac

Installation

uv/pip

Description

Related to https://github.com/zarr-developers/zarr-python/issues/2436 but not exactly the same - trying to save a bytes array in a MemoryStore causes IndexError.

It does warn me that I should use bytes at my own risk 🙃 so perhaps it's not technically a bug? But a little more explanation on why bytes are risky would really be useful - the provided link wasn't very enlightening to me. We are implementing a standard file format ([geff](http://liveimagetrackingtools.org/geff/latest/)) that we hope to be readable across Java and Python, and storing string arrays as byte arrays seemed to be the consensus for how to ensure readability across languages. Is this a bad idea?

Steps to reproduce

# /// script
# requires-python = "==3.12"
# dependencies = ["zarr==3.1.2", "numpy"]
# ///
import zarr
import numpy as np
import numcodecs

def main() -> None:
    store = zarr.storage.MemoryStore()
    arr = np.array("teststr", dtype=np.bytes_)
    root = zarr.open(store)
    root["test"] = arr


if __name__ == "__main__":
    print(numcodecs.__version__)
    main()

malinmayorc@malinmayorc-lm1 scratch % uv run test_zarr.py
Installed 8 packages in 21ms
0.16.2
/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/dtype/npy/bytes.py:383: UnstableSpecificationWarning: The data type (NullTerminatedBytes(length=7)) does not have a Zarr V3 specification. That means that the representation of arrays saved with this data type may change without warning in a future version of Zarr Python. Arrays stored with this data type may be unreadable by other Zarr libraries. Use this data type at your own risk! Check https://github.com/zarr-developers/zarr-extensions/tree/main/data-types for the status of data type specifications for Zarr V3.
  v3_unstable_dtype_warning(self)
Traceback (most recent call last):
  File "/Users/malinmayorc/code/scratch/test_zarr.py", line 18, in <module>
    main()
  File "/Users/malinmayorc/code/scratch/test_zarr.py", line 13, in main
    root["test"] = arr
    ~~~~^^^^^^^^
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/group.py", line 2009, in __setitem__
    self._sync(self._async_group.setitem(key, value))
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/sync.py", line 208, in _sync
    return sync(
           ^^^^^
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/sync.py", line 163, in sync
    raise return_result
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/sync.py", line 119, in _runner
    return await coro
           ^^^^^^^^^^
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/group.py", line 692, in setitem
    await async_api.save_array(
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/api/asynchronous.py", line 477, in save_array
    await new.setitem(slice(None), arr)
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/array.py", line 1753, in setitem
    indexer = BasicIndexer(
              ^^^^^^^^^^^^^
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/indexing.py", line 602, in __init__
    selection_normalized = replace_ellipsis(selection, shape)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/indexing.py", line 523, in replace_ellipsis
    check_selection_length(selection, shape)
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/indexing.py", line 488, in check_selection_length
    err_too_many_indices(selection, shape)
  File "/Users/malinmayorc/.cache/uv/archive-v0/kGUZcAWKHkSnbW4e_8QNm/lib/python3.12/site-packages/zarr/core/indexing.py", line 78, in err_too_many_indices
    raise IndexError(f"too many indices for array; expected {len(shape)}, got {len(selection)}")
IndexError: too many indices for array; expected 0, got 1

Additional output

No response

cmalinmayor, Sep 17 '25 14:09

looks like a bug in save_array. this works if you create a zarr array explicitly:

# /// script
# requires-python = "==3.12"
# dependencies = ["zarr==3.1.2", "numpy"]
# ///
import zarr
import numpy as np
import numcodecs

def main() -> None:
    store = zarr.storage.MemoryStore()
    arr = np.array("teststr", dtype=np.bytes_)
    root = zarr.open(store)
    root.create_array("test", data=arr)


if __name__ == "__main__":
    print(numcodecs.__version__)
    main()

Personally I don't like the group["name"] = numpy_array pattern but we should definitely fix this bug

d-v-b, Sep 17 '25 14:09

if you don't mind me asking, why are you using np.bytes_? the variable length string dtype is generally a better choice

d-v-b, Sep 17 '25 14:09

Thanks for the workaround! I can use that for now.

> Personally I don't like the group["name"] = numpy_array pattern

Agreed, but we are trying to stay compatible with both zarr v2 and v3 for now, so this pattern avoids any API differences between the versions. I can't remember whether that is needed in the create_array case or just open_group/create_group.

> if you don't mind me asking, why are you using np.bytes_? the variable length string dtype is generally a better choice

I think because java zarr/n5 doesn't support reading the variable length string dtype? But I would be very happy to be wrong about that

cmalinmayor, Sep 17 '25 15:09

this will work for zarr v2 or v3:

# /// script
# requires-python = "==3.12"
# dependencies = ["zarr==3.1.0", "numcodecs<=0.15"]
# ///
import zarr
import numpy as np
import numcodecs

def main() -> None:
    store = {}
    arr = np.array(["foo"], dtype=np.bytes_)
    root = zarr.open(store)
    root.create("test", shape=arr.shape, dtype=arr.dtype, compressor=None)
    root["test"][:] = arr
    print(root["test"][:])

if __name__ == "__main__":
    print(numcodecs.__version__)
    main()

It's not great that there are so many ways to create arrays 🫠

> I think because java zarr/n5 doesn't support reading the variable length string dtype? But I would be very happy to be wrong about that

That's a question for the n5 team I guess. The variable length strings here are just UTF-8, which is much more commonly supported in general than fixed-length strings. Here's the specification for the data type, and here's the specification for the codec required for using it.
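To make the contrast between the two representations concrete, here is a minimal plain-Python sketch (not zarr or numpy code). It assumes the fixed-length type pads short values with null bytes, as numpy's fixed-width bytes dtype does; the helper names `to_fixed_bytes` and `from_fixed_bytes` are hypothetical and exist only for illustration.

```python
def to_fixed_bytes(text: str, length: int) -> bytes:
    """Encode text into a fixed-width field, padded with null bytes (assumed rule)."""
    raw = text.encode("utf-8")
    if len(raw) > length:
        raise ValueError(f"{text!r} needs {len(raw)} bytes, field holds {length}")
    return raw.ljust(length, b"\x00")


def from_fixed_bytes(field: bytes) -> str:
    """Decode a fixed-width field by stripping the trailing null padding."""
    return field.rstrip(b"\x00").decode("utf-8")


values = ("teststr", "foo")

# Fixed-length: every element occupies exactly 7 bytes.
fixed = [to_fixed_bytes(s, 7) for s in values]

# Variable-length: each element is just its UTF-8 encoding.
variable = [s.encode("utf-8") for s in values]

# Both forms round-trip these values.
decoded_fixed = [from_fixed_bytes(f) for f in fixed]
decoded_variable = [v.decode("utf-8") for v in variable]
```

In this sketch a reader of the fixed-length form has to know both the field width and the padding rule, while a reader of the variable-length form only needs a UTF-8 decoder.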

d-v-b, Sep 17 '25 15:09

> zarr v2 and v3 compatible

I'm assuming you're talking about zarr-python versions here... why do you need to be compatible with zarr-python 2.x? As far as I know, Zarr python 3 supports the same zarr features as 2.x. N5 support is missing, but I'm guessing you don't need that here.

d-v-b, Sep 17 '25 15:09

the underlying bug is not specific to the bytes data type, but rather the fact that the array is 0-dimensional:

# /// script
# requires-python = "==3.12"
# dependencies = ["zarr==3.1.0"]
# ///
import zarr
import numpy as np

store = {}
arr = np.array(1, dtype=int)
root = zarr.open(store)
root["test"] = arr
print(root["test"][:])
"""
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/d-v-b/.cache/uv/environments-v2/test-3f720821ed2e4bcf/lib/python3.12/site-packages/zarr/core/indexing.py", line 467, in replace_ellipsis
    check_selection_length(selection, shape)
  File "/home/d-v-b/.cache/uv/environments-v2/test-3f720821ed2e4bcf/lib/python3.12/site-packages/zarr/core/indexing.py", line 432, in check_selection_length
    err_too_many_indices(selection, shape)
  File "/home/d-v-b/.cache/uv/environments-v2/test-3f720821ed2e4bcf/lib/python3.12/site-packages/zarr/core/indexing.py", line 76, in err_too_many_indices
    raise IndexError(f"too many indices for array; expected {len(shape)}, got {len(selection)}")
IndexError: too many indices for array; expected 0, got 1
"""

I think it's caused by some routine failing to handle the fact that this is a scalar array, and so ellipsis-based indexing doesn't work. We need to add a special case for scalars.
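For readers following along, here is a simplified plain-Python model of the failure. It is a sketch, not zarr's actual implementation: the function names mirror those in the traceback, but the bodies are assumptions written for illustration. It shows why a `slice(None)` selection (what `save_array` passes to `setitem` in the first traceback) is rejected for a shape of `()`, while an ellipsis selection expands to an empty tuple and passes.

```python
def check_selection_length(selection, shape):
    """Reject selections that index more dimensions than the array has."""
    if len(selection) > len(shape):
        raise IndexError(
            f"too many indices for array; expected {len(shape)}, got {len(selection)}"
        )


def replace_ellipsis(selection, shape):
    """Simplified model: expand `...` into one full slice per missing axis."""
    if not isinstance(selection, tuple):
        selection = (selection,)
    if Ellipsis in selection:
        i = selection.index(Ellipsis)
        n_missing = len(shape) - (len(selection) - 1)
        selection = selection[:i] + (slice(None),) * n_missing + selection[i + 1:]
    # Pad any trailing axes that were not mentioned with full slices.
    selection += (slice(None),) * (len(shape) - len(selection))
    check_selection_length(selection, shape)
    return selection


scalar_shape = ()  # shape of a 0-dimensional array

# An ellipsis is valid for any rank, including rank 0.
ok = replace_ellipsis(Ellipsis, scalar_shape)

# A single full slice names one axis, which a 0-d array does not have.
try:
    replace_ellipsis(slice(None), scalar_shape)
    error_message = None
except IndexError as exc:
    error_message = str(exc)
```

Under this model, one possible shape for the special case is to select the whole array with `...` rather than `slice(None)`, since the ellipsis form is valid for every rank. Whether that is the right fix for zarr-python is for the maintainers to decide.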

d-v-b, Sep 17 '25 16:09

> I'm assuming you're talking about zarr-python versions here... why do you need to be compatible with zarr-python 2.x? As far as I know, Zarr python 3 supports the same zarr features as 2.x. N5 support is missing, but I'm guessing you don't need that here.

Yes, zarr-python versions, because we want the library to be included in/depended on by as many different applications as possible. If research code hasn't been upgraded yet, we don't want to force people to upgrade their zarr python version just to export a GEFF.

cmalinmayor, Sep 17 '25 16:09