[v3] support for object arrays

Open jhamman opened this issue 1 year ago • 16 comments

Zarr-Python 2 supported object arrays. This functionality has not made it into Zarr-Python 3 yet (in part because there is not an obvious way to develop a v3 dtype for arbitrary Python objects).

An example demonstrating this functionality using Zarr-Python 2:

import numcodecs
import zarr

z = zarr.empty(5, dtype=object, object_codec=numcodecs.JSON())
z[0] = 42
z[1] = 'foo'
z[2] = ['bar', 'baz', 'qux']
z[3] = {'a': 1, 'b': 2.2}
z[:]
array([42, 'foo', list(['bar', 'baz', 'qux']), {'a': 1, 'b': 2.2}, None], dtype=object)

This issue tracks the development of object array support in Zarr-Python 3.

jhamman avatar Jan 02 '25 03:01 jhamman

I ran into this issue in https://github.com/pipefunc/pipefunc/pull/523/

in part because there is not an obvious way to develop a v3 dtype for arbitrary Python objects

Do you expect that object arrays will be supported in an early v3.* release?

basnijholt avatar Jan 09 '25 19:01 basnijholt

my main concern with the object dtype is the danger associated with using pickle, or any other encoding of python objects that could result in arbitrary code execution. but I don't think we have reached a formal decision on object arrays in v3.

d-v-b avatar Jan 09 '25 19:01 d-v-b

I am aware of that limitation/issue. In pipefunc we register our codec that uses cloudpickle.
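
For reference, such a codec is just a small numcodecs Codec subclass. A minimal sketch using numcodecs' Codec/register_codec API; the class name and codec_id here are illustrative, not pipefunc's actual implementation:

import cloudpickle
import numpy as np
import numcodecs
from numcodecs.abc import Codec

class CloudPickleCodec(Codec):
    """Serialize arbitrary Python objects with cloudpickle (illustrative sketch)."""

    codec_id = "cloudpickle"

    def encode(self, buf):
        # buf is the object array handed over by zarr; pickle it whole
        return cloudpickle.dumps(buf)

    def decode(self, buf, out=None):
        # buf is a bytes-like object; unpickle it and optionally copy into `out`
        dec = cloudpickle.loads(bytes(buf))
        if out is not None:
            np.copyto(out, dec)
            return out
        return dec

numcodecs.register_codec(CloudPickleCodec)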

basnijholt avatar Jan 09 '25 21:01 basnijholt

so we chatted about this in the developer meeting; the conclusion was that supporting object dtype arrays directly is not in scope for zarr-python 3.x, because of the security concerns inherent to storing arbitrary python objects, and because of our commitment to keeping zarr a format that's accessible to a wide range of languages.

that being said, we would be interested in identifying how zarr-python 3.x could be extended in a third-party library to add features like an object dtype. Our dtypes today are not extensible; I think this could be fixed, but it would require some design work first. Is that process something you would be interested in?

d-v-b avatar Jan 10 '25 15:01 d-v-b

@d-v-b Hi there. The NWB team would be interested in the idea of being able to extend zarr-python to add an object dtype. We would also be happy to work on this. You mentioned that it is currently not possible to extend zarr-python to create new dtypes?

mavaylon1 avatar Jan 28 '25 21:01 mavaylon1

correct, we haven't put together an API for user-defined dtypes in zarr python 3 yet. We definitely intend to add this feature, and we have a very promising proposal here: #2750. But I can't give you a definite timeline for when this feature would be released.

d-v-b avatar Jan 28 '25 21:01 d-v-b

I saw #2750 was closed in favor of https://github.com/zarr-developers/zarr-python/pull/2874.

I tried running the example https://github.com/zarr-developers/zarr-python/issues/2617#issue-2765408806 by @jhamman on the latest main branch of zarr-python, but it raises:

ValueError: Zarr data type resolution from object failed. Attempted to resolve a zarr data type from a numpy "Object" data type, which is ambiguous, as multiple zarr data types can be represented by the numpy "Object" data type. In this case you should construct your array by providing a specific Zarr data type. For a list of Zarr data types that are compatible with the numpy "Object"data type, see https://github.com/zarr-developers/zarr-python/issues/3117

It is not clear to me whether #2874 was supposed to add support for this, and going through the 7000+ LoC change is not an easy task 😅

Any updates here?

basnijholt avatar Oct 10 '25 17:10 basnijholt

hi @basnijholt see https://github.com/zarr-developers/zarr-python/issues/3077 for a discussion leading to the conclusion that zarr python will not attempt to do any data type inference for arrays that use the numpy "object" dtype.

To get your use case (JSON) working today, 2 things need to happen:

  1. We need to define a JSON data type in zarr-python. Instead of passing dtype='object' or dtype = np.dtype('O'), you would pass in dtype=JSON()
  2. We need to ensure that the JSON codec is properly wired up to the JSON data type.

We already did this for two other "object" data types (variable-length bytes and variable-length strings); we would just linearly extrapolate from that process to add JSON, I think.
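
A rough sketch of what usage might look like once both pieces exist (the JSON class and its import path here are hypothetical, not current zarr-python API):

import zarr
from zarr.core.dtype import JSON  # hypothetical data type, not yet implemented

# hypothetical: create an array whose elements are arbitrary JSON-serializable values
z = zarr.create_array({}, shape=(4,), chunks=(4,), dtype=JSON(), fill_value=None)
z[0] = 42
z[1] = "foo"
z[2] = ["bar", "baz", "qux"]
z[3] = {"a": 1, "b": 2.2}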

d-v-b avatar Oct 10 '25 17:10 d-v-b

and apologies for the unclear communication here -- getting these wrinkles documented properly is really important, and I haven't had the time to address that. The data types refactor was a lot of effort and so stuff like the JSON data type slipped through the cracks.

d-v-b avatar Oct 10 '25 17:10 d-v-b

Hey @d-v-b, thanks for the pointers and extra context!

My project here is https://github.com/pipefunc/pipefunc, which runs DAG-style pipelines, sweeps parameters in up to N dimensions, and stores every node’s return value in backends like Zarr. Those returns can be arbitrary Python objects, so we stash the cloudpickle.dumps blobs in np.ndarray[object] containers. The new guard blocking dtype=np.dtype("O") tripped me at first, but spelling out a dtype that already wires in the vlen-bytes codec solves it:

import cloudpickle
import numpy as np
import zarr
from zarr.core.dtype import VariableLengthBytes

# pipefunc_results stands in for the list of Python objects returned by the pipeline nodes
payloads = np.array([cloudpickle.dumps(obj) for obj in pipefunc_results], dtype=object)

z = zarr.create_array({}, shape=payloads.shape, chunks=payloads.shape, dtype=VariableLengthBytes(), fill_value=b"")

z[:] = payloads
roundtrip = z[:]
assert all(cloudpickle.loads(blob) == original for blob, original in zip(roundtrip, pipefunc_results))

I still see the UnstableSpecificationWarning, but the bytes round-trip cleanly so Zarr stays in the backend mix. Does relying on VariableLengthBytes() sound like the right long-term approach for this storage pattern, or would you recommend something else for serialized Python objects? Depending on that answer, I’ll update pipefunc’s pyproject.toml to drop the <3 cap on zarr (current pin lives at https://github.com/pipefunc/pipefunc/blob/d423059b28ce11d6a1a7baa7b637d7966c8935d1/pyproject.toml#L43) so users can pick up 3.x immediately.

Appreciate the clarification!

basnijholt avatar Oct 10 '25 17:10 basnijholt

The user-facing API of VariableLengthBytes should be stable for the foreseeable future, but the JSON form of that data type uses the identifier vlen-bytes which is unfortunate because there's a published spec for the equivalent data type that only differs in its name (bytes instead of vlen-bytes).

I haven't decided yet how to handle this for writing -- either we change the output serialization of VariableLengthBytes to use the on-spec "bytes" identifier, which is a breaking change but moving in the right direction, and also clearly heralded by the warning message. Or we just add a new data type, maybe VariableLengthBytes2, which uses the "bytes" identifier, make this one the default for zarr-python, but we keep the old VariableLengthBytes around, warts and all.

Either way, users working with bytes data should not have to see warnings.

d-v-b avatar Oct 10 '25 17:10 d-v-b

or would you recommend something else for serialized Python objects?

The answer to this question depends a bit on the scope of your project. Are these zarrified python objects exposed publicly, or are they an internal detail of your library? If they are public, then a data type that explicitly models "arbitrary python objects" and uses pickle for encoding / decoding would be more direct (but also super dangerous, because this allows arbitrary code execution). Zarr python 2.x supported this, but we chose not to add direct support for it in v3 out of security concerns. But you could totally implement such a thing as a user-defined data type.

d-v-b avatar Oct 10 '25 17:10 d-v-b

I can see the value of a "pickled python object" dtype extension, provided it came with the necessary safety warnings.

rabernat avatar Oct 10 '25 17:10 rabernat

Hey @d-v-b, thanks for the follow-up—and for the reality check on security.

In pipefunc, the Zarr integration lives in pipefunc/pipefunc/map/_storage_array/_zarr.py. We create arrays with dtype=object and rely on a custom CloudPickleCodec so every cell stores the cloudpickle.dumps(...) payload returned by a pipeline node.

A couple of clarifications:

  • Those arrays are internal storage. Researchers (most users, AFAIK) on the same project reopen them via pipefunc utilities (e.g., pipefunc.map.load_all_outputs("run_folder") or pipefunc.map.load_xarray_dataset("run_folder"), which are different representations of the same data) and then export derived artefacts. The raw “zarrified python objects” are never handed to untrusted parties (or at least they shouldn't be).
  • Because the reader already trusts the writer, the usual pickle hazard is acceptable in this setting.

Given that trust boundary, we’ll standardize on the built-in VariableLengthBytes() dtype and keep handling the pickle/unpickle step around the array ourselves. That mirrors our existing behaviour and stays on the supported surface while Zarr’s dtype-extension story is still taking shape.
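
Concretely, that wrapper is just a pair of small helpers around the array; a minimal sketch (the helper names are illustrative, not pipefunc API):

import cloudpickle
import numpy as np

def dump_objects(zarr_array, objects):
    # pickle each object and write the blobs into a VariableLengthBytes array
    zarr_array[:] = np.array([cloudpickle.dumps(o) for o in objects], dtype=object)

def load_objects(zarr_array):
    # read the blobs back and unpickle them
    return [cloudpickle.loads(blob) for blob in zarr_array[:]]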

For what it’s worth, one of the reasons I love Zarr is its extensibility: we can drop in a custom codec, point at cloud storage, or integrate with other tooling with almost no friction. If the project were restricted to purely numerical or standard types it would lose a lot of that appeal to workflow engines like pipefunc, where intermediate artefacts aren’t always tidy NumPy scalars.

basnijholt avatar Oct 10 '25 18:10 basnijholt

@basnijholt thanks for explaining your use case!

Since this is all internal we can ignore the security issues. Modelling these objects as variable-length byte strings and pickling / unpickling outside of the codec logic would work; you could also define a PyObject data type that links up with a CloudPickleCodec (you'd have to define that one too). This would treat pickling / unpickling as part of the array compression process, and would be a bit more direct than using the variable-length bytes dtype. But it also requires writing a bit more code.

fwiw, I don't think our documentation covers this procedure nearly well enough.

d-v-b avatar Oct 10 '25 18:10 d-v-b

Just for completeness' sake, here is the freshly merged PR (https://github.com/pipefunc/pipefunc/pull/523) that converts PipeFunc from v2 to v3 using VariableLengthBytes.

basnijholt avatar Nov 04 '25 20:11 basnijholt