[v3] support for object arrays
Zarr-Python 2 supported object arrays. This functionality has not made it into Zarr-Python 3 yet (in part because there is not an obvious way to develop a v3 dtype for arbitrary Python objects).
An example demonstrating this functionality using Zarr-Python 2:
import numcodecs
import zarr

# the object_codec serializes arbitrary Python values (here as JSON)
z = zarr.empty(5, dtype=object, object_codec=numcodecs.JSON())
z[0] = 42
z[1] = 'foo'
z[2] = ['bar', 'baz', 'qux']
z[3] = {'a': 1, 'b': 2.2}
z[:]
array([42, 'foo', list(['bar', 'baz', 'qux']), {'a': 1, 'b': 2.2}, None], dtype=object)
This issue tracks the development of object array support in Zarr-Python 3.
I ran into this issue in https://github.com/pipefunc/pipefunc/pull/523/
in part because there is not an obvious way to develop a v3 dtype for arbitrary Python objects
Do you expect that object arrays will be supported at some early v3.* release?
my main concern with the object dtype is the danger associated with using pickle, or any other encoding of python objects that could result in arbitrary code execution. but I don't think we have reached a formal decision on object arrays in v3.
So we chatted about this in the developer meeting; the conclusion was that supporting object dtype arrays directly is not in scope for zarr-python 3.x, because of the security concerns inherent to storing arbitrary python objects, and our commitment to keeping zarr a format that's accessible to a wide range of languages.
That being said, we would be interested in identifying how zarr-python 3.x could be extended in a third-party library to add features like an object dtype. Our dtypes today are not extensible; I think this could be fixed, but it would require some design work first. Is that process something you would be interested in?
@d-v-b Hi there. The NWB team would be interested in the idea of being able to extend zarr-python to add an object dtype. We would also be happy to work on this. You mentioned that we are currently not able to extend zarr-python to create new dtypes?
correct, we haven't put together an API for user-defined dtypes in zarr python 3 yet. We definitely intend to add this feature, and we have a very promising proposal here: #2750. But I can't give you a definite timeline for when this feature would be released.
I saw #2750 was closed in favor of https://github.com/zarr-developers/zarr-python/pull/2874.
I tried running the example https://github.com/zarr-developers/zarr-python/issues/2617#issue-2765408806 by @jhamman on the latest main branch of zarr-python, but it raises:
ValueError: Zarr data type resolution from object failed. Attempted to resolve a zarr data type from a numpy "Object" data type, which is ambiguous, as multiple zarr data types can be represented by the numpy "Object" data type. In this case you should construct your array by providing a specific Zarr data type. For a list of Zarr data types that are compatible with the numpy "Object"data type, see https://github.com/zarr-developers/zarr-python/issues/3117
It is not clear to me whether #2874 was supposed to add support for this, and going through the 7000+ LoC change is not an easy task 😅
Any updates here?
hi @basnijholt see https://github.com/zarr-developers/zarr-python/issues/3077 for a discussion leading to the conclusion that zarr python will not attempt to do any data type inference for arrays that use the numpy "object" dtype.
To get your use case (JSON) working today, 2 things need to happen:
- We need to define a JSON data type in zarr-python. Instead of passing dtype='object' or dtype = np.dtype('O'), you would pass in dtype=JSON().
- We need to ensure that the JSON codec is properly wired up to the JSON data type.
We already did this for two other "object" data types (variable-length bytes and variable-length strings); we would just linearly extrapolate from that process to add JSON I think.
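In the meantime, a rough workaround sketch (my assumption of what this could look like today: encode to JSON outside of zarr and store the results with the existing variable-length bytes data type):

import json
import numpy as np
import zarr
from zarr.core.dtype import VariableLengthBytes

# hypothetical example values; each one is encoded to JSON bytes before it reaches zarr
values = [42, 'foo', ['bar', 'baz', 'qux'], {'a': 1, 'b': 2.2}]
payloads = np.array([json.dumps(v).encode() for v in values], dtype=object)
z = zarr.create_array({}, shape=payloads.shape, dtype=VariableLengthBytes(), fill_value=b"")
z[:] = payloads
roundtrip = [json.loads(blob) for blob in z[:]]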
and apologies for the unclear communication here -- getting these wrinkles documented properly is really important, and I haven't had the time to address that. The data types refactor was a lot of effort and so stuff like the JSON data type slipped through the cracks.
Hey @d-v-b, thanks for the pointers and extra context!
My project here is https://github.com/pipefunc/pipefunc, which runs DAG-style pipelines, sweeps parameters in up to N dimensions, and stores every node’s return value in backends like Zarr. Those returns can be arbitrary Python objects, so we stash the cloudpickle.dumps blobs in np.ndarray[object] containers. The new guard blocking dtype=np.dtype("O") tripped me at first, but spelling out a dtype that already wires in the vlen-bytes codec solves it:
import cloudpickle
import numpy as np
import zarr
from zarr.core.dtype import VariableLengthBytes
# pipefunc_results: the list of Python objects returned by the pipeline (placeholder here)
payloads = np.array([cloudpickle.dumps(obj) for obj in pipefunc_results], dtype=object)
z = zarr.create_array({}, shape=payloads.shape, chunks=payloads.shape, dtype=VariableLengthBytes(), fill_value=b"")
z[:] = payloads
roundtrip = z[:]
assert all(cloudpickle.loads(blob) == original for blob, original in zip(roundtrip, pipefunc_results))
I still see the UnstableSpecificationWarning, but the bytes round-trip cleanly so Zarr stays in the backend mix.
Does relying on VariableLengthBytes() sound like the right long-term approach for this storage pattern, or would you recommend something else for serialized Python objects?
Depending on that answer, I’ll update pipefunc’s pyproject.toml to drop the <3 cap on zarr (current pin lives at https://github.com/pipefunc/pipefunc/blob/d423059b28ce11d6a1a7baa7b637d7966c8935d1/pyproject.toml#L43) so users can pick up 3.x immediately.
Appreciate the clarification!
The user-facing API of VariableLengthBytes should be stable for the foreseeable future, but the JSON form of that data type uses the identifier vlen-bytes which is unfortunate because there's a published spec for the equivalent data type that only differs in its name (bytes instead of vlen-bytes).
I haven't decided yet how to handle this for writing. Either we change the output serialization of VariableLengthBytes to use the on-spec "bytes" identifier, which is a breaking change but moves in the right direction (and is clearly heralded by the warning message), or we add a new data type, maybe VariableLengthBytes2, that uses the "bytes" identifier, make it the default for zarr-python, and keep the old VariableLengthBytes around, warts and all.
Either way, users working with bytes data should not have to see warnings.
or would you recommend something else for serialized Python objects?
The answer to this question depends a bit on the scope of your project. Are these zarrified python objects exposed publicly, or are they an internal detail of your library? If they are public, then a data type that explicitly models "arbitrary python objects" and uses pickle for encoding / decoding would be more direct (but also super dangerous, because it allows arbitrary code execution). Zarr python 2.x supported this, but we chose not to add direct support for it in v3 out of security concerns. But you could totally implement such a thing as a user-defined data type.
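For reference, in the 2.x API that looked roughly like this (using the stock numcodecs Pickle codec as the object_codec; the same arbitrary-code-execution caveat applies when reading):

import numcodecs
import zarr

# Zarr-Python 2 only: each chunk of objects is serialized with pickle
z = zarr.empty(5, dtype=object, object_codec=numcodecs.Pickle())
z[0] = {'any': 'python', 'object': ('works', 'here')}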
I can see the value of a "pickled python object" dtype extension, provided it came with the necessary safety warnings.
Hey @d-v-b, thanks for the follow-up—and for the reality check on security.
In pipefunc, the Zarr integration lives in pipefunc/pipefunc/map/_storage_array/_zarr.py. We create arrays with dtype=object and rely on a custom CloudPickleCodec so every cell stores the cloudpickle.dumps(...) payload returned by a pipeline node.
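For context, it is basically a cloudpickle-flavoured numcodecs.Pickle; a rough sketch of that shape (not the exact code in our repo) is:

import cloudpickle
import numpy as np
from numcodecs.abc import Codec
from numcodecs.registry import register_codec

class CloudPickleCodec(Codec):
    # illustrative sketch only; the real codec in pipefunc differs in details
    codec_id = "cloudpickle"

    def encode(self, buf):
        # serialize the chunk (an object ndarray for v2 object arrays) to bytes
        return cloudpickle.dumps(buf)

    def decode(self, buf, out=None):
        # reconstruct the original objects from the stored bytes
        dec = cloudpickle.loads(bytes(buf))
        if out is not None:
            np.copyto(out, dec)
            return out
        return dec

register_codec(CloudPickleCodec)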
A couple of clarifications:
- Those arrays are internal storage. Researchers (most users AFAIK) on the same project reopen them via pipefunc utilities (e.g., pipefunc.map.load_all_outputs("run_folder") or pipefunc.map.load_xarray_dataset("run_folder"), which are different representations of the same data) and then export derived artefacts. The raw "zarrified python objects" are never handed to untrusted parties (or at least they shouldn't be).
- Because the reader already trusts the writer, the usual pickle hazard is acceptable in this setting.
Given that trust boundary, we’ll standardize on the built-in VariableLengthBytes() dtype and keep handling the pickle/unpickle step around the array ourselves. That mirrors our existing behaviour and stays on the supported surface while Zarr’s dtype-extension story is still taking shape.
For what it’s worth, one of the reasons I love Zarr is its extensibility: we can drop in a custom codec, point at cloud storage, or integrate with other tooling with almost no friction. If the project were restricted to purely numerical or standard types it would lose a lot of that appeal to workflow engines like pipefunc, where intermediate artefacts aren’t always tidy NumPy scalars.
@basnijholt thanks for explaining your use case!
Since this is all internal, we can ignore the security issues. Modelling these objects as variable-length byte strings and pickling / unpickling outside of the codec logic would work; you could also define a PyObject data type that links up with a CloudPickleCodec (you'd have to define that one too). This would treat pickling / unpickling as part of the array compression process, and would be a bit more direct than using the variable-length bytes dtype. But it also requires writing a bit more code.
fwiw, I don't think our documentation covers this procedure nearly well enough.
Just for completeness' sake, here is the freshly merged PR (https://github.com/pipefunc/pipefunc/pull/523) that converts PipeFunc from v2 to v3 using VariableLengthBytes.