zarr-python icon indicating copy to clipboard operation
zarr-python copied to clipboard

add bytes dtype

Open d-v-b opened this issue 2 months ago • 2 comments

This PR adds a Bytes dtype that is nearly identical to the existing VariableLengthBytes dtype save a few differences:

  • the V3 JSON form is "bytes" instead of "variable_length_bytes"
  • The fill value representation is an array of ints, instead of a base64-encoded string
  • Bytes is consistent with a published spec, instead of not being described by a spec.

The latter point is the most important.

Because Bytes is nearly identical to VariableLengthBytes, it could be a drop-in replacement for VariableLengthBytes, save for the JSON fill value encoding differences between the two codecs, which raises two questions:

  • is base64 encoding a bytestring a better or worse JSON serialization than representing the same bytestring as an array of integers?
  • could (or should) we amend the bytes data type spec to recommend reading (or reading and writing) both fill value encodings? If so, the bytes data type and the variable_length_bytes data types could be fused completely.

Since zarr python 2.x was saving bytes fill values as base64-encoded strings, there's a bit of inertia there. Would also be good to hear thoughts from other implementers (@LDeakin , @jbms, @manzt )

d-v-b avatar Oct 28 '25 20:10 d-v-b

Codecov Report

:x: Patch coverage is 51.85185% with 52 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 60.73%. Comparing base (b873691) to head (c4010dd).

Files with missing lines Patch % Lines
src/zarr/core/dtype/npy/bytes.py 57.14% 30 Missing :warning:
src/zarr/core/dtype/__init__.py 53.84% 6 Missing :warning:
src/zarr/core/dtype/registry.py 66.66% 4 Missing :warning:
src/zarr/errors.py 0.00% 3 Missing :warning:
src/zarr/core/dtype/npy/bool.py 0.00% 1 Missing :warning:
src/zarr/core/dtype/npy/common.py 50.00% 1 Missing :warning:
src/zarr/core/dtype/npy/complex.py 0.00% 1 Missing :warning:
src/zarr/core/dtype/npy/float.py 0.00% 1 Missing :warning:
src/zarr/core/dtype/npy/int.py 0.00% 1 Missing :warning:
src/zarr/core/dtype/npy/string.py 0.00% 1 Missing :warning:
... and 3 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3559      +/-   ##
==========================================
- Coverage   60.90%   60.73%   -0.17%     
==========================================
  Files          86       86              
  Lines       10174    10267      +93     
==========================================
+ Hits         6196     6236      +40     
- Misses       3978     4031      +53     
Files with missing lines Coverage Δ
src/zarr/core/dtype/common.py 28.39% <ø> (+0.68%) :arrow_up:
src/zarr/core/dtype/npy/bool.py 45.61% <0.00%> (-0.82%) :arrow_down:
src/zarr/core/dtype/npy/common.py 61.53% <50.00%> (-0.19%) :arrow_down:
src/zarr/core/dtype/npy/complex.py 47.61% <0.00%> (-0.58%) :arrow_down:
src/zarr/core/dtype/npy/float.py 46.87% <0.00%> (-0.50%) :arrow_down:
src/zarr/core/dtype/npy/int.py 53.41% <0.00%> (-0.17%) :arrow_down:
src/zarr/core/dtype/npy/string.py 44.11% <0.00%> (-0.33%) :arrow_down:
src/zarr/core/dtype/npy/structured.py 55.78% <0.00%> (-0.60%) :arrow_down:
src/zarr/core/dtype/npy/time.py 52.84% <0.00%> (-0.31%) :arrow_down:
src/zarr/dtype.py 0.00% <0.00%> (ø)
... and 4 more
:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Oct 28 '25 20:10 codecov[bot]

I am assuming that https://github.com/zarr-developers/zarr-extensions/pull/38 will be merged soon, as it was approved by @normanrz.

the Bytes data type now reads the tuple-of-ints JSON fill value defined in the current text of the "bytes" dtype spec, but it reads (and writes) the base64-encoded string form that the future version of the "bytes" dtype spec softly recommends.

The scope of this PR has broadened somewhat because we are effectively replacing one data type (VariableLengthBytes) with a new one (Bytes).

These two data types cannot co-exist in the data type registry, because they have the same zarr v2 JSON representation ({"id": "|O", "object_codec_id": "vlen-bytes"}), and the registry treats "1 JSON input matches 2 or more data type classes" as an error.

So I removed VariableLengthBytes from the registry, and added logic to Bytes to ensure that the JSON string "variable_length_bytes" is an alias for "bytes", which guarantees that old data will be readable by the new data type.

For any users who are committed to the VariableLengthBytes data type for some reason, I added a function to reset the data type registry to the old state (de-registering Bytes and registering VariableLengthBytes in its place), and a function to reverse this transformation. We need to foreground these functions in our documentation for the bytes dtype.

This is not a breaking change for runtime functionality (Bytes and VariableLengthBytes are effectively identical), but I think this is a breaking change for anyone who relies on the round-trip integrity of old data -- data saved with "data_type" : "variable_length_bytes" will be resaved with "data_type": "bytes" after this change. For that reason I think we should ship this in a 3.2 release.

d-v-b avatar Nov 17 '25 09:11 d-v-b