add bytes dtype
This PR adds a Bytes dtype that is nearly identical to the existing VariableLengthBytes dtype save a few differences:
- the V3 JSON form is
"bytes"instead of"variable_length_bytes" - The fill value representation is an array of ints, instead of a base64-encoded string
Bytesis consistent with a published spec, instead of not being described by a spec.
The latter point is the most important.
Because Bytes is nearly identical to VariableLengthBytes, it could be a drop-in replacement for VariableLengthBytes, save for the JSON fill value encoding differences between the two codecs, which raises two questions:
- is base64 encoding a bytestring a better or worse JSON serialization than representing the same bytestring as an array of integers?
- could (or should) we amend the bytes data type spec to recommend reading (or reading and writing) both fill value encodings? If so, the
bytesdata type and thevariable_length_bytesdata types could be fused completely.
Since zarr python 2.x was saving bytes fill values as base64-encoded strings, there's a bit of inertia there. Would also be good to hear thoughts from other implementers (@LDeakin , @jbms, @manzt )
Codecov Report
:x: Patch coverage is 51.85185% with 52 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 60.73%. Comparing base (b873691) to head (c4010dd).
Additional details and impacted files
@@ Coverage Diff @@
## main #3559 +/- ##
==========================================
- Coverage 60.90% 60.73% -0.17%
==========================================
Files 86 86
Lines 10174 10267 +93
==========================================
+ Hits 6196 6236 +40
- Misses 3978 4031 +53
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/zarr/core/dtype/common.py | 28.39% <ø> (+0.68%) |
:arrow_up: |
| src/zarr/core/dtype/npy/bool.py | 45.61% <0.00%> (-0.82%) |
:arrow_down: |
| src/zarr/core/dtype/npy/common.py | 61.53% <50.00%> (-0.19%) |
:arrow_down: |
| src/zarr/core/dtype/npy/complex.py | 47.61% <0.00%> (-0.58%) |
:arrow_down: |
| src/zarr/core/dtype/npy/float.py | 46.87% <0.00%> (-0.50%) |
:arrow_down: |
| src/zarr/core/dtype/npy/int.py | 53.41% <0.00%> (-0.17%) |
:arrow_down: |
| src/zarr/core/dtype/npy/string.py | 44.11% <0.00%> (-0.33%) |
:arrow_down: |
| src/zarr/core/dtype/npy/structured.py | 55.78% <0.00%> (-0.60%) |
:arrow_down: |
| src/zarr/core/dtype/npy/time.py | 52.84% <0.00%> (-0.31%) |
:arrow_down: |
| src/zarr/dtype.py | 0.00% <0.00%> (ø) |
|
| ... and 4 more |
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
I am assuming that https://github.com/zarr-developers/zarr-extensions/pull/38 will be merged soon, as it was approved by @normanrz.
the Bytes data type now reads the tuple-of-ints JSON fill value defined in the current text of the "bytes" dtype spec, but it reads (and writes) the base64-encoded string form that the future version of the "bytes" dtype spec softly recommends.
The scope of this PR has broadened somewhat because we are effectively replacing one data type (VariableLengthBytes) with a new one (Bytes).
These two data types cannot co-exist in the data type registry, because they have the same zarr v2 JSON representation ({"id": "|O", "object_codec_id": "vlen-bytes"}), and the registry treats "1 JSON input matches 2 or more data type classes" as an error.
So I removed VariableLengthBytes from the registry, and added logic to Bytes to ensure that the JSON string "variable_length_bytes" is an alias for "bytes", which guarantees that old data will be readable by the new data type.
For any users who are committed to the VariableLengthBytes data type for some reason, I added a function to reset the data type registry to the old state (de-registering Bytes and registering VariableLengthBytes in its place), and a function to reverse this transformation. We need to foreground these functions in our documentation for the bytes dtype.
This is not a breaking change for runtime functionality (Bytes and VariableLengthBytes are effectively identical), but I think this is a breaking change for anyone who relies on the round-trip integrity of old data -- data saved with "data_type" : "variable_length_bytes" will be resaved with "data_type": "bytes" after this change. For that reason I think we should ship this in a 3.2 release.