
Where did "zarr extensions" go, or "v2 to v3 migration guide"?

Open yarikoptic opened this issue 10 months ago • 5 comments

We were digging for the rationale behind, or a migration guide for, the removal of some data types (such as unicode strings) and ran into

  • https://github.com/zarr-developers/zarr-specs/pull/135

which was merged into the https://github.com/zarr-developers/zarr-specs/tree/core-protocol-v3.0-dev branch, which seems to no longer exist. A search across GitHub for "This specification is a Zarr protocol extension defining data types" pointed only to some forks.

Could someone please help us out and potentially point to

  • discussions on the rationale behind the removal of those data types
  • zarr extensions: are they to be formalized, and if so, where?
  • potentially some "migration" guide for users to go from v2 to v3 given those deprecations?

Thank you in advance!

yarikoptic avatar Jan 13 '25 18:01 yarikoptic

Are you referring to the NumPy fixed-length (zero-padded) string data types, like "|S10" or "<U10"?

jbms avatar Jan 13 '25 18:01 jbms

I am not sure if this is what you are looking for: https://zarr.readthedocs.io/en/latest/user-guide/v3_migration.html @yarikoptic

mavaylon1 avatar Jan 27 '25 14:01 mavaylon1

@mavaylon1 right! That answers the 2nd question.

@jbms

Are you referring to the NumPy fixed-length (zero-padded) string data types, like "|S10" or "<U10"?

Yes, but overall I was interested in the fate of all of that docs/protocol/extensions.rst material, which seems to have tried to provide extensions to support those data types.

yarikoptic avatar Jan 27 '25 22:01 yarikoptic

In talking with folks on the @zarr-developers/steering-council recently, I understand an update to the extensions conversation is coming any day now. Stay tuned.

jhamman avatar Jan 27 '25 22:01 jhamman

We decided to remove them, at least initially, because they introduced a lot of complications and the value was unclear.

  • "O" (python object): This is essentially meaningless as a data type. In practice in zarr v2 it meant you had to check the list of filters to determine the actual data type. This could be replaced in zarr v3 with (yet to be added) data types corresponding to variable-length strings or json.
  • "|S" (fixed-length byte string): A variable-length string is likely to be preferable in almost all cases. If chunk compatibility with zarr v2 is desired, this could instead be defined as a codec usable with a (yet to be added) variable-length byte string data type.
  • "<U" (fixed-length UTF-32 string): Same caveats as "|S" apply, and in addition UTF-8 encoding would be basically always better than UTF-32. For chunk compatibility this could be defined as a codec for a variable-length unicode string data type.
  • structs: See https://github.com/zarr-developers/zarr-python/issues/2134#issuecomment-2614231477

jbms avatar Jan 28 '25 01:01 jbms
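
To make the size argument in the last comment concrete, here is a minimal pure-Python sketch of the byte layouts being discussed. It only emulates how fixed-length "|S10" and "<U10" elements are laid out (zero-padded, with "<U" stored as little-endian UTF-32) and compares that with a plain UTF-8 payload. The helper names `encode_fixed_bytes`, `encode_fixed_utf32`, and `encode_utf8` are hypothetical and are not part of zarr, NumPy, or any spec. A real variable-length encoding would also need to store lengths or offsets, which this sketch ignores.

```python
# Illustrative only: emulates the per-element byte layout of NumPy-style
# fixed-length string dtypes. No zarr or NumPy API is used; all helper
# names below are hypothetical.


def encode_fixed_bytes(value: bytes, width: int) -> bytes:
    """Emulate "|S<width>": raw bytes, zero-padded to exactly `width` bytes."""
    if len(value) > width:
        raise ValueError(f"{value!r} does not fit in {width} bytes")
    return value.ljust(width, b"\x00")


def encode_fixed_utf32(value: str, width: int) -> bytes:
    """Emulate "<U<width>": little-endian UTF-32, zero-padded to `width` code points."""
    if len(value) > width:
        raise ValueError(f"{value!r} does not fit in {width} code points")
    # Every code point takes 4 bytes in UTF-32, so each element is 4 * width bytes.
    return value.encode("utf-32-le").ljust(4 * width, b"\x00")


def encode_utf8(value: str) -> bytes:
    """Payload of a variable-length UTF-8 string (length/offset bookkeeping omitted)."""
    return value.encode("utf-8")


if __name__ == "__main__":
    for word in ["zarr", "chunk", "na\u00efve"]:
        fixed = encode_fixed_utf32(word, 10)
        variable = encode_utf8(word)
        print(f"{word!r}: fixed UTF-32 = {len(fixed)} bytes, UTF-8 payload = {len(variable)} bytes")
```

With a width of 10, every "<U10" element occupies 40 bytes regardless of content, while the UTF-8 payload of a short ASCII word is only a few bytes. Compression can hide some of that padding on disk, so this is a comparison of the uncompressed layout, not a statement about final chunk sizes.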