
Enhancing codec descriptions

Open pvanlaake opened this issue 2 months ago • 15 comments

In the specification, codecs are described like this (bytes):

Defines an array -> bytes codec that encodes arrays of fixed-size numeric data types as a sequence of bytes in lexicographical order.

But surely this is only half the story. As the term implies, a codec has an encoding operation and an inverse decoding operation. So can this be modified like so:

Defines an array -> bytes encoder that transforms array~~s of fixed-size numeric~~ data ~~types~~ as a sequence of bytes in ~~lexicographical~~ the order the data is stored in the array, and a bytes -> array decoder for the inverse operation.

The strike-outs are some further comments on the bytes codec, which presumably should operate on data of any type, including non-numerical data types such as logical, datetime and string. I also find the term "lexicographical order" rather vague in this context (I have not been able to find a formal definition of it in the spec); the only practical arrangement I can think of is that the byte stream respects the order of the data in the array.

Back to the codecs: the other codecs could similarly use a more explicit reference to encoding and decoding.

pvanlaake avatar Sep 29 '25 08:09 pvanlaake

I agree that the language could be made more precise. We should definitely add language that explains what is being lexicographically ordered here -- the array indices, and thereby the elements of the array in the output byte stream.

For a 2x2 array, the bytes codec will generate a byte stream with elements arranged in the following order (in 2-d array indices): ((0,0), (0, 1), (1, 0), (1,1)). An example like this would probably be helpful.
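A minimal numpy sketch of that ordering (the array, its values, and the dtype are purely illustrative):

```python
import numpy as np

# 2x2 array; the value 10*i + j at index (i, j) makes the
# element order visible in the serialized byte stream.
arr = np.array([[0, 1], [10, 11]], dtype=np.uint8)

# Lexicographic order over the indices: (0,0), (0,1), (1,0), (1,1)
order = sorted(np.ndindex(arr.shape))
encoded = bytes(int(arr[idx]) for idx in order)

print(order)          # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(list(encoded))  # [0, 1, 10, 11]
```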

d-v-b avatar Sep 29 '25 08:09 d-v-b

On the ordering, this is dependent on how the data is stored in the array. Your example is assuming row-major ordering, but there can also be column-major ordering, in which case the order becomes: ((0,0), (1,0), (0,1), (1,1))

pvanlaake avatar Sep 29 '25 08:09 pvanlaake

On the ordering, this is dependent on how the data is stored in the array. Your example is assuming row-major ordering, but there can also be column-major ordering, in which case the order becomes: ((0,0), (1,0), (0,1), (1,1))

that's right, we are assuming C-contiguous order here. The spec should make this much more clear.

d-v-b avatar Sep 29 '25 09:09 d-v-b

The strike-outs are some further comments on the bytes codec which presumably should operate on data of any type, including non-numerical data types such as logical, datetime and string.

The bytes codec declares the physical representation of a data type, which is why it contains a finite set of supported data types. This is necessary because the zarr v3 data type model does not define a physical representation. A consequence of this design is that every new data type requires creating a new array-bytes codec, or amending an existing one, to define its physical representation! There's almost certainly a way to make this simpler.

d-v-b avatar Sep 29 '25 09:09 d-v-b

On the ordering, this is dependent on how the data is stored in the array. Your example is assuming row-major ordering, but there can also be column-major ordering, in which case the order becomes: ((0,0), (1,0), (0,1), (1,1))

that's right, we are assuming C-contiguous order here. The spec should make this much more clear.

I should clarify, this statement is inaccurate. There is no requirement that the input to the bytes codec have its elements arranged in a particular memory order. Codecs like the transpose codec can return views of an array where the order of array dimensions have been permuted, without changing the underlying representation of the array. The only requirement for the bytes codec is that the input array is serialized in lexicographic order with respect to its array indices.
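A quick numpy illustration of that requirement (numpy stands in for a codec pipeline here; this is only a sketch): a column-major copy of an array has a different memory layout, but serializing in lexicographic index order yields the same byte stream.

```python
import numpy as np

arr_c = np.array([[0, 1], [10, 11]], dtype=np.uint8)
arr_f = np.asfortranarray(arr_c)  # same indices, column-major memory layout

assert not arr_f.flags["C_CONTIGUOUS"]
# tobytes() serializes in lexicographic index order by default,
# regardless of the underlying memory layout:
assert arr_f.tobytes() == arr_c.tobytes() == bytes([0, 1, 10, 11])
```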

d-v-b avatar Sep 29 '25 09:09 d-v-b

A consequence of this design is that every new data type requires creating a new array-bytes codec,

Hopefully not. We are already defining how extension data types interact with codecs like bytes in the zarr-extensions repo.

For example, https://github.com/zarr-developers/zarr-extensions/tree/main/data-types/complex_float8_e8m0fnu#bytes

complex_float8_e8m0fnu data type

Codec compatibility

bytes

Encoded as 2 consecutive (real component followed by imaginary component) 1-byte values, each encoded as specified by the float8_e8m0fnu data type. The "endian" parameter has no effect.

Zarr V3 still needs a standardised codec for variable length data types that is not coupled to the data type (like this one).

LDeakin avatar Sep 29 '25 09:09 LDeakin

@LDeakin IMO a data type spec declaring how the bytes codec should operate on it is an instance of the second branch of the "creating a new array-bytes codec, or amending an existing one". This practice makes a close read of the bytes codec spec tricky -- we should probably fix the spec if data types specs are defining addenda to codec specs.

d-v-b avatar Sep 29 '25 09:09 d-v-b

This all points to adding a paragraph or two to the bytes codec on datatype extensions.

As I understand it, there are two properties of any data type that need to be considered:

  1. Its binary representation. This is straightforward for the core data types, as well as for other well-defined data types such as UTF-8 or datetime, and should be mostly (entirely?) irrelevant for the bytes codec. For the additionally defined datatype extensions I can see some more exotic definitions that are possibly only relevant within a certain field of application, and I cannot say anything about how that impacts the bytes codec.
  2. Byte ordering in a stream for transmission or storage. This is the core of what the bytes codec must deal with. So as long as the binary representation of a datatype is known, this step should be straightforward. This is also irrespective of length. A vlen, for instance, would just be a stream of bytes from the perspective of the bytes codec and effectively be a no-op, with the code behind the extension definition dealing with interpretation of the vlen object.

Am I seeing that correctly?

pvanlaake avatar Sep 29 '25 10:09 pvanlaake

This all points to adding a paragraph or two to the bytes codec on datatype extensions.

As I understand it, there are two properties of any data type that need to be considered:

  1. Its binary representation. This is straightforward for the core data types, as well as for other well-defined data types such as UTF-8 or datetime, and should be mostly (entirely?) irrelevant for the bytes codec. For the additionally defined datatype extensions I can see some more exotic definitions that are possibly only relevant within a certain field of application, and I cannot say anything about how that impacts the bytes codec.

There isn't necessarily just one binary representation. For a given data type one can be specified for the bytes codec.

Variable-length data types aren't supported by the bytes codec, instead vlen-bytes or a yet-to-be-proposed one can be used.

  2. Byte ordering in a stream for transmission or storage. This is the core of what the bytes codec must deal with. So as long as the binary representation of a datatype is known, this step should be straightforward. This is also irrespective of length. A vlen, for instance, would just be a stream of bytes from the perspective of the bytes codec and effectively be a no-op, with the code behind the extension definition dealing with interpretation of the vlen object.

Am I seeing that correctly?

Big endian doesn't just mean reverse the byte representation of each element --- see e.g. complex numbers.

For each data type supported by the bytes codec, you need to define the big and little endian representations (which may be the same). It would be nicer if this could be done in a consistent place for both core and extension data types but I don't think it matters that much and the core/extension split into two repositories makes that difficult.
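The complex-number point can be checked with numpy (a sketch; `<c16`/`>c16` are numpy's little/big-endian complex128 dtypes): byte-swapping happens per 8-byte component, not across the whole 16-byte element.

```python
import numpy as np

z = np.array([1.0 + 2.0j], dtype="<c16")  # little-endian complex128
le = z.tobytes()
be = z.astype(">c16").tobytes()

# Each 8-byte component is byte-swapped independently...
assert be[:8] == le[:8][::-1]   # real part
assert be[8:] == le[8:][::-1]   # imaginary part
# ...which is not the same as reversing the whole 16-byte element:
assert be != le[::-1]
```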

jbms avatar Sep 29 '25 11:09 jbms

As for making the decoding operation explicit --- codec means encoder/decoder. For lossless ones (all that are currently specified) the decoding operation doesn't need to be defined mathematically (since it is uniquely determined as the inverse of encoding) though there could still be relevant implementation notes. For a lossy codec, both the encoding and decoding operations should indeed be specified.

jbms avatar Sep 29 '25 11:09 jbms

It would be nicer if this could be done in a consistent place for both core and extension data types but I don't think it matters that much and the core/extension split into two repositories makes that difficult.

Evidently this is a source of confusion. I think we could have a non-breaking clarification with the following changes:

  • We move content from the bytes codec data types table to the specs for those data types. We call this representation the set of default binary representations for arrays of that data type.
  • Codecs can refer to the "default binary serialization" of a data type.
  • The bytes codec no longer references specific data types. Instead, it can say "this codec generates the default binary serialization (parametrized by endianness) of any array with a fixed-sized data type. Arrays with variable-sized data types are not compatible with the bytes codec".
  • Other codecs are free to define alternative, data type-specific binary representations.

d-v-b avatar Sep 29 '25 11:09 d-v-b

It would be nicer if this could be done in a consistent place for both core and extension data types but I don't think it matters that much and the core/extension split into two repositories makes that difficult.

Evidently this is a source of confusion. I think we could have a non-breaking clarification with the following changes:

  • We move content from the bytes codec data types table to the specs for those data types. We call this representation the set of default binary representations for arrays of that data type.
  • Codecs can refer to the "default binary serialization" of a data type.
  • The bytes codec no longer references specific data types. Instead, it can say "this codec generates the default binary serialization (parametrized by endianness) of any array with a fixed-sized data type. Arrays with variable-sized data types are not compatible with the bytes codec".
  • Other codecs are free to define alternative, data type-specific binary representations.

That just adds an extra indirection for the bytes codec and treating the bytes codec specially may create more confusion.

The same issue also applies to packbits, and in general a new codec may support existing data types, and a new data type may support existing codecs.

jbms avatar Sep 29 '25 11:09 jbms

That just adds an extra indirection for the bytes codec

The goal is to be consistent in our use of indirection. Today we are inconsistently indirect, which is confusing -- the bytes codec directly defines its compatibility for some data types (the core data types), while compatibility with other data types is indirectly defined. IMO it would be better to be consistent here, and include text in the bytes codec definition that describes how it merely re-uses a property defined on data types. And this doesn't make the bytes codec special -- another codec could use the same indirection.

d-v-b avatar Sep 29 '25 11:09 d-v-b

That just adds an extra indirection for the bytes codec

The goal is to be consistent in our use of indirection. Today we are inconsistently indirect, which is confusing -- the bytes codec directly defines its compatibility for some data types (the core data types), while compatibility with other data types is indirectly defined. IMO it would be better to be consistent here, and include text in the bytes codec definition that describes how it merely re-uses a property defined on data types. And this doesn't make the bytes codec special -- another codec could use the same indirection.

Effectively that is what we have, where the property is "the element representation for the bytes codec".

But for packbits since it is an extension, with the current core/extensions split the definition for the core data types has to be a part of the packbits definition.

It is true that for both bytes and packbits the representation could be described in terms of some general principles in order to avoid explicitly defining the representation for each data type, but I think that would be a lot less clear than just stating the representation explicitly for each data type.

jbms avatar Sep 29 '25 12:09 jbms

currently, the packbits codec refers to the behavior of the bytes codec. I might be in the minority but I find it surprising that implementing the packbits codec requires understanding the bytes codec spec. IMO It would be more intuitive for the packbits codec and the bytes codec to share a dependency on some definition that's not yet another codec, and the natural place to put that would be with data types. And including a default binary representation per data type would also be intuitive for people who expect to find such information associated with a data type definition.

d-v-b avatar Sep 29 '25 13:09 d-v-b

As for making the decoding operation explicit --- codec means encoder/decoder. For lossless ones (all that are currently specified) the decoding operation doesn't need to be defined mathematically (since it is uniquely determined as the inverse of encoding) though there could still be relevant implementation notes. For a lossy codec, both the encoding and decoding operations should indeed be specified.

I don't think that mathematical invertibility, with the decoding operation left to be uniquely determined from a description of the encoding alone, is adequate in the context of the core specifications. Not all users of the core specification are mathematicians, after all. How much does it hurt to be more verbose in the core specification? Should the specs perhaps use the term "encoder" instead of "codec"?

That just adds an extra indirection for the bytes codec and treating the bytes codec specially may create more confusion.

Indirection should not be an issue, just so long as any indirection leads to an unequivocal answer/result. But why would this be specific to the bytes codec? I just used the bytes codec as an example in my opening post, the same would apply to all codecs.

pvanlaake avatar Nov 14 '25 20:11 pvanlaake

As for making the decoding operation explicit --- codec means encoder/decoder. For lossless ones (all that are currently specified) the decoding operation doesn't need to be defined mathematically (since it is uniquely determined as the inverse of encoding) though there could still be relevant implementation notes. For a lossy codec, both the encoding and decoding operations should indeed be specified.

I think in some cases, such as the bytes codec, the decode definition is sufficiently straightforward based on the encode definition that stating it separately wouldn't add any value. But certainly in other cases there may be non-obvious details that could be mentioned.

I don't think that mathematical invertibility, with the decoding operation left to be uniquely determined from a description of the encoding alone, is adequate in the context of the core specifications. Not all users of the core specification are mathematicians, after all. How much does it hurt to be more verbose in the core specification? Should the specs perhaps use the term "encoder" instead of "codec"?

That just adds an extra indirection for the bytes codec and treating the bytes codec specially may create more confusion.

Indirection should not be an issue, just so long as any indirection leads to an unequivocal answer/result. But why would this be specific to the bytes codec? I just used the bytes codec as an example in my opening post, the same would apply to all codecs.

This comment was in reference to a specific suggestion by @d-v-b that applied only to the bytes codec.

jbms avatar Nov 14 '25 21:11 jbms

I think in some cases, such as the bytes codec, the decode definition is sufficiently straightforward based on the encode definition that stating it separately wouldn't add any value. But certainly in other cases there may be non-obvious details that could be mentioned.

From the perspective of the user of the specifications it would no doubt be much clearer if both operations were made explicit.

pvanlaake avatar Nov 14 '25 22:11 pvanlaake

I think in some cases, such as the bytes codec, the decode definition is sufficiently straightforward based on the encode definition that stating it separately wouldn't add any value. But certainly in other cases there may be non-obvious details that could be mentioned.

From the perspective of the user of the specifications it would no doubt be much clearer if both operations were made explicit.

Here is the current description for bytes:

Each element of the array is encoded using the specified endian variant of its binary representation listed below. Array elements are encoded in lexicographical order. For example, with endian specified as big, the int32 data type is encoded as a 4-byte big endian two’s complement integer, and the complex128 data type is encoded as two consecutive 8-byte big endian IEEE 754 binary64 values.

Arguably this is already just a description of the format rather than an algorithm for encoding/decoding. Are you proposing that we call this the "encode" algorithm and for the "decode" algorithm have the same text except that we substitute "decoded" for "encoded"?
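For what it's worth, the quoted int32 case can be sketched with Python's `struct` module, which also shows the decode side falling out as the unique inverse (the value chosen is arbitrary):

```python
import struct

# int32, endian="big": a 4-byte big-endian two's complement integer.
value = -2
encoded = struct.pack(">i", value)
assert encoded == b"\xff\xff\xff\xfe"

# Decoding is simply the inverse:
(decoded,) = struct.unpack(">i", encoded)
assert decoded == value
```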

jbms avatar Nov 14 '25 22:11 jbms

Yes and no.

Yes for describing the inverse operation.

No for just repeating the same text with a single operational verb inverted, the suggestion of which is really just a perversion of my initial post. But even in these limited terms, there could be a more concise description of the inverse operation.

The problem, as I see it, is that the inverse operation is not mentioned at all in either codec, nor is the fact that going from store-level bytes to an array of data requires inverting the entire chain of codecs. Currently that requires inference, and I am strongly arguing against that, given that this is a document that sets standards for implementers. Better to be accurate than to leave things open to interpretation.

pvanlaake avatar Nov 14 '25 22:11 pvanlaake

Yes and no.

Yes for describing the inverse operation.

No for just repeating the same text with a single operational verb inverted, the suggestion of which is really just a perversion of my initial post. But even in these limited terms, there could be a more concise description of the inverse operation.

I'm certainly not opposed to making the spec more clear. But as far as this specific point I'm honestly not sure what additional clarifications you are looking for --- perhaps it would be simplest to just open a PR.

Looking at the codecs defined in the zarr-specs repo:

  • Blosc and gzip codecs: these just reference an external format and don't directly describe encoding or decoding.
  • Sharding: Already has detailed information about encoding and decoding
  • Bytes: Discussed already
  • Crc32c: I suppose we could add a note that implementations should verify the crc32c checksum when decoding.
  • Transpose: I think this is similar to bytes in that the current description works equally well for encoding and decoding.

The problem, as I see it, is that the inverse operation is not mentioned at all in either codec, nor is the fact that going from store-level bytes to an array of data requires inverting the entire chain of codecs. Currently that requires inference, and I am strongly arguing against that, given that this is a document that sets standards for implementers. Better to be accurate than to leave things open to interpretation.

For the overall encoding/decoding there are separate descriptions here:

https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#encoding-procedure https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#decoding-procedure
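Those procedures amount to running the codec chain forward for encoding and applying each codec's inverse in reverse order for decoding. A hypothetical sketch (the two-codec chain and the lambdas are illustrative, not taken from the spec):

```python
import gzip

import numpy as np

# Hypothetical chain: "bytes" (array -> bytes) followed by "gzip" (bytes -> bytes).
encode_chain = [
    lambda a: a.tobytes(),  # serialize in lexicographic index order
    gzip.compress,
]
# Decoding inverts each codec, applied in reverse order:
decode_chain = [
    gzip.decompress,
    lambda b: np.frombuffer(b, dtype=np.int32).reshape(2, 2),
]

arr = np.arange(4, dtype=np.int32).reshape(2, 2)

data = arr
for step in encode_chain:
    data = step(data)

restored = data
for step in decode_chain:
    restored = step(restored)

assert (restored == arr).all()
```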

jbms avatar Nov 14 '25 22:11 jbms