zarr-python icon indicating copy to clipboard operation
zarr-python copied to clipboard

[v3] codec design in v3

Open d-v-b opened this issue 1 year ago • 6 comments

In V3 today, we have two, or maybe three, ways of representing codecs (codices?)

  • the codec configuration, which is basically the set of arguments that define the behavior of the codec. E.g., for zstandard:
    class ZstdCodecConfigurationMetadata:
       level: int = 0
       checksum: bool = False
    
  • the codec metadata, which is just the codec configuration + a name. Again, for zstandard:
class ZstdCodecMetadata:
    configuration: ZstdCodecConfigurationMetadata
    name: Literal["zstd"] = field(default="zstd", init=False)
  • There is also the codec itself, which is created by first picking a codec class (using the name field of the codec metadata), then instantiating an instance of that class using the parameters in the configuration.
class ZstdCodec(BytesBytesCodec):
    configuration: ZstdCodecConfigurationMetadata
    is_fixed_size = True

    @classmethod
    def from_metadata(cls, codec_metadata: CodecMetadata) -> ZstdCodec:
        assert isinstance(codec_metadata, ZstdCodecMetadata)
        return cls(configuration=codec_metadata.configuration)

    @classmethod
    def get_metadata_class(cls) -> Type[ZstdCodecMetadata]:
        return ZstdCodecMetadata
        
    def _compress(self, data: bytes) -> bytes:
        ctx = ZstdCompressor(
            level=self.configuration.level, write_checksum=self.configuration.checksum
        )
        return ctx.compress(data)

    def _decompress(self, data: bytes) -> bytes:
        ctx = ZstdDecompressor()
        return ctx.decompress(data)
    # additional methods omitted    

For working with data (compression and decompression), we actually only need the last representation -- the first two representations only exist as a means to se/deserialize the actual codec. But reifying the metadata representation of the codec as standalone classes results in a cluttered API -- You might naively expect the ZstdCodec codec, which is parameterized by a level, to have a level attribute, but it doesn't, because that attribute lives in ZstdCodec.configuration. We could get around this with property decorators, but we could also just make level an attribute of ZstdCodec and do away with the configuration attribute.

I think we can make this a lot simpler, and it will improve the developer experience by reducing indirection and duplicated representations of codecs. Coincidentally, it would bring us closer to the current API design of numcodecs. Specifically, I think we should:

  • we remove the XCodecMetadata and XCodecConfiguration classes, and move their logic into to_dict / from_dict methods for XCodec.
  • remove get_metadata and from_metadata_class methods
  • make the parameters of the codecs attributes of the codec classes -- if ZstdCodec is parameterized by a level, then ZstdCodec.level should be an attribute, and ZstdCodec should take level in its constructor.

I plan on implementing this in a PR, but if someone beats me to it that would also be great. And of course if anyone thinks this is a terrible idea, I'm happy to discuss.

d-v-b avatar Feb 08 '24 11:02 d-v-b

cc @normanrz

d-v-b avatar Feb 08 '24 12:02 d-v-b

* we remove the `XCodecMetadata` and `XCodecConfiguration` classes, and move their logic into `to_dict` / `from_dict` methods for `XCodec`.

* remove `get_metadata` and `from_metadata_class` methods

* make the parameters of the codecs attributes of the codec classes -- if `ZstdCodec` is parameterized by a `level`, then `ZstdCodec.level` should be an attribute, and `ZstdCodec` should take `level` in its constructor.

I plan on implementing this in a PR, but if someone beats me to it that would also be great. And of course if anyone thinks this is a terrible idea, I'm happy to discuss.

Yes, I was planning on doing that but held off to wait for your attrs-removal work because there is a large overlap. What is the status of that?

normanrz avatar Feb 08 '24 12:02 normanrz

Yes, I was planning on doing that but held off to wait for your attrs-removal work because there is a large overlap. What is the status of that?

It's progressing -- I opened a PR for it here: https://github.com/zarr-developers/zarr-python/pull/1660

If you implement the proposal in this issue, I would love to merge it into my pr

d-v-b avatar Feb 08 '24 14:02 d-v-b

I think it could potentially work to just have a single class, but there are a few complications to consider:

When creating a new array, you may want the user to be able to specify just some codec properties and have the rest be filled in automatically based on other information propagated from the metadata. For example, with blosc you may want to choose typesize and shuffle automatically based on the data type in the metadata. Or with sharding_indexed you may want to choose the sub-chunk shape automatically.

jbms avatar Feb 08 '24 19:02 jbms

I made these changes in #1660. I'll think a bit more about how the user ergonomics can be improved. There are already hooks for creating new versions of the codec internally that contain filled in configuration (ie. using the evolve method).

normanrz avatar Feb 09 '24 17:02 normanrz

Unless anyone objects, I'd like to keep this issue open until v3 is finished, in order to capture all discussion about the core codec design in v3

d-v-b avatar Feb 09 '24 17:02 d-v-b