[v3] codec design in v3
In V3 today, we have two, or maybe three, ways of representing codecs (codices?)
- the codec configuration, which is basically the set of arguments that define the behavior of the codec. E.g., for zstandard:
class ZstdCodecConfigurationMetadata: level: int = 0 checksum: bool = False - the codec metadata, which is just the codec configuration + a name. Again, for zstandard:
class ZstdCodecMetadata:
configuration: ZstdCodecConfigurationMetadata
name: Literal["zstd"] = field(default="zstd", init=False)
- There is also the codec itself, which is created by first picking a codec class (using the
namefield of the codec metadata), then instantiating an instance of that class using the parameters in theconfiguration.
class ZstdCodec(BytesBytesCodec):
configuration: ZstdCodecConfigurationMetadata
is_fixed_size = True
@classmethod
def from_metadata(cls, codec_metadata: CodecMetadata) -> ZstdCodec:
assert isinstance(codec_metadata, ZstdCodecMetadata)
return cls(configuration=codec_metadata.configuration)
@classmethod
def get_metadata_class(cls) -> Type[ZstdCodecMetadata]:
return ZstdCodecMetadata
def _compress(self, data: bytes) -> bytes:
ctx = ZstdCompressor(
level=self.configuration.level, write_checksum=self.configuration.checksum
)
return ctx.compress(data)
def _decompress(self, data: bytes) -> bytes:
ctx = ZstdDecompressor()
return ctx.decompress(data)
# additional methods omitted
For working with data (compression and decompression), we actually only need the last representation -- the first two representations only exist as a means to se/deserialize the actual codec. But reifying the metadata representation of the codec as standalone classes results in a cluttered API -- You might naively expect the ZstdCodec codec, which is parameterized by a level, to have a level attribute, but it doesn't, because that attribute lives in ZstdCodec.configuration. We could get around this with property decorators, but we could also just make level an attribute of ZstdCodec and do away with the configuration attribute.
I think we can make this a lot simpler, and it will improve the developer experience by reducing indirection and duplicated representations of codecs. Coincidentally, it would bring us closer to the current API design of numcodecs. Specifically, I think we should:
- we remove the
XCodecMetadataandXCodecConfigurationclasses, and move their logic intoto_dict/from_dictmethods forXCodec. - remove
get_metadataandfrom_metadata_classmethods - make the parameters of the codecs attributes of the codec classes -- if
ZstdCodecis parameterized by alevel, thenZstdCodec.levelshould be an attribute, andZstdCodecshould takelevelin its constructor.
I plan on implementing this in a PR, but if someone beats me to it that would also be great. And of course if anyone thinks this is a terrible idea, I'm happy to discuss.
cc @normanrz
* we remove the `XCodecMetadata` and `XCodecConfiguration` classes, and move their logic into `to_dict` / `from_dict` methods for `XCodec`. * remove `get_metadata` and `from_metadata_class` methods * make the parameters of the codecs attributes of the codec classes -- if `ZstdCodec` is parameterized by a `level`, then `ZstdCodec.level` should be an attribute, and `ZstdCodec` should take `level` in its constructor.I plan on implementing this in a PR, but if someone beats me to it that would also be great. And of course if anyone thinks this is a terrible idea, I'm happy to discuss.
Yes, I was planning on doing that but held off to wait for your attrs-removal work because there is a large overlap. What is the status of that?
Yes, I was planning on doing that but held off to wait for your attrs-removal work because there is a large overlap. What is the status of that?
It's progressing -- I opened a PR for it here: https://github.com/zarr-developers/zarr-python/pull/1660
If you implement the proposal in this issue, I would love to merge it into my pr
I think it could potentially work to just have a single class, but there are a few complications to consider:
When creating a new array, you may want the user to be able to specify just some codec properties and have the rest be filled in automatically based on other information propagated from the metadata. For example, with blosc you may want to choose typesize and shuffle automatically based on the data type in the metadata. Or with sharding_indexed you may want to choose the sub-chunk shape automatically.
I made these changes in #1660. I'll think a bit more about how the user ergonomics can be improved. There are already hooks for creating new versions of the codec internally that contain filled in configuration (ie. using the evolve method).
Unless anyone objects, I'd like to keep this issue open until v3 is finished, in order to capture all discussion about the core codec design in v3