numcodecs icon indicating copy to clipboard operation
numcodecs copied to clipboard

add raw codec

Open d-v-b opened this issue 2 years ago • 17 comments

This adds a raw codec that doesn't apply any transformation to the content of the input data. If we add this to numcodecs, then we can remove some "is the compressor a compressor or is it None?" logic over in zarr-python.

  • [x] Unit tests and/or doctests in docstrings
  • [ ] tox -e py39 passes locally
  • [ ] Docstrings and API docs for any new/modified user-facing classes and functions
  • [ ] Changes documented in docs/release.rst
  • [ ] tox -e docs passes locally
  • [ ] GitHub Actions CI passes
  • [ ] Test coverage to 100% (Coveralls passes)

d-v-b avatar Jul 06 '22 14:07 d-v-b

This is a reasonable thing to do. Is "raw" the best name? It could be "noop" or "dummy", and I would add more text to the docstring explaining what it's for.

martindurant avatar Jul 06 '22 19:07 martindurant

Or Identity? Passthrough? ...?

jakirkham avatar Jul 06 '22 19:07 jakirkham

I'm open to name alternatives; "noop" might be confusing because there's a (tiny) cost associated with running encode or decode, so it's not literally a no-op. As for "dummy", I'm not sure if that fits since this is a real codec. It was originally called "identity", but I thought "raw" would be more intuitive to users coming from a storage / compression background.

d-v-b avatar Jul 06 '22 19:07 d-v-b

Holding off here for further feedback.

joshmoore avatar Jul 26 '22 03:07 joshmoore

Think Martin's point is raw is unclear. So would suggest changing.

Since dummy can have negative connotations in some contexts, would suggest we avoid that one.

No strong feelings about the other options.

jakirkham avatar Jul 26 '22 03:07 jakirkham

What about NoneCodec i.e. you're either going to get a None object or a NoneCodec implementation, both of which do nothing. This feels quite close to https://en.wikipedia.org/wiki/Null_object_pattern

joshmoore avatar Aug 01 '22 19:08 joshmoore

Think one of the things we discussed the last community call is whether we can alleviate the need for having this at all. IOW have some way to specify no codec is used. This seems like the better solution. Would like a clearer understanding as to the hurdles blocking that today.

jakirkham avatar Aug 01 '22 20:08 jakirkham

Definitely no objection to that, but would that be v3 only?

joshmoore avatar Aug 01 '22 20:08 joshmoore

It's already possible in v2 to specify null for no codec or no filter

Think I'm still missing context on why this is needed. If there is an issue in the spec, we should clarify it, but we may need help from @d-v-b to identify what the issue is so we can fix it

jakirkham avatar Aug 01 '22 20:08 jakirkham

Think I'm still missing context on why this is needed. If there is an issue in the spec, we should clarify it, but we may need help from @d-v-b to identify what the issue is so we can fix it

The type Codec is nicer than the type Union[Codec, None]. If "do nothing" can be expressed as the action of a particular codec (e.g., this one), then we can use that to avoid instances of "if codec is None.... else.... in zarr-python and potentially other places. null can still be used as an alias for the JSON serialization of this codec.

d-v-b avatar Aug 01 '22 21:08 d-v-b

Actually, isn't the codecs entry a List[Codec], which can be empty? This is how it is treated internally in v2 when we specify None, although I'm not immediately sure what is written in the .zarray file.

martindurant avatar Aug 02 '22 14:08 martindurant

In zarr-python, the compressor property of an array is either a Codec or None: https://github.com/zarr-developers/zarr-python/blob/main/zarr/core.py#L264. With a NoneCodec / RawCodec, we can remove the need for None here.

d-v-b avatar Aug 02 '22 20:08 d-v-b

Specifically, adding this codec will obviate the following "is the compressor None or is it a Codec?" checks in zarr-python:

  • https://github.com/zarr-developers/zarr-python/blob/dcc6ded4cb9b9fcc24e69b5fd5d898e97e0630dd/zarr/core.py#L295
  • https://github.com/zarr-developers/zarr-python/blob/dcc6ded4cb9b9fcc24e69b5fd5d898e97e0630dd/zarr/core.py#L1856
  • https://github.com/zarr-developers/zarr-python/blob/dcc6ded4cb9b9fcc24e69b5fd5d898e97e0630dd/zarr/core.py#L2144
  • https://github.com/zarr-developers/zarr-python/blob/dcc6ded4cb9b9fcc24e69b5fd5d898e97e0630dd/zarr/core.py#L2193

d-v-b avatar Aug 02 '22 20:08 d-v-b

Thought the typing syntax was Optional[Codec] (though one can write this other ways of course).

In any event given this is a typing thing, am less inclined to add this. If it is helpful to have an alias to typing syntax, that seems ok.

jakirkham avatar Aug 02 '22 20:08 jakirkham

Interesting didn't see your last comment yet. Will need to mull on that particular issue more. Though am still not sure adding a no-op codec is the solution

jakirkham avatar Aug 02 '22 20:08 jakirkham

This is not merely a typing thing -- this PR contains a real codec, not just type aliases :). As a side effect of creating this codec, we can simplify code in zarr-python (which is the goal). I don't really care about type annotations, I'm focused on the actual types.

d-v-b avatar Aug 02 '22 20:08 d-v-b

One could make the argument that a real no-op codec is a better example of how a codec should look that the current abstract base class :)

martindurant avatar Aug 02 '22 20:08 martindurant

any chance of movement on this? I don't think there's any value to nullability here. (i'm acutely peeved because I had to write a bunch of if array.compressor is not None... checks over in another repo :)

d-v-b avatar Jun 01 '23 19:06 d-v-b

after discussion in the zarr community meeting, it looks like this aim is easier to accomplish in zarr-python, so I'm going to close this and open a new PR in zarr-python

d-v-b avatar Oct 04 '23 18:10 d-v-b