numcodecs
numcodecs copied to clipboard
add raw codec
This adds a raw codec that doesn't apply any transformation to the content of the input data. If we add this to numcodecs, then we can remove some "is the compressor a compressor or is it None
?" logic over in zarr-python
.
- [x] Unit tests and/or doctests in docstrings
- [ ]
tox -e py39
passes locally - [ ] Docstrings and API docs for any new/modified user-facing classes and functions
- [ ] Changes documented in docs/release.rst
- [ ]
tox -e docs
passes locally - [ ] GitHub Actions CI passes
- [ ] Test coverage to 100% (Coveralls passes)
This is a reasonable thing to do. Is "raw" the best name? It could be "noop" or "dummy", and I would add more text to the docstring explaining what it's for.
Or Identity
? Passthrough
? ...?
I'm open to name alternatives; "noop" might be confusing because there's a (tiny) cost associated with running encode
or decode
, so it's not literally a no-op. As for "dummy", I'm not sure if that fits since this is a real codec. It was originally called "identity", but I thought "raw" would be more intuitive to users coming from a storage / compression background.
Holding off here for further feedback.
Think Martin's point is raw
is unclear. So would suggest changing.
Since dummy
can have negative connotations in some contexts, would suggest we avoid that one.
No strong feelings about the other options.
What about NoneCodec
i.e. you're either going to get a None
object or a NoneCodec
implementation, both of which do nothing. This feels quite close to https://en.wikipedia.org/wiki/Null_object_pattern
Think one of the things we discussed the last community call is whether we can alleviate the need for having this at all. IOW have some way to specify no codec is used. This seems like the better solution. Would like a clearer understanding as to the hurdles blocking that today.
Definitely no objection to that, but would that be v3 only?
It's already possible in v2 to specify null
for no codec or no filter
Think I'm still missing context on why this is needed. If there is an issue in the spec, we should clarify it, but we may need help from @d-v-b to identify what the issue is so we can fix it
Think I'm still missing context on why this is needed. If there is an issue in the spec, we should clarify it, but we may need help from @d-v-b to identify what the issue is so we can fix it
The type Codec
is nicer than the type Union[Codec, None]
. If "do nothing" can be expressed as the action of a particular codec (e.g., this one), then we can use that to avoid instances of "if codec is None.... else....
in zarr-python
and potentially other places. null
can still be used as an alias for the JSON serialization of this codec.
Actually, isn't the codecs entry a List[Codec], which can be empty? This is how it is treated internally in v2 when we specify None, although I'm not immediately sure what is written in the .zarray file.
In zarr-python
, the compressor
property of an array is either a Codec
or None
: https://github.com/zarr-developers/zarr-python/blob/main/zarr/core.py#L264. With a NoneCodec / RawCodec
, we can remove the need for None
here.
Specifically, adding this codec will obviate the following "is the compressor None
or is it a Codec
?" checks in zarr-python
:
- https://github.com/zarr-developers/zarr-python/blob/dcc6ded4cb9b9fcc24e69b5fd5d898e97e0630dd/zarr/core.py#L295
- https://github.com/zarr-developers/zarr-python/blob/dcc6ded4cb9b9fcc24e69b5fd5d898e97e0630dd/zarr/core.py#L1856
- https://github.com/zarr-developers/zarr-python/blob/dcc6ded4cb9b9fcc24e69b5fd5d898e97e0630dd/zarr/core.py#L2144
- https://github.com/zarr-developers/zarr-python/blob/dcc6ded4cb9b9fcc24e69b5fd5d898e97e0630dd/zarr/core.py#L2193
Thought the typing syntax was Optional[Codec]
(though one can write this other ways of course).
In any event given this is a typing
thing, am less inclined to add this. If it is helpful to have an alias to typing syntax, that seems ok.
Interesting didn't see your last comment yet. Will need to mull on that particular issue more. Though am still not sure adding a no-op codec is the solution
This is not merely a typing
thing -- this PR contains a real codec, not just type aliases :). As a side effect of creating this codec, we can simplify code in zarr-python
(which is the goal). I don't really care about type annotations, I'm focused on the actual types.
One could make the argument that a real no-op codec is a better example of how a codec should look that the current abstract base class :)
any chance of movement on this? I don't think there's any value to nullability here. (i'm acutely peeved because I had to write a bunch of if array.compressor is not None...
checks over in another repo :)
after discussion in the zarr community meeting, it looks like this aim is easier to accomplish in zarr-python
, so I'm going to close this and open a new PR in zarr-python