zarr-specs icon indicating copy to clipboard operation
zarr-specs copied to clipboard

Define the list of codecs in the v3 spec

Open d-v-b opened this issue 1 year ago • 8 comments

Follow-up from #293

The way the v3 spec describes codecs today is inconsistent with the needs of a large number of users and implementers. Specifically, the spec says that there is no formal list of official codecs, but a number of people believe that a formal list of official codecs is either the true intention of the spec, or simply a much better specification than the alternative.

Changes in this PR:

  • removes the language stating that the spec does not list any codecs. Said language is replaced with a definition of a "core codec" (e.g., the codecs that are de facto part of the spec today), as well as a definition of a "community codec" (e.g., a codec published by a third party).
  • Zarr implementations ~MUST~ SHOULD support core codecs
  • Zarr implementations MAY support community codecs.
  • The requirement that codec identifiers be URIs that resolve to a human-readable specification has been removed, as we weren't even following that ourselves and it's not machine-checkable by definition.

Happy to take feedback, in particular on the styling. I'm bad at rst.

I think it's worth considering the application of the same "core" vs "community" distinction for other extension points, e.g. data types.

d-v-b avatar Sep 11 '24 09:09 d-v-b

Does blosc really need to be a necessary codec? I feel as though Gzip is enough for a bytes->bytes core codec since its widely available. Blosc might not have bindings in many languages, making it difficult to be supported by some implementations. For example, zarr-ml cannot support blosc since as of today there aren't any OCaml bindings to the blosc C library. On the other hand, we can confidently assume nearly all modern languages have support for Gzip/Zlib given that it has been around for over 30 years.

zoj613 avatar Sep 11 '24 20:09 zoj613

Does blosc really need to be a necessary codec?

I cannot answer this question. For context, my personal opinion is that it's not a great idea to have a mandatory list of codecs, for exactly the reasons you list. But in #293 I was in the clear minority -- everyone else participating in the discussion preferred a strictly mandated list of codecs. The core problem is that the current v3 spec is incoherent w.r.t codecs, and we need to make it coherent. This PR represents the solution that the majority of people supported in #293.

d-v-b avatar Sep 11 '24 20:09 d-v-b

Does blosc really need to be a necessary codec?

I cannot answer this question. For context, my personal opinion is that it's not a great idea to have a mandatory list of codecs, for exactly the reasons you list. But in #293 I was in the clear minority -- everyone else participating in the discussion preferred a strictly mandated list of codecs. The core problem is that the current v3 spec is incoherent w.r.t codecs, and we need to make it coherent. This PR represents the solution that the majority of people supported in #293.

I see. I wasn't aware of that discussion. I would like to also caution against defining a spec based on the features offered by the python implementations of a particular codec. For example, I noted here that the Zstd spec includes a checksum parameter that does not seem to be supported by many Zstd implementations across different languages. It appears to me that the parameter is influenced by the feature offered by the Zstd python implementation even though many other implementations do not seem to support it. This would make it difficult to full support that spec across Zarr implementations

zoj613 avatar Sep 11 '24 20:09 zoj613

@zoj613 would you mind bringing up your concerns in #293 for visibility?

d-v-b avatar Sep 11 '24 20:09 d-v-b

(I think we could salvage this PR by changing some "MUST support" statements to "SHOULD support")

d-v-b avatar Sep 11 '24 20:09 d-v-b

(I think we could salvage this PR by changing some "MUST support" statements to "SHOULD support")

This is in line with my thoughts in #293:

A zarr implementation does not need to support every codec to be conformant

and @normanrz:

Some codecs are essential to how Zarr works and should be required by all implementations. Most minimally, that is the bytes codec. Other codecs are so popular and general that all implementations should implement it

LDeakin avatar Sep 11 '24 21:09 LDeakin

I think this is ready for review. I removed the requirement that zarr implementations MUST support the core codecs. instead, they SHOULD support them.

d-v-b avatar Sep 12 '24 11:09 d-v-b

@zarr-developers/steering-council could someone have a look at this?

d-v-b avatar Sep 12 '24 14:09 d-v-b

@zarr-developers/steering-council I really would appreciate some feedback on this PR. It has been ready for review for 3 weeks with no response. What is the blocker here?

For context, in zarr-python there are efforts to implement v3 versions of zarr v2 compression routines. The spec as currently written does not offer clear guidance for this process, and I believe that the changes contained in this PR will offer much needed clarifications that will be of immediate use to at least one implementation (zarr-python). If the strategy for codecs outlined in this PR is viable, then we can apply the same approach to datatypes.

d-v-b avatar Oct 03 '24 21:10 d-v-b

@d-v-b did you decide to stick with "community" over "extension"? I weakly favor "extension," but I don't want to block this PR on that point.

rabernat avatar Oct 07 '24 13:10 rabernat

i'm happy to switch over to "extension" but i haven't changed the text yet

d-v-b avatar Oct 07 '24 13:10 d-v-b

"community" is out, "extension" is in.

d-v-b avatar Oct 08 '24 05:10 d-v-b

Few comments:

  • Thanks to @d-v-b for all the clean up!
  • On the MUST/SHOULD front (link), it might be good to make sure folks from sign off if it's reversing a previous decision.
  • As I mentioned to @d-v-b at a recent meeting, I'd would favor guidance for codec implementors wherever we have a preference on conventions moving forward, e.g., "codec names are cheap; prefer grabbing a new name (my-codec-2) over breaking the schema of your codec configuration."

joshmoore avatar Oct 09 '24 20:10 joshmoore

  • I'd would favor guidance for codec implementors wherever we have a preference on conventions moving forward, e.g., "codec names are cheap; prefer grabbing a new name (my-codec-2) over breaking the schema of your codec configuration."

This PR is deliberately non-normative about the content of the extension codecs. The intention is that the maintenance of the extension codecs should fall to their authors. If spec maintainers must enforce a recommendation that extension codec authors avoid breaking changes to their codecs, then we are making spec maintainers assume a maintenance role for those codecs, which is an outcome we want to avoid. In fact, the entire point of this PR is to provide a path for members of the community to publish extension codecs with as little work from spec maintainers as possible. This is the spirit of the following text from the PR:

Publication in the "extension codecs" section does not confer primacy or an official designation to a codec. The list of extension codecs exists expressly as a tool to enable coordinated codec development.

Also, your requested recommendation is not so simple. We cannot simply recommend that codec developers avoid breaking changes to their codecs -- a codec developer can use semver in the configuration for their codec, and they can define in their codec specification how implementations of that codec should handle breaking changes.

It is exactly because this matter is complicated that I do not wish to comment on it in this PR, hence my use of limited language above. If we are to make recommendations to codec developers (and managing versioning is just one of many possible recommendations), it should not be in the zarr v3 specification but rather a "best practices for codec development" document hosted elsewhere. I would be happy to review such a document if anyone wants to open a PR to add one.

d-v-b avatar Oct 09 '24 20:10 d-v-b

@normanrz @dstansby @LDeakin @zoj613 thoughts on the changes proposed in this PR? In particular the statement that implementations SHOULD support the core codecs, but also anything else.

d-v-b avatar Oct 09 '24 20:10 d-v-b

This PR is deliberately non-normative about the content of the extension codecs. The intention is that the maintenance of the extension codecs should fall to their authors.

Agreed but ...

If spec maintainers must enforce a recommendation that extension codec authors avoid breaking changes to their codecs, then we are making spec maintainers assume a maintenance role for those codecs, which is an outcome we want to avoid.

I didn't say "enforce", but I do think it's our job to communicate preferences where possible since codec implementers can't read our minds.

We cannot simply recommend that codec developers avoid breaking changes to their codecs...

I disagree. We can definitely recommend; we can't require.

a codec developer can use semver in the configuration for their codec, and they can define in their codec specification how implementations of that codec should handle breaking changes.

That's an interesting point. If we think that's feasible, then I would include that in our discussion. An expert codec developer will likely have thought of the above and will not need our recommendation, but for the rest of the world, we should provide what guidance we can or at least a description of trade-offs that we are aware of. And I'd still say that codec breakage (without versioning) is something to be avoided for the community.

Tt should not be in the zarr v3 specification but rather a "best practices for codec development" document hosted elsewhere. I would be happy to review such a document if anyone wants to open a PR to add one.

:+1: though I think non-normative statements can easily be kept in this document until we have such a best practices guide.

joshmoore avatar Oct 09 '24 20:10 joshmoore

I didn't say "enforce", but I do think it's our job to communicate preferences where possible since codec implementers can't read our minds.

Here is the crux of the matter: it is simply not my preference that authors of extension codecs avoid breaking changes to their codecs. In my view, they can do what they want with their codecs, except use the name of a core codec. The whole point of this PR is that we are hosting the specification of their codecs, and nothing else. You are asking for us to request or recommend something else, i.e. your personal preference that those developers adopt certain development practices. That preference is, as we see here, not simple and the subject of debate.

In the interest of getting the core features of this PR through, I would request that we not attempt to resolve a potentially complicated discussion about best practices for codec development with this PR. Lets find a different venue for that content. We need the core elements of this PR to host specifications for string datatypes in zarr v3. Pending additional evidence, we do not need development recommendations for codec developers at this time.

d-v-b avatar Oct 09 '24 21:10 d-v-b

A few last points for tonight:

it is simply not my preference

To be clear, I'm not talking about you or your preferences specifically anywhere though I value your insights in these matters. It's about communicating to new devs with a common voice what we believe will lead to the most valuable ecosystem; they can always ignore us.

The whole point of this PR is that we are hosting the specification of their codecs, and nothing else

As I mentioned on call, I think you and I differ here in that as soon as we accept in the PR to list the community extension I think there are expectations, and we can keep trying to communicate that we take no responsibility, but it's probably as easy to list some best practices to avoid the worst mistakes.

I would request that we not attempt to resolve a potentially complicated discussion about best practices for codec development with this PR...

Understood. I don't necessarily agree that we want to wait for the evidence of having broken data to make a recommendation nor that it warrants a separate page, but I accept that you don't want to take responsibility for those changes.

P.S. I have some inkling that this PR and #293 point to differing understandings of what codecs are and/or how they should be handled (naming, etc.), but I can't put my finger on it, certainly not at this hour...

joshmoore avatar Oct 09 '24 21:10 joshmoore

@normanrz @dstansby @LDeakin @zoj613 thoughts on the changes proposed in this PR? In particular the statement that implementations SHOULD support the core codecs, but also anything else.

It all looks good to me 👍

Breaking changes/versioning

+1 for everything stated in https://github.com/zarr-developers/zarr-specs/pull/312#issuecomment-2403397054 by @d-v-b.

If I were to propose an extension codec for one of the experimental codecs in zarrs, I would seek feedback during a PR and provide stability guarantees on merge. But I don't think that should be mandatory.

I think it would be perfectly fine for extension codecs to include a disclaimer like the following (if needed):

This codec may receive breaking changes and is recommended for evaluation only. This revision is supported by XX implementations.

LDeakin avatar Oct 09 '24 22:10 LDeakin

Thanks, @LDeakin, it's a good example of what a developer would realistically want to communicate.

I think what I'm struggling with is that there seem to be more classes of extensions and we're trying to map a subset. Trying to describe the classes without using any of terms we've used so far:

Class A B C D E F
Confidence 100% 80% 60% 40% 20% 0%
Naming Unique and "special" Unique and "special" (?) Unique Unique Unique in "registry" No guarantees
Schema zarr-specs zarr-specs zarr-specs zarr-specs Developer's page (if at all) Developer's page (if at all)
Dicovery zarr-specs zarr-specs zarr-specs zarr-specs zarr-specs PyPi, etc.
Versioning ZEP-controlled ZEP-controlled Recommendations Unknown Unknown Unknown
Support All implementations Hopefully most implementations Unknown Unknown Unknown Unknown
Comment/Like "registry" "entrypoint"

(Note: I'm not proposing these classes for inclusion in the spec just trying to assign names so we can discuss previous and current state without getting confused)

I think if you were to have asked me before this PRs conversation I would have said "core codecs" are (A) and "extensions" are (C), and I would say if there are no guarantees on a codec then for me it's closer to (E), i.e. grab your name and go, or the current situation of (F).

I think this proposal is closer to suggesting (B) and (D), but happy to take input on the table to see if we're thinking about these things in the same way.

A reason I think this is important is the relationship to other future extensions. Would we feel comfortable with a similar level of confidence for an "extension data type" that we list in zarr-specs?

joshmoore avatar Oct 10 '24 11:10 joshmoore

The goal of this PR is not to carefully map the space of codecs. We should do that somewhere else. The goal of this PR is to clarify the language of the spec in a useful way for codec development, in particular solving two acute, practical problems:

  • defining the status of the codecs that are currently listed in the spec (the "core codecs")
  • defining a minimal framework for publishing codecs that are not listed in the spec ("extension codecs") that solves an active problem: over in zarr-python we are back-porting zarr v2 codecs into zarr v3, and we (the zarr community) needs a place to publish the specs for those codecs so that other implementations can work with them.

So, for the purposes of this PR, there are only two classes of codecs under consideration: those listed in the spec ("core codecs"), and those not listed in the spec ("extension codecs").

For the codecs currently listed in the spec (core codecs), this PR addresses the following question: SHOULD implementations support those codecs, or MUST they support those codecs?

The answer I have reached in this PR is a decisive SHOULD. Suppose someone wants to write a Zarr implementation in a language X that doesn't have any bindings to blosc. If the spec requires that all Zarr implementations MUST support the core codecs, then language X cannot have a valid Zarr implementation until blosc bindings are added, and this is an unfair burden on members of the language X ecosystem. Crucially, this is not a hypothetical scenario. Because the alternative would cause serious problems, this PR alters the spec to state that the core codecs SHOULD be supported by Zarr implementations.

For codecs not listed in the spec (extension codecs), this PR addresses the following question: what is the relationship between extension codecs and core codecs? (answer: they have to use a different name from the core codecs). This PR also defines a service for zarr community members, wherein the zarr spec developers will host their codec. But, as stated in the PR, being listed doesn't confer any special status to an extension codec.

This PR represents my attempt to make the minimal amount of changes to resolve serious ambiguity in the spec document which is a barrier for codec development. If there are proposed changes to this PR that can better solve the problems I am trying to solve then please suggest them.

A reason I think this is important is the relationship to other future extensions. Would we feel comfortable with a similar level of confidence for an "extension data type" that we list in zarr-specs?

I think the changes I am proposing in this PR could work for data types as well. I would use the same two classes, with the same rules for members of each class.

d-v-b avatar Oct 10 '24 12:10 d-v-b

Aside from my comments above, I wanted to say that I am 👍 this change overall, and thanks a bunch for opening a concrete proposal to push this forward.

dstansby avatar Oct 10 '24 12:10 dstansby

The goal of this PR is not to carefully map the space of codecs....

@d-v-b, though I appreciate your immediate goals, since this touches on our shared understanding of terminology it's unfortunately bigger. I was hoping the table would help since I believe a source of conflict here is that the definitions have shifted for some of us: we're not speaking the same language. How would you propose to work towards consensus here?

joshmoore avatar Oct 10 '24 12:10 joshmoore

The goal of this PR is not to carefully map the space of codecs....

@d-v-b, though I appreciate your immediate goals, since this touches on our shared understanding of terminology it's unfortunately bigger. I was hoping the table would help since I believe a source of conflict here is that the definitions have shifted for some of us: we're not speaking the same language. How would you propose to work towards consensus here?

I don't know what the disagreement or conflict is -- what changes do you specifically want to see in this PR? We have discussed adding recommendations to codec developers, but since we disagree on the content of that recommendation I think it's out of scope for this PR, and should probably be contained in a separate PR; I'd be happy to continue that particular conversation in such a PR.

d-v-b avatar Oct 10 '24 13:10 d-v-b

we disagree on the content of that recommendation I think it's out of scope for this PR

The fact that we disagree means that we will likely be having this conversation in any follow up PR, so unfortunately, I'd say it's in scope until we at least agree on the definition of what's in this PR.

Questions I have are:

  • core
    • Are there any codecs required by all implementations? (A) - I think we're still waiting on some responses to your ping.
  • extensions
    • what's the stability of "extension codecs"? (C vs D)
    • should codecs that are for "evaluation only" as @LDeakin said have their schema listed (D), just be named (E) or maybe not be part of this repo (F)? If not D, does that point to a different type (e.g. "community") again?

joshmoore avatar Oct 10 '24 13:10 joshmoore

Are there any codecs required by all implementations? (A) - I think we're still waiting on some responses to your ping.

no, pending feedback of course.

what's the stability of "extension codecs"? (C vs D)

should codecs that are for "evaluation only" as @LDeakin said have their schema listed (D), just be named (E) or maybe not be part of this repo (F)? If not D, does that point to a different type (e.g. "community") again?

These questions are important but they are out of scope for this PR. As stated in the text of this PR, an extension codec is any codec that is not listed in the list of core codecs, and any extension codec may be submitted for publication on the zarr-specs repo. I have labored to keep this PR simple, but it feels to me that you are broadly suggesting that I make this PR more complex. Given that nothing in this PR precludes additional efforts to refine our definition of extension codecs, I would ask that we put those efforts and that conversation in a separate venue, e.g. an issue or a PR.

d-v-b avatar Oct 10 '24 13:10 d-v-b

These questions are important but they are out of scope for this PR. As stated in the text of this PR, an extension codec is any codec that is not listed in the list of core codecs, and any extension codec may be submitted for publication on the zarr-specs rep

I'm sorry, I disagree. Since this PR is defining these terms, it makes agreeing on that definition important, and will impact those future PRs.

it feels to me that you are broadly suggesting that I make this PR more complex.

No. Apologies. I'm striving to find an agreed understanding of these things you are defining in admittedly nicely simple terms. That's critical since as I described to you on the call and above I feel less than comfortable with the statement "any extension codec maybe be submitted" when the definition of extension is class D above (as an example).

joshmoore avatar Oct 10 '24 13:10 joshmoore

These questions are important but they are out of scope for this PR. As stated in the text of this PR, an extension codec is any codec that is not listed in the list of core codecs, and any extension codec may be submitted for publication on the zarr-specs rep

I'm sorry, I disagree. Since this PR is defining these terms, it makes agreeing on that definition important, and will impact those future PRs.

"codecs in the spec" and "codecs not in the spec" are concepts that exist in the spec already -- I am merely giving these concepts names in my PR.

it feels to me that you are broadly suggesting that I make this PR more complex.

No. Apologies. I'm striving to find an agreed understanding of these things you are defining in admittedly nicely simple terms. That's critical since as I described to you on the call and above I feel less than comfortable with the statement "any extension codec maybe be submitted" when the definition of extension is class D above (as an example).

Why does it make you uncomfortable?

d-v-b avatar Oct 10 '24 14:10 d-v-b

Rather than reiterate my points from above, I'll add that @rabernat's suggestion of a maturity level from https://stac-extensions.github.io/ lessened the nervous. So perhaps the change I'd consider here is how do we not close the door on such an addition in a follow-up PR? Perhaps as with @dstansby's suggestion, we try to say less rather than trying to work out the definitions at this point.

joshmoore avatar Oct 11 '24 08:10 joshmoore

So perhaps the change I'd consider here is how do we not close the door on such an addition in a follow-up PR?

What specifically in this PR closes the door on follow-up PRs?

d-v-b avatar Oct 11 '24 08:10 d-v-b