[CEP XXXX] OCI Storage of Conda Artifacts

Open beckermr opened this issue 9 months ago • 13 comments

This draft CEP has an updated specification for storing conda packages as OCI artifacts. It is an updated form of the specification in PR #70, given the feedback on the previous PR.

Rendered CEP

beckermr avatar Mar 11 '25 12:03 beckermr

Good catch @jaimergp! I reran my notebooks with _ going to _U and everything works great.

Interestingly enough, we apparently have no build strings with a double underscore in either defaults or conda-forge! All OCI-encoded conda artifact names I produced using _ -> __ from both of those channels passed the OCI regexes from the Distribution Spec. Fun!

beckermr avatar Mar 11 '25 14:03 beckermr

Thanks for the additional comments @jaimergp! Anything else you can see?

beckermr avatar Mar 11 '25 18:03 beckermr

Nothing too big, just a couple of observations:

  • The copyright statement needs to be put back in.
  • A References section that compiles the different URLs mentioned would be welcome.
  • I would wait until this CEP is approved to assign a number. There are a few ongoing PRs that might get voted on before this one, and then it would be confusing. For example, there's a PR named "CEP 17", but CEP 17 ended up being this one. This might reflect a problem in how we mint CEP numbers. Happy to discuss further!

jaimergp avatar Mar 11 '25 19:03 jaimergp

Ah, if we don't plan to assign numbers, can we adjust the linter then? Having it go red only because of the number is pretty annoying, since it also runs the spell checks, lints, etc.

beckermr avatar Mar 11 '25 21:03 beckermr

~Ack @jaimergp, our analysis of the double underscore issues was wrong. Build strings, version strings, and labels go in OCI tags, which allow any number of underscores in a row:~

~^[a-zA-Z0-9_][a-zA-Z0-9._-]{0,127}$~

~So we can and should use the _ -> __ encoding to make things visually cleaner.~

Never mind. While I do think the above is true, I'd rather keep the _U.
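
For reference, here is a minimal sketch of checking strings against the tag pattern quoted above. The example tags are made up for illustration and are not taken from the CEP or from any channel; the helper name `is_valid_oci_tag` is likewise hypothetical.

```python
import re

# Tag pattern as quoted in the comment above (OCI Distribution Spec tag grammar).
OCI_TAG_RE = re.compile(r"^[a-zA-Z0-9_][a-zA-Z0-9._-]{0,127}$")


def is_valid_oci_tag(tag: str) -> bool:
    """Return True if `tag` satisfies the quoted OCI tag pattern."""
    return OCI_TAG_RE.fullmatch(tag) is not None


# Hypothetical tag strings, for illustration only.
examples = [
    "1.2.3-h123abc__0",  # consecutive underscores
    "1.2.3-py312_U0",    # an underscore followed by "U"
    "__leading",         # leading underscores
    "-bad",              # a leading "-" is not in the first character class
    "a" * 129,           # longer than the 128-character limit
]
results = {tag: is_valid_oci_tag(tag) for tag in examples}

for tag, ok in results.items():
    print(f"{tag[:20]!r:24} valid={ok}")
```

Under this pattern, runs of underscores are accepted anywhere in a tag, so either encoding would pass the tag check; the choice between them is a readability and ambiguity question rather than a validity one.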

beckermr avatar Mar 11 '25 21:03 beckermr

OK @jaimergp This is ready for one more look.

beckermr avatar Mar 11 '25 21:03 beckermr

Comment from @schuylermartin45: move to better hash to avoid issues with hash-collision attacks

Comment from @Callek: move to SHA256 as a compromise

beckermr avatar Mar 12 '25 17:03 beckermr

I've pushed a huge update, @jaimergp. Comments welcome!

beckermr avatar Mar 13 '25 21:03 beckermr

Looking more at the OCI spec and the working code from conda-OCI-mirror, we'll need to specify a few more things, including:

  • what we do with the top-level config
  • any attributes attached to each layer
  • how we set the top-level artifact mediaType

beckermr avatar Mar 15 '25 18:03 beckermr

pre-commit.ci autofix

beckermr avatar Mar 15 '25 22:03 beckermr

I've been thinking about this more, and I think we should not use the m prefix and should instead disallow current_repodata.json on OCI channels. My reasons are:

  • Older conda clients that got performance improvements from current_repodata.json won't be able to read OCI channels directly anyway. They can fall back to repodata.json, and those clients should be upgraded in any case.
  • Older conda clients that access an OCI channel via a web proxy could request current_repodata.json and the web proxy could translate to repodata_current.json for use with an OCI channel if we wanted.
  • The CEP in #116 specifies that the URL of a conda channel needs to have <channel base URL>/noarch/repodata.json as a valid address. As long as we stick to the tag "latest" pointing at the most recent image, we can meet this spec.
  • We will likely save ourselves some pain by only having to deal with prefixing package names with c as opposed to having to prefix everything.

Thoughts @jaimergp?
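
To illustrate why package names need a prefix at all, here is a rough sketch based on the repository-name grammar in the OCI Distribution Spec, which requires each path component to start with a lowercase letter or digit. The `prefix_package_name` helper and the repository path layout are assumptions for illustration only; the actual encoding is whatever the CEP ends up specifying.

```python
import re

# Repository-name grammar from the OCI Distribution Spec:
# each path component starts and ends with [a-z0-9], and components are
# internally separated by ".", "_", "__", or one or more "-".
_COMPONENT = r"[a-z0-9]+(?:(?:\.|_|__|-+)[a-z0-9]+)*"
OCI_NAME_RE = re.compile(rf"{_COMPONENT}(?:/{_COMPONENT})*")


def is_valid_oci_name(name: str) -> bool:
    """Return True if `name` is a valid OCI repository name."""
    return OCI_NAME_RE.fullmatch(name) is not None


def prefix_package_name(name: str, prefix: str = "c") -> str:
    """Hypothetical helper: prepend a fixed prefix to a conda package name."""
    return prefix + name


raw = "_libgcc_mutex"                 # conda package names may start with "_"
prefixed = prefix_package_name(raw)   # "c_libgcc_mutex"

print(raw, is_valid_oci_name(raw))
print(prefixed, is_valid_oci_name(prefixed))
# Hypothetical full repository path: <channel>/<subdir>/<prefixed name>
print(is_valid_oci_name(f"conda-forge/linux-64/{prefixed}"))
```

A name with a leading underscore fails the grammar on its own, while the prefixed form passes, which is the problem the c prefix addresses for package names.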

beckermr avatar Mar 31 '25 14:03 beckermr

All of that is currently valid and sound. I'm just worried that we are getting lucky right now in having no conflicts in the filename prefixes. (I do second dropping current_repodata.json. We don't need it these days, and we can consider it an Anaconda.org implementation detail if there's such a need.)

I just don't see the pain in prefixing metadata files with m. If we don't, we might run into a situation where we want to add a new type of file, and the only way out would be to add OCI-specific sub-subdirs like conda-forge/linux-64/packages/ and conda-forge/linux-64/metadata/, and that seems more painful.

But if you think the burden of prefixing m to everything is not worth it, I won't block it. I'd be happy to hear what others think too.

jaimergp avatar Mar 31 '25 17:03 jaimergp

Hmmm. Maybe the right thing is to distinguish between the abstract URL given to conda and the storage location on disk. We can specify that more clearly in the CEP.

beckermr avatar Mar 31 '25 18:03 beckermr