[CEP XXXX] Identifying Packages and Channels in the conda Ecosystem
Rendered CEP
Comes from https://github.com/conda/ceps/pull/115#discussion_r1989305660.
Addresses the long-needed standardization of package names and channel names.
I took @beckermr's regexes and added some more details and structure. Very open to feedback on how descriptive vs. prescriptive we should get. So far, I'm only trying to describe roughly what we have in conda. I'd like to know more about how this is handled in rattler. cc @baszalmstra @wolfv
I wonder if we should introduce maximum lengths across all categories, so we don't end up with something stupid like a package name of 50K characters.
A maximum of 128 seems good to me.
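A cap like this is cheap to enforce alongside whatever pattern the CEP settles on. The sketch below is only an illustration: `MAX_IDENTIFIER_LENGTH` and `check_identifier_length` are hypothetical names, and 128 is just the value floated in this thread, not something the CEP has adopted.

```python
MAX_IDENTIFIER_LENGTH = 128  # value suggested in this thread; not normative


def check_identifier_length(value: str, kind: str = "package name") -> str:
    """Return `value` unchanged, or raise ValueError if it is empty or too long.

    Length is counted in characters (code points), not encoded bytes.
    """
    if not value:
        raise ValueError(f"{kind} must not be empty")
    if len(value) > MAX_IDENTIFIER_LENGTH:
        raise ValueError(
            f"{kind} is {len(value)} characters long; "
            f"the maximum is {MAX_IDENTIFIER_LENGTH}"
        )
    return value
```

The same helper could be reused for channel names and labels by passing a different `kind`.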
Given the discussion above, it appears that treating the label separate from the channel is a mistake in the following sense.
A conda channel formally has a single set of repodata files associated with it (the platforms plus noarch).
In this sense, labels on anaconda.org are separate channels (as they have their own separate repodata). It is just that they are built out of subsets of packages from a parent channel (that subset denoted by the label).
The notation <channel>/label/<label> is basically a reserved pattern in the space of channels to refer to the results of this process.
So when we declare allowed channel names, we should not allow channel names like conda-forge/label/main since that would overlap with the actual main label on conda-forge.
I think the net result of all of this is that label is a reserved word in the space of channel names.
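As a rough sketch of what "reserved word" could mean in practice, the check below rejects any channel name that contains `label` as a whole path component. The function name and the strictness of the rule are assumptions made for illustration; the CEP text does not define either.

```python
RESERVED_COMPONENT = "label"  # assumption: reserved as proposed in this comment


def is_allowed_channel_name(name: str) -> bool:
    """Return False if `label` appears as a whole path component of `name`."""
    components = name.strip("/").split("/")
    return RESERVED_COMPONENT not in components


print(is_allowed_channel_name("conda-forge"))             # True
print(is_allowed_channel_name("conda-forge/label/main"))  # False
print(is_allowed_channel_name("my-labels"))               # True
```

Under this strict reading a channel named just `label` is rejected too; a spec could relax that by only reserving `label` when it is followed by another component.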
The requirement here on package names is more restrictive than what is supported by conda and conda-build right now. Among other differences, non-ASCII Unicode characters are allowed in package names in conda and can be produced by conda-build, which includes a test that checks this behavior.
If the requirements on package names are to be restricted, it would be good if the CEP specified how violations are expected to be handled, if at all.
Yep. Correct @jjhelmus. I've checked defaults and conda-forge for violations. IMHO tools should simply fail.
There are some edge cases where it is helpful to have concrete versions of virtual packages, such as for testing, debugging and locking, especially on non-native systems. For example, conda-lock creates a local channel with realized virtual packages.
I don't believe this CEP prohibits the existence of these artifacts, but it would be reasonable to include a note to this effect. That said, concrete packages that have virtual package names should not appear in general-purpose channels.
> Yep. Correct @jjhelmus. I've checked defaults and conda-forge for violations. IMHO tools should simply fail.
Should tools be actively checking for compatibility and, if so, at what level: creation, repodata, download/install?
As a concrete example, should a ♥ package be buildable? Included in repodata? Installable?
For reference right now the ♥ package can be:
- built by conda-build
- indexed by conda-index
- installed by conda using the classic solver
The following fail with the package:
- Uploading to anaconda.org (`[ERROR] ('package name ♥ not valid', 400)`)
- Installing with the libmamba solver (`UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 45: invalid continuation byte`, could be an issue on my end)
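For reference, the `0xe2` in that traceback matches the first byte of ♥ (U+2665) when encoded as UTF-8. The short Python sketch below shows the encoding, plus a hypothetical `require_ascii_name` helper illustrating how a tool could fail early if the CEP ends up restricting names to ASCII. The helper is an assumption for illustration, not existing conda or conda-build behavior.

```python
heart = "\u2665"  # the ♥ package name from the example above

# ♥ is a three-byte sequence in UTF-8, starting with 0xe2.
encoded = heart.encode("utf-8")
print(encoded.hex(" "))  # e2 99 a5
print(heart.isascii())   # False


def require_ascii_name(name: str) -> str:
    """Hypothetical early check: raise before building, indexing or installing."""
    if not name.isascii():
        raise ValueError(f"package name {name!r} contains non-ASCII characters")
    return name
```

Whether such a check belongs at build time, index time, or install time is exactly the open question in this comment.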
Another option is to leave the behavior of these packages undefined, which IMHO is okay.
> Given the discussion above, it appears that treating the label separate from the channel is a mistake in the following sense.
> A conda channel formally has a single set of repodata files associated with it (the platforms plus noarch).
> In this sense, labels on anaconda.org are separate channels (as they have their own separate repodata). It is just that they are built out of subsets of packages from a parent channel (that subset denoted by the label).
> The notation <channel>/label/<label> is basically a reserved pattern in the space of channels to refer to the results of this process.
> So when we declare allowed channel names, we should not allow channel names like conda-forge/label/main since that would overlap with the actual main label on conda-forge.
> I think the net result of all of this is that label is a reserved word in the space of channel names.
Hm, maybe we can just consider this an implementation detail of anaconda.org. "Labels" are just a way of creating repodata.json files from a collection of conda artifacts. By default, main renders the repodata in the user/org channel. Changing that default label renders the repodata in user_or_org/label/the_new_label. But it could very well have been user_or_org/the_new_label. anaconda.org also supports label-less packages, which go nowhere, so there's no need to establish that a label is always a subset of main, because something under a label doesn't have to be in main or anywhere really.
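To make that mapping concrete, here is a minimal sketch of the behavior as described in this comment. `repodata_channel_path` is a hypothetical helper, and the rule it encodes has not been checked against anaconda.org's actual implementation.

```python
def repodata_channel_path(owner: str, label: str = "main") -> str:
    """Channel path under which repodata for `label` is rendered.

    Encodes the behavior described in this thread: the default label
    `main` lands directly in the owner's channel, and any other label
    lands under `<owner>/label/<label>`.
    """
    if label == "main":
        return owner
    return f"{owner}/label/{label}"


print(repodata_channel_path("user_or_org"))                   # user_or_org
print(repodata_channel_path("user_or_org", "the_new_label"))  # user_or_org/label/the_new_label
```

Seen this way, the `/label/` infix is a naming choice made by the server rather than a property of the packages themselves.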
This begs the question of whether we want to standardize labels at all, or simply define an equivalent mechanism in the OCI mirror. Labels control repodata indexing by creating a different channel location for the selected packages. The labels can be applied after upload and publication. Since it concerns repodata generation, this has not been covered by the OCI CEP. We only need to figure out an OCI-equivalent way of doing this (one way could be "attached artifacts" via the Referrers API).
In that regard, we don't need to standardize label names separately; only as part of what a channel name is. To that effect, I've added a short paragraph now.
PS: The more I read about the Referrers API (e.g. see Quay blog post), the more I see "channel labels" as a post-publication metadata annotation that should be handled separately from artifact identifiers.
Yeah I am not so sure. The eventual CEP specifying what a standard HTTP-based conda server looks like will want a way to specify that subsets of packages on a channel should be grouped in different ways. This feature is pretty essential to package maintenance.
We don't have to specify how repodata is generated from labels in this CEP, but I think we should reserve the namespace.
Oops, `label` is a valid channel name 😂 https://anaconda.org/label
As long as we specify the construct is "the channel name, followed by the string literal 'label', followed by the label name", I'm not sure that's actually a problem?
[Edit: that said, assuming that https://example.com/mychannel and https://anaconda.org/mychannel are necessarily the same channel might be a problem.]
The problem I see is with URLs like my.server.org/label/label/linux-64/linux-64/repodata.json. Is the channel name label, with label linux-64 and subdir linux-64? Is the channel name linux-64, and happens to be in a path that has two label components before it?
We can specify parsing from right to left?
my.server.org/label/label/linux-64/linux-64/repodata.json
- The last thing is repodata.json
- So the subdir has to come next (it is "linux-64"). We're left with `my.server.org/label/label/linux-64`.
- We parse to the next "label". If `label` is not found, then we have the channel name.
- Otherwise, any path component we moved over to get to `label` is the label, and the channel is whatever remains to the left of that `label`.
- So the label is `linux-64` in this case and the channel is `my.server.org/label`.
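The right-to-left steps listed above can be sketched in Python as follows. This is one reading of the proposal, not an agreed algorithm: `ParsedRepodataURL` and `parse_repodata_url` are hypothetical names, and treating a trailing `label` with nothing after it as part of the channel is an assumption.

```python
from typing import NamedTuple, Optional


class ParsedRepodataURL(NamedTuple):
    channel: str
    label: Optional[str]
    subdir: str


def parse_repodata_url(url: str) -> ParsedRepodataURL:
    """Split '<channel>[/label/<label>]/<subdir>/repodata.json' right to left."""
    parts = url.split("/")
    if len(parts) < 3 or parts[-1] != "repodata.json":
        raise ValueError(f"not a repodata.json URL: {url!r}")

    subdir = parts[-2]  # the component right before repodata.json
    rest = parts[:-2]   # channel, possibly followed by label/<label>

    # Walk from the right to find the nearest "label" component.
    label_index = None
    for i in range(len(rest) - 1, -1, -1):
        if rest[i] == "label":
            label_index = i
            break

    # No "label" found, or nothing after it: everything is the channel.
    if label_index is None or label_index == len(rest) - 1:
        return ParsedRepodataURL("/".join(rest), None, subdir)

    channel = "/".join(rest[:label_index])
    label = "/".join(rest[label_index + 1:])
    return ParsedRepodataURL(channel, label, subdir)


print(parse_repodata_url("my.server.org/label/label/linux-64/linux-64/repodata.json"))
# ParsedRepodataURL(channel='my.server.org/label', label='linux-64', subdir='linux-64')
```

Because the split is on `/` only, a scheme such as `https://` stays inside the channel string untouched. A URL that starts with `label/` would yield an empty channel here, which a real spec would need to rule on.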
> We can specify parsing from right to left? ...
Give me a bit. I think I can write an EBNF grammar that will capture that.
Then, for my.server.org/label/linux-64/linux-64/repodata.json (one less label path component), the server name is my.server.org, `` (empty), or label/linux-64?
By the logic above, this string
my.server.org/label/linux-64/linux-64/repodata.json
would be parsed as
- remove the repodata.json
- subdir = linux-64
- Splitting `my.server.org/label/linux-64/` on the first `label` from the right produces (`my.server.org`, `linux-64`). So the channel is `my.server.org` and the label is `linux-64`.
I don't see why my.server.org can't be a valid channel?
> I don't see why my.server.org can't be a valid channel?
Just tried conda search --override-channels -c http://localhost:8080 openblas, and that worked just fine, so I'd argue that https://my.server.org should be a perfectly valid channel. (On the assumption that "channel" basically means: all the URL components preceding noarch/repodata.json that conda will try to fetch to determine if something is a "valid" channel.)
What might break, though, is https://my.server.org/label/linux-64/repodata.json. Maybe. But, in that case, since label is not preceded by [another] label, I suppose the channel name is label, no label, and subdir name is linux-64.
Yep @chenghlee. If you split on the label and the label you extract is the empty string, then there is no label and the channel is as you say, https://my.server.org/label.
Just to be very clear, in this spec the channel should include everything to the start of the URI. conda itself can strip the anaconda.org out of that bit for display purposes, but this is a UX feature.
> I don't see why my.server.org can't be a valid channel?
Yea domain names (hosts) can be, but I think the conda CLI hides the hostname from the display name and uses an empty name (or a /?) as the display name. This is also the case with the configured channel_alias (conda.anaconda.org).
Right. We can fix that in the cli. That is a display issue.
> Just to be very clear, in this spec the channel should include everything to the start of the URI.
Including or excluding the scheme? I assume no authentication, but maybe we do have to encode the port? That means we need to accept slashes and colons too.
From @chenghlee's test I think we keep everything including the ports, right?
That's what I think too, but then mapping that to OCI will present more challenges 😬
> Just to be very clear, in this spec the channel should include everything to the start of the URI.
Modulo the arguably pedantic distinctions between URIs (RFC 3986) and URLs (RFC 1738), I would agree with that. To support right to left parsing in any sensible way, I think "channel" should include everything else in the URL — scheme, host, port, user, password, etc. included.
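To illustrate what "everything else in the URL" covers, here is a small Python sketch using the standard library's `urllib.parse.urlsplit`. The example URL and the `channel_from_repodata_url` helper are made up for illustration, and the helper deliberately ignores the `label` question.

```python
from urllib.parse import urlsplit

url = "https://user:secret@my.server.org:8443/label/linux-64/repodata.json"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.username)  # user
print(parts.password)  # secret
print(parts.hostname)  # my.server.org
print(parts.port)      # 8443
print(parts.path)      # /label/linux-64/repodata.json


def channel_from_repodata_url(url: str) -> str:
    """Everything left of '<subdir>/repodata.json', kept verbatim."""
    pieces = url.rsplit("/", 2)
    if len(pieces) != 3 or pieces[2] != "repodata.json":
        raise ValueError(f"not a repodata.json URL: {url!r}")
    return pieces[0]


print(channel_from_repodata_url("http://localhost:8080/noarch/repodata.json"))
# http://localhost:8080
```

Keeping the left-hand side verbatim means scheme, credentials and port all survive, which is also why the set of characters allowed in a channel would have to include `:`, `/` and `@`.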
> That's what I think too, but then mapping that to OCI will present more challenges 😬
True; and I think there's maybe to definitely a separate discussion of what makes a "channel" unique. It's not just the OCI mapping that creates such semantic issues; e.g., (cf-staging -> conda-forge) and mirrors in various forms would have the same problems. To a large extent, I think this comes from the fact that conda packages are far more mobile across "channels" (whatever that may mean) than packages in other ecosystems.
I need to read those rfcs. I don't know the difference myself.
I think we can defer some of the mapping issues to a spec on mirroring and a global index of channels.
I should reframe the OCI cep to simply specify what an OCI uri looks like, how it gets encoded to be used with an OCI instance, and not force it to be tied to specific things in other
The URL/URI RFCs plus WHATWG URLs is a fun read 😂
> I think we can defer some of the mapping issues to a spec on mirroring and a global index of channels.
Yep, a central registry like Wolf proposed in #91 is needed for that "channel identity" source of truth.
> I need to read those rfcs. I don't know the difference myself.
90% of the time, I forget what the exact distinctions are. But I'm also aware that someone taking a cursory glance at the example URIs in RFC 3986 could pedantically argue that tel:+1-816-555-1212 should be a valid channel, and none of us have the time, energy, or blood pressure medication to argue "absolutely not". 😆