ceps

[CEP XXXX] Identifying Packages and Channels in the conda Ecosystem

Open jaimergp opened this issue 9 months ago • 52 comments

Rendered CEP

Comes from https://github.com/conda/ceps/pull/115#discussion_r1989305660.

Addresses the long-needed standardization of package names and channel names.

I took @beckermr's regexes and added some more details and structure. Very open to feedback on how descriptive vs prescriptive we should get. So far, I'm only trying to describe roughly what we have in conda. I'd like to know more about how this is handled in rattler. cc @baszalmstra @wolfv

Mar 11 '25 20:03 jaimergp

I wonder if we should introduce maximum lengths across all categories, so we don't end up with something stupid like a package name of 50K characters.

Mar 12 '25 10:03 jaimergp

A maximum of 128 seems good to me.
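
As a rough illustration of the proposed cap, a length check could be as small as the sketch below. `MAX_NAME_LENGTH` and `check_name_length` are hypothetical names made up for this example, and 128 is only the value floated in this thread, not a ratified limit.

```python
MAX_NAME_LENGTH = 128  # value suggested in this thread; an assumption, not a ratified limit


def check_name_length(name: str, kind: str = "package") -> str:
    """Return the name unchanged, or raise if it exceeds the proposed cap."""
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(
            f"{kind} name is {len(name)} characters long; "
            f"the proposed maximum is {MAX_NAME_LENGTH}"
        )
    return name
```

The check counts characters rather than encoded bytes, which matches the "50K characters" phrasing above; a byte-based limit would be a different design choice.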

Mar 12 '25 11:03 beckermr

Given the discussion above, it appears that treating the label separate from the channel is a mistake in the following sense.

A conda channel formally has a single set of repodata files associated with it (the platforms plus noarch).

In this sense, labels on anaconda.org are separate channels (as they have their own separate repodata). It is just that they are built out of subsets of packages from a parent channel (that subset denoted by the label).

The notation <channel>/label/<label> is basically a reserved pattern in the space of channels to refer to the results of this process.

So when we declare allowed channel names, we should not allow channel names like conda-forge/label/main since that would overlap with the actual main label on conda-forge.

I think the net result of all of this is that label is a reserved word in the space of channel names.
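
A minimal sketch of what reserving the word could mean in practice, assuming channel names are compared as slash-separated path components. `RESERVED_COMPONENT` and `is_allowed_channel_name` are hypothetical names for this illustration, not part of conda or the CEP.

```python
RESERVED_COMPONENT = "label"  # the word proposed as reserved in this comment


def is_allowed_channel_name(name: str) -> bool:
    """Reject names where `label` appears as a whole path component."""
    components = name.strip("/").split("/")
    return RESERVED_COMPONENT not in components
```

Under this rule `conda-forge` and `my-label` pass, while `conda-forge/label/main` is rejected because it would collide with the `main` label on `conda-forge`.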

Mar 12 '25 18:03 beckermr

The requirement here on package names is more restrictive than what is supported by conda and conda-build right now. Among other differences, non-ASCII Unicode characters are allowed in package names in conda, and conda-build can produce such packages; it even includes a test which checks this behavior.

If package names are to be restricted, it would be good if the CEP specified how violations are expected to be handled, if at all.

Mar 12 '25 19:03 jjhelmus

Yep. Correct @jjhelmus. I've checked defaults and conda-forge for violations. IMHO tools should simply fail.

Mar 12 '25 19:03 beckermr

There are some edge cases where it is helpful to have concrete versions of virtual packages, such as testing, debugging and locking, especially on non-native systems. For example, conda-lock creates a local channel with realized virtual packages.

I don't believe this CEP prohibits the existence of these artifacts, but it would be reasonable to include a note to this effect. That said, concrete packages that have virtual package names should not appear in general-purpose channels.

Mar 12 '25 19:03 jjhelmus

Yep. Correct @jjhelmus. I've checked defaults and conda-forge for violations. IMHO tools should simply fail.

Should tools be actively checking for compatibility, and if so, at what level: creation, repodata, download/install?

As a concrete example should a ♥ package be buildable? included in repodata? installable?

For reference right now the ♥ package can be:

built by conda-build
indexed by conda-index
installed by conda using the classic solver

The following fail with the package:

Uploading to anaconda.org ([ERROR] ('package name ♥ not valid', 400))
Installing with the libmamba solver (UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 45: invalid continuation byte, could be an issue on my end)

Another option is to leave the behavior of these packages undefined, which IMHO is okay.
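
For comparison, the "tools should simply fail" option could look roughly like the sketch below. It only checks for non-ASCII characters and does not implement the full naming rules from the CEP; `require_ascii_name` is a hypothetical helper, not an existing conda or conda-build function.

```python
def require_ascii_name(name: str) -> str:
    """Return the name unchanged, or fail fast on non-ASCII characters."""
    if not name.isascii():
        raise ValueError(f"package name {name!r} contains non-ASCII characters")
    return name
```

The same helper could be called at build, index, or install time; which of those levels should enforce it is exactly the open question above.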

Mar 12 '25 20:03 jjhelmus

Given the discussion above, it appears that treating the label separate from the channel is a mistake in the following sense.

A conda channel formally has a single set of repodata files associated with it (the platforms plus noarch).

In this sense, labels on anaconda.org are separate channels (as they have their own separate repodata). It is just that they are built out of subsets of packages from a parent channel (that subset denoted by the label).

The notation <channel>/label/<label> is basically a reserved pattern in the space of channels to refer to the results of this process.

So when we declare allowed channel names, we should not allow channel names like conda-forge/label/main since that would overlap with the actual main label on conda-forge.

I think the net result of all of this is that label is a reserved word in the space of channel names.

Hm, maybe we can just consider this an implementation detail of anaconda.org. "Labels" are just a way of creating repodata.json files from a collection of conda artifacts. By default, main renders the repodata in the user/org channel. Changing that default label renders the repodata in user_or_org/label/the_new_label. But it could very well have been user_or_org/the_new_label. anaconda.org also supports label-less packages, which go nowhere, so there's no need to establish that a label is always a subset of main, because something under a label doesn't have to be in main or anywhere, really.

This begs the question of whether we want to standardize labels at all, or simply define an equivalent mechanism in the OCI mirror. Labels control repodata indexing by creating a different channel location for the selected packages. The labels can be applied after upload and publication. Since it concerns repodata generation, this has not been covered by the OCI CEP. We only need to figure out an OCI-equivalent way of doing this (one way could be "attached artifacts" via the referrers API)

In that regard, we don't need to standardize label names separately; only as part of what a channel name is. To that effect, I've added a short paragraph now.

PS: The more I read about the Referrers API (e.g. see Quay blog post), the more I see "channel labels" as a post-publication metadata annotation that should be handled separately from artifact identifiers.

Mar 12 '25 20:03 jaimergp

Yeah I am not so sure. The eventual CEP specifying what standard HTTP-based conda server looks like will want a way to specify that subsets of packages on a channel should be grouped in different ways. This feature is pretty essential to package maintenance.

We don't have to specify how repodata is generated from labels in this CEP, but I think we should reserve the namespace.

Mar 12 '25 20:03 beckermr

Oops, label is a valid channel name 😂 https://anaconda.org/label. Loving this:

Mar 12 '25 20:03 jaimergp

Oops, label is a valid channel name 😂 https://anaconda.org/label

As long as we specify the construct is "the channel name, followed by the string literal 'label', followed by the label name", I'm not sure that's actually a problem?

[Edit: that said, assuming https://example.com/mychannel and https://anaconda.org/mychannel are necessarily the same channel might be a problem.]

Mar 12 '25 20:03 chenghlee

The problem I see is with URLs like my.server.org/label/label/linux-64/linux-64/repodata.json. Is the channel name label, with label linux-64 and subdir linux-64? Is the channel name linux-64, and happens to be in a path that has two label components before it?

Mar 12 '25 21:03 jaimergp

We can specify parsing from right to left?

my.server.org/label/label/linux-64/linux-64/repodata.json

The last thing is repodata.json
So the subdir has to come next (it is "linux-64"). We're left with my.server.org/label/label/linux-64.
We parse to the next "label". If label is not found, then we have the channel name.
Otherwise, any path component we moved over to get to label is the label, and the channel is the rest after the label.
So the label is linux-64 in this case and channel is my.server.org/label.
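
The steps above can be sketched in Python as follows. This is only an illustration of the right-to-left rule as described in this thread, not an agreed-upon parser: `ParsedRepodataUrl` and `parse_repodata_url` are hypothetical names, and joining several skipped path components into one label is an assumption of this sketch.

```python
from typing import NamedTuple, Optional


class ParsedRepodataUrl(NamedTuple):
    channel: str
    label: Optional[str]
    subdir: str


def parse_repodata_url(url: str) -> ParsedRepodataUrl:
    """Split a repodata.json URL into channel, label and subdir, right to left."""
    parts = url.split("/")
    if len(parts) < 3 or parts[-1] != "repodata.json":
        raise ValueError(f"not a repodata.json URL: {url!r}")
    subdir = parts[-2]  # the component right before repodata.json
    rest = parts[:-2]   # channel plus optional label/<label>
    # Walk from the right until the first "label" component is found.
    for i in range(len(rest) - 1, -1, -1):
        if rest[i] == "label":
            skipped = rest[i + 1:]
            if not skipped:
                # Nothing follows "label": it belongs to the channel itself.
                break
            return ParsedRepodataUrl("/".join(rest[:i]), "/".join(skipped), subdir)
    return ParsedRepodataUrl("/".join(rest), None, subdir)


parsed = parse_repodata_url(
    "my.server.org/label/label/linux-64/linux-64/repodata.json"
)
print(parsed.channel, parsed.label, parsed.subdir)
```

For the URL in question this yields channel `my.server.org/label`, label `linux-64` and subdir `linux-64`. When nothing follows the rightmost `label`, the sketch treats it as part of the channel and reports no label.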

Mar 12 '25 21:03 beckermr

We can specify parsing from right to left? ...

Give me a bit. I think I can write an EBNF grammar that will capture that.

Mar 12 '25 21:03 chenghlee

Then, for my.server.org/label/linux-64/linux-64/repodata.json (one less label path component), the server name is my.server.org, `` (empty), or label/linux-64?

Mar 12 '25 21:03 jaimergp

By the logic above, this string

my.server.org/label/linux-64/linux-64/repodata.json

would be parsed as

remove the repodata.json
subdir = linux-64
Splitting my.server.org/label/linux-64/ on the first label from the right produces (my.server.org, linux-64). So the channel is my.server.org and the label is linux-64.

I don't see why my.server.org can't be a valid channel?

Mar 12 '25 21:03 beckermr

I don't see why my.server.org can't be a valid channel?

Just tried conda search --override-channels -c http://localhost:8080 openblas, and that worked just fine, so I'd argue that https://my.server.org should be a perfectly valid channel. (On the assumption that "channel" basically means: all the URL components preceding noarch/repodata.json that conda will try to fetch to determine if something is a "valid" channel.)

What might break, though, is https://my.server.org/label/linux-64/repodata.json. Maybe. But, in that case, since label is not preceded by [another] label, I suppose the channel name is label, no label, and subdir name is linux-64.

Mar 12 '25 21:03 chenghlee

Yep @chenghlee. If you split on the label and the label you extract is the empty string, then there is no label and the channel is as you say, https://my.server.org/label.

Mar 12 '25 21:03 beckermr

Just to be very clear, in this spec the channel should include everything to the start of the URI. conda itself can strip the anaconda.org out of that bit for display purposes, but this is a UX feature.

Mar 12 '25 21:03 beckermr

I don't see why my.server.org can't be a valid channel?

Yeah, domain names (hosts) can be, but I think the conda CLI hides the hostname from the display name and uses an empty name (or a /?) as the display name. This is also the case with the configured channel_alias (conda.anaconda.org).

Mar 12 '25 21:03 jaimergp

Right. We can fix that in the cli. That is a display issue.

Mar 12 '25 21:03 beckermr

Just to be very clear, in this spec the channel should include everything to the start of the URI.

Including or excluding the scheme? I assume no authentication, but maybe we do have to encode the port? That means we need to accept slashes and colons too.

Mar 12 '25 21:03 jaimergp

From @chenghlee's test I think we keep everything including the ports, right?

Mar 12 '25 21:03 beckermr

That's what I think too, but then mapping that to OCI will present more challenges 😬

Mar 12 '25 21:03 jaimergp

Just to be very clear, in this spec the channel should include everything to the start of the URI.

Modulo the arguably pedantic distinctions between URIs (RFC 3986) and URLs (RFC 1738), I would agree with that. To support right to left parsing in any sensible way, I think "channel" should include everything else in the URL — scheme, host, port, user, password, etc. included.
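
To make those components concrete, here is a small sketch using Python's standard `urllib.parse.urlsplit`. The URL and the helper `describe_channel_url` are made up for illustration, and this is not how conda itself parses channels.

```python
from urllib.parse import urlsplit


def describe_channel_url(url: str) -> dict:
    """List the URL components a full channel identifier would carry."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "hostname": parts.hostname,
        "port": parts.port,
        "path": parts.path,
    }


info = describe_channel_url("https://user:secret@my.server.org:8080/label")
for key, value in info.items():
    print(f"{key}: {value}")
```

If all of these are part of the channel, two URLs that differ only in port or credentials would count as different channels.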

That's what I think too, but then mapping that to OCI will present more challenges 😬

True; and I think there's maybe to definitely a separate discussion of what makes a "channel" unique. It's not just the OCI mapping that creates such semantic issues; e.g., (cf-staging -> conda-forge) and mirrors in various forms would have the same problems. To a large extent, I think this comes from the fact that conda packages are far more mobile across "channels" (whatever that may mean) than packages in other ecosystems.

Mar 12 '25 21:03 chenghlee

I need to read those rfcs. I don't know the difference myself.

Mar 12 '25 22:03 beckermr

I think we can defer some of the mapping issues to a spec on mirroring and a global index of channels.

I should reframe the OCI cep to simply specify what an OCI uri looks like, how it gets encoded to be used with an OCI instance, and not force it to be tied to specific things in other

Mar 12 '25 22:03 beckermr

The URL/URI RFCs plus WHATWG URLs is a fun read 😂

Mar 12 '25 22:03 jaimergp

I think we can defer some of the mapping issues to a spec on mirroring and a global index of channels.

Yep, a central registry like Wolf proposed in #91 is needed for that "channel identity" source of truth.

Mar 12 '25 22:03 jaimergp

I need to read those rfcs. I don't know the difference myself.

90% of the time, I forget what the exact distinctions are. But I'm also aware that someone taking a cursory glance at the example URIs in RFC 3986 could pedantically argue that tel:+1-816-555-1212 should be a valid channel, and none of us have the time, energy, or blood pressure medication to argue "absolutely not". :laughing:
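
One hedged way to sidestep that argument is to require a host component, which `tel:` URIs lack. The helper `looks_like_channel_url` below is hypothetical. Note that it would also reject local `file://` channels that have no host, so a real rule would need to handle those separately.

```python
from urllib.parse import urlsplit


def looks_like_channel_url(uri: str) -> bool:
    """Accept only URIs that have both a scheme and a host."""
    parts = urlsplit(uri)
    return bool(parts.scheme) and parts.hostname is not None
```

With this check, `https://my.server.org/label` and `http://localhost:8080` are accepted, while `tel:+1-816-555-1212` is not.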

Mar 12 '25 23:03 chenghlee