registry-dev Allowed versions and Trustees publishing

In https://github.com/purescript/registry/pull/76#discussion_r495608601 we figured that using SemVer's "prerelease segment" for Trustees to publish versions, as the way the spec orders them is not the one we'd like.

I'll report here the discussion from the thread, as a starting point for further discussion:

@hdgarrood: I'm not sure using the semver prerelease segment will work so well for this. Firstly, prerelease versions are considered less than non-prerelease versions according to semver, so v1.0.0 compares greater than v1.0.0-r1. Secondly, doesn't this prevent package authors from using the prerelease segment for their own purposes? I wouldn't really mind if we reserved the prerelease segment for our own use, but in that case we would be diverging from semver (if we don't allow authors to use all of the components of semver, we aren't really using semver) so I think we ought to be more upfront about that if that's what we go with.

@f-f: Great points. I wouldn't like us to diverge from SemVer, but other than that I have no strong opinion on how to do this. Do you have a concrete idea that we could write down here?

@hdgarrood: I just looked over the semver spec again and it does say that prerelease identifiers and build metadata are optional so actually I take that back, we are well within our right to only accept what the spec describes as "normal" versions into the registry (i.e. those which don't have prerelease identifiers or build metadata), and since it's quite rare to use prerelease identifiers and build metadata in practice (at least for libraries), I think it may even be a good idea to reject versions which use them. I also think it would be nice to set the registry architecture up so that it's impossible for revisions to affect anything other than the package metadata. What Hackage do is handle revisions of metadata for a version separately from the version tarball itself, so that there's only ever one tarball for a version, and revisions of the metadata are stored separately. It then becomes the job of the registry client to fetch the most recent version of the metadata alongside the package tarball. That approach sounds sensible to me, and I think it suggests that metadata revisions are separate from package versions. So maybe this is just a question for the package index?

@f-f: I am not quite comfortable with Hackage's approach of storing package sources and metadata revisions separately, as I think it has a couple of problems:

there is only one tarball for a version, which means that "hashing the package sources" doesn't guarantee integrity and security by itself, as you'd need to fetch a metadata file as well, and you'd need to guarantee integrity (i.e. hash it and distribute the hashes) either of this file or of the whole bundle. In the latter case, one then also needs to define how to compose the files to compute the hash. Preserving integrity of the metadata as well is necessary because things like "changing version bounds" could introduce malicious code if not supervised, and not guaranteeing integrity for them would mean that people would get them in their build believing that the version hasn't changed (because the hash did not), rendering builds nondeterministic and insecure.

doing this would mean that the registry-index repo would become a source of truth, which means that we'd either have to maintain these two repos in sync (e.g. the moment we add metadata in there we also need to sync the list of versions here, etc) or unify them

...but would storing metadata in here mean that packages wouldn't version that? If metadata is not stored alongside the package sources, then there would be no way to use a package without going through the Registry (and I can see some people not wanting to do that because of security or other company concerns), as they'd have to figure out how the package is defining dependencies, bounds, etc. If we store the metadata alongside the sources instead, then we have the question of "which metadata is correct, the one here or the one in the registry?". This last reason alone is why the current design ditches this aspect (i.e. storing manifests in here and optionally in packages) from the previous draft.

About prerelease identifiers: some packages have been using the pre-release segment (e.g. Halogen or aff) and I think it's good to allow them to, as it has an important role in package versioning.

I also read again through the SemVer spec, and found an issue which might actually help us here: it looks like it considers two version with everything equal but the build metadata as equals in the sorting. I.e. it ignores build metadata when sorting. This is a problem for us because we need get a stable sorting of releases (to figure out the last version of a package), so we would either have to:

disallow releases with build metadata altogether

or reserve the build metadata segment only for Trustees to cut new revisions (so going back to that idea from the previous version of the draft), and expand the sorting so that it would consider version with build metadata as "later" than without. The ordering of different build metadata segments should be no problem, as the spec already defines how to sort prerelease segments, and their grammar is the same, so we could just reuse that. This behaviour seems to be what apt does as well - distros patch upstream packages and add build metadata so that the package manager picks the patched versions as newer - and I think it works very well there

Note that both of the above options means that we'll slightly diverge from SemVer, but I'd consider this quite fine, since it's basically just getting rid of undefined behaviour.

@hdgarrood:

Aren't we intending to provide access to individual package manifest files without downloading a whole package tarball anyway, via the package index? Ensuring the integrity of those package manifest files is already a problem we'll need to deal with, surely?

In that case, could we store package manifest files in the storage backend alongside the package tarballs and call that the source of truth for them? Then, the package index would always be derived from the storage backend, and would not be a source of truth?

I think disallowing build metadata makes sense. We could also require that each package may only upload one version with the same major, minor, patch, and prerelease components. For example, if you've already uploaded 1.0.0+abc, then I think the registry should reject a subsequent upload of 1.0.0+def. I think defining an ordering for the build metadata would be a more serious violation of the semver spec, because the spec specifically says that you mustn't do that. If we go against this, I think it is likely to cause funny behaviour in clients which implement the semver spec accurately: for example, you couldn't have versions 1.0.0+abc and 1.0.0+def in a Set together. The only way I can make sense of the build metadata ordering requirement from the semver spec is if package registries should be refusing package uploads which differ from an existing version only in the build metadata component, i.e. uploading two different versions with the same major, minor, patch, and prerelease components should be disallowed.

@f-f: How would Trustees publish revisions if we disallow build metadata? And why would sorting by that a "serious violation"? SemVer doesn't say "you shouldn't do that because it's bad", it just says "we don't do that in SemVer". Registry clients are supposed to implement a spec that we define here. If we say "it's SemVer plus ordering by build metadata", then that is the spec.

@hdgarrood: By treating revisions as a separate thing from package versions? It’s a more serious violation in my mind because it’s not just filling a gap in the spec, it’s going against something the spec explicitly says. The versioning libraries that exist aren’t “semver plus build metadata,” they’re just semver.

Oct 07 '20 11:10 f-f

@hdgarrood to get back to your points:

if the package index is a "cache" then we don't have to ensure integrity and availability, which will be best-effort instead. I.e. if clients want additional assurance will check integrity by matching the content of the manifest inside the downloaded package with the manifest in the cache. This approach reduces operational burden for us (since things are "best-effort" we don't have to worry too much about them becoming unusable, and we have to handle it with much less care, since it can always be recreated from scratch) and at the same time opens up the possibility of having a third-party registry-index (e.g. for companies internal use, or whatever) to point package managers to.
this does not solve the issue that for every revision there would be two versions of the manifest: one inside the tarball and one outside. I propose instead that we keep the manifest only inside the tarball, which implies a new version (and a new tarball) when the manifest changes. This will prevent any confusion about what's the right manifest to use
treating revisions as a separate thing from package versions means that in practice we'd be diverging from SemVer quite a bit, and in my opinion in a more intrusive way than repurposing the build metadata segment as proposed above - because all things equal we'd have to define the grammar for it (while things like sorting and priority are things we'd need to spec in both cases)

Ultimately my main goal here is to just keep the spec as simple (maybe better worded as "uncomplected") as possible, which is why I'm leaning towards things like "stick closely to SemVer", and "keep a single source of truth: the tarball", and so on. This is so that all the pieces are easy to understand, develop, and maintain.

I agree with your point that repurposing the build metadata segment might not be nice since it's a divergence from vanilla SemVer, which is not desirable because e.g. we'd not be able to use whatever generic SemVer libraries to deal with the registry things, which in turn means more code to be written. So maybe a good approach could be to allow Trustees to publish new patch versions? This would allow us to stick to 100% SemVer and get a correct sorting. Note that in this way we'd not be in sync with git tags anymore, which would happen anyways if we go for https://github.com/purescript/registry/issues/16#issuecomment-715927722

Oct 24 '20 15:10 f-f

@hdgarrood do you have any more input on this, or should I move forward with the plan detailed in the last comment above?

Dec 18 '20 23:12 f-f

I still think having revisions separate from versions is preferable, and I don't think having manifests outside the tarball poses that much of a problem. I also don't agree that treating revisions differently from versions represents a divergence from SemVer; I'm not aware of anything in the SemVer spec that says we can't/shouldn't do that. Before I write another essay-length comment on that topic though, there's a more pressing issue, which is that I think handling metadata fixes with patch versions basically amounts to using yanking rather than revisions, whereas I think we previously agreed that revisions are more appealing than yanking. If you need to fix version bounds, and you publish that fix as a new patch version, then you need to tell the solver not to use the previous version (i.e. yank it). Also, if you make a release v1.0.0, and then you make a patch release v1.0.1 with a bug fix, you no longer have the option to revise the bounds for v1.0.0 if they turn out to be wrong (you can only release a new patch version, which then needs to include all of the other patch-level changes you've released in the meantime). You've probably seen lexi-lambda's reddit comment on revisions vs yanking before but here it is again just for reference.

Dec 19 '20 08:12 hdgarrood

which is that I think handling metadata fixes with patch versions basically amounts to using yanking rather than revisions

That's a good point, in trying to figure out a good solution for this I missed this general direction we were headed to. So let's go with revisions as you describe them, more specifically:

we'll add a revision : Optional Natural to the Manifest type (or should it be revision : Natural, should we always include it?)
we'll add to the spec a section about preferring higher revisions to lower revisions when picking package downloads

However I would still prefer to:

keep manifests inside the tarballs and never outside (except when caching them in the registry-index) - for the reasons I described above. This means creating a new tarball for every revision.
pin revisions when pinning a solved build plan - this means foregoing one of the advantages of revisions mentioned by Alexis in the comment you linked above, which is that "existing pinned plans don’t need to update those plans to accommodate the revisions, since they pinned to a version, not a revision". I would really like us to pin to revision, so that builds are truly reproducible over time and we don't serve to users different packages than what they solved in the first place (this is something that would really throw me off as a security-conscious user)

If all the above sounds good to you I'll go ahead and take care of updating the spec with these changes.

Dec 19 '20 16:12 f-f

That all sounds good to me. I think just using Natural for the revision field and giving it a value of 0 when there haven’t yet been any revisions sounds good - that way the first revision has revision = 1, which makes a bit more sense to me. Storing one tarball per revision sounds fine to me if we can guarantee that everything in the tarballs other than the manifest is the same. I think the question of whether to pin to revisions ought to be outside of the registry’s remit, that feels like a question for clients? Pinning to revisions in spago does sound fine to me too though.

Dec 19 '20 18:12 hdgarrood

Having implemented a part of this, I am now of the idea that having a separate revision field is a great deal of unnecessary complexity, and we should avoid it and use the build metadata instead. The hand-wavy reasoning is that by having a separate field we pay the price of dealing with structured data (separate fields/places to look for things, more complex data structures), while still having to deal with parsing unstructured data (as we store SemVer versions as simple strings). Better tradeoffs would be either only having structured data and no stringly-typed things, or only having a stringly-typed thing that we parse. The latter is much easier to achieve as the current ecosystem is built on that.

The best example of what I mean is probably here: https://github.com/purescript/registry/blob/3fab9a889d62c48d824476758067e5c7c5456690/v1/Metadata.dhall#L21 Here in the Metadata we store a mapping between something that represents "a tarball address" and their SHA256. If we have a revision field the key becomes a record (version + revision) - as you can see it's fine to represent this in Dhall, but it becomes impossible to represent it as a JSON object, as only strings can be object keys in JSON. If all the information about a version tarball is contained in the string then we can have a simple representation. Another occurrence of complexity is the part where we force storage backends to have logic to handle the revision scheme, come up with a naming scheme, encode all of that logic somewhere, and so on. That would disappear if we would instead have all the version information in a single place and have backends only deal with a string.

So I would like to go instead for the approach where we forbid authors from using SemVer's build-metadata field for their versions, and reserve it for trustees to publish new tarballs. We initially tought that this would (1) entail more implementation work and (2) go against SemVer's spirit, but it looks like none of these should be concerns: node-semver (which is arguably the most important implementation) provides a compareBuild function, which compares versions using also the build-metadata information. As you can see from the implementation it just applies the same sorting as the prerelease segment, making it easy to spec. I believe this route would be a net advantage in the amount of design/implementation work.

To clarify, this would entail:

using SemVer for all package versions
refusing publishing package versions that contain build metadata
except when such publish comes from a trustee intervention, in which case we would require the build metadata to be present in the version
use semver.compareBuild instead of semver.compare to sort package versions

Mar 22 '21 15:03 f-f

I'm still uneasy with that approach because client libraries are very likely to behave incorrectly if, say, the registry tells them that some package has two separate versions 1.0.0+0 and 1.0.0+1 available. I also think package authors should probably be able to make revisions of their own packages as well, not just trustees.

Mar 22 '21 15:03 hdgarrood

I think the purescript-versions package is probably a more important implementation of SemVer for our purposes than node-semver? Are you not using that already?

Mar 22 '21 15:03 hdgarrood

I'm still uneasy with that approach because client libraries are very likely to behave incorrectly if, say, the registry tells them that some package has two separate versions 1.0.0+0 and 1.0.0+1 available.

@hdgarrood Would you mind explaining this a little further? I'm not sure I'm seeing the problem you're envisioning and I'd like to make sure I'm understanding your point.

Mar 22 '21 16:03 thomashoneyman

Suppose I am writing a program which queries the registry. I ask the registry what the available version of purescript-blah are, and I get a list ["1.0.0+0", "1.0.0+1"]. I then parse that into a Set Version, using the purescript-versions library. Because the Ord instance for semver versions conforms to the semver spec and treats these two versions as equal, the resulting Set will just contain one element, and the question of which one stays in the set won't have a clear answer.

Mar 22 '21 16:03 hdgarrood

we'd detail in the spec that this is not strictly SemVer but a variant of it, so if you'd try to use purescript-versions and expect correct ordering that would not be compliant with the Registry spec
we shouldn't recommend for clients to query the registry directly. Package managers (and other clients) should be recommended to use the registry-index, which will contain a bunch of things including all the manifests for the versions of a package, and the correct sorting for the package versions

Mar 22 '21 17:03 f-f

@hdgarrood addressing some of your other concerns:

package authors do not need to publish revisions at all since they can just cut a new patch release
we're not using any SemVer implementation yet. I was looking at node-semver as it's probably the most widespread implementation (so it's more battle-tested, etc) and we don't need much logic (just parsing and sorting). I'm of course open to use purescript-versions if it implements what we need for the Registry purposes

Mar 23 '21 13:03 f-f

I think not giving package authors the ability to make a revision somewhat defeats the point of implementing revisions, since requiring people to do a patch version for these changes basically amounts to yanking instead; see eg my previous comment in this thread https://github.com/purescript/registry/issues/80#issuecomment-748442733. I also worry that on a social level it might not go down too well for package authors to not have full control over their own packages in comparison to trustees. (Of course there will always be some privileges which are reserved to trustees, eg removing packages if they are found to contain malicious code, but those should ideally only be exercised in exceptional circumstances I think.)

re purescript-versions vs node-semver, I was just thinking it would be nice to have a pure PureScript implementation, and purescript-versions has been in use for a few years in Pulp and I'm pretty confident that what's there currently is correct, so I just wanted to let you know it's there. It doesn't implement version ranges yet though. If you end up going with node-semver that's totally fine by me.

Mar 23 '21 13:03 hdgarrood

(I want to add that if you need to not worry about all of these issues I'm raising for now in order to ensure that this gets shipped in a reasonable timeframe, that's totally fine by me and I trust you to make that call.)

Mar 23 '21 13:03 hdgarrood

requiring people to do a patch version for these changes basically amounts to yanking instead

In my understanding yanking is when there's a way to tell the solver "don't use this". But "package authors publishing a new version" is just, well, authors publishing a new version: they are free to put whatever thing in there. Of course "authors asking trustees to cut a revision" is also a fine thing, and we could also implement some automation to allow that to happen without human intervention.

I also worry that on a social level it might not go down too well for package authors to not have full control over their own packages in comparison to trustees

In also understand this concern, but I think that with clear enough rules, stated clearly enough to package authors (e.g. giving out a link to them when publishing a package) then this might be less of an issue? Do you think there's something else we can do to mitigate this?

I want to add that if you need to not worry about all of these issues I'm raising for now in order to ensure that this gets shipped in a reasonable timeframe, that's totally fine by me and I trust you to make that call.

Thank you, I appreciate :slightly_smiling_face: I've been thinking about this in the last days - about how we can ship now while leaving the door open for improvements later - and I think that we are going to be forwards compatible if we:

forbid build metadata in published versions (so we can eventually reserve it for Registry purposes)
do not include a separate revision field anywhere (so we can eventually introduce it as an optional field as I did in my WIP branch)

@hdgarrood does this make sense to you? (cc @thomashoneyman)

Mar 23 '21 13:03 f-f

I think if there's a path to letting package authors request revisions without human intervention then that addresses my concern. Forbidding build metadata and not including a separate revision field to start with sounds good to me :+1:

Mar 23 '21 15:03 hdgarrood

As of https://github.com/purescript/registry/pull/310 and https://github.com/purescript/registry/pull/305 packages are not allowed to have build metadata or prerelease identifiers. There is nothing in the spec so far about revisions using something like build metadata, but it's something we could work on.

Jan 29 '22 06:01 thomashoneyman

We should move forward with this, so it's in place for when we need it.

The rough execution plan looks like this:

we'd add a policy document, probably/possibly embedded in the Spec, along the lines of Haskell's Trustees policy
we'd then spec and implement an Operation for Trustees to patch a package (dependencies only or including code changes, depending on where we settle policy-wise). These would go through GitHub only - as the Package Set ones do - for visibility.
finally we'd change to the Version datatype to add the Build Metadata (from SemVer), and only allow Trustees-published versions to have the special-Registry-build-metadata. This will require changing the sorting algorithm to include it so that things sort correctly, which is a divergence from SemVer. This would be an opportunity to reintroduce prereleases with correct sorting, which is out of the scope of this issue but I think we should discuss nonetheless as during the alpha a few queries came up about it.

Oct 11 '23 20:10 f-f

This would be an opportunity to reintroduce prereleases with correct sorting

Are you proposing that we re-introduce support for pre-releases?

Oct 11 '23 22:10 thomashoneyman

Are you proposing that we re-introduce support for pre-releases?

It's not an official proposal yet but I have warmed up to the idea after seeing the Registry deployed 😄

Oct 12 '23 05:10 f-f

registry-dev registry-dev copied to clipboard

Allowed versions and Trustees publishing

registry-dev
registry-dev copied to clipboard