
Reasons for github, bitbucket and generic types?

Open jdillon opened this issue 5 years ago • 16 comments

I was just looking over the spec again and noticed we have these non-type specific generic formats:

  • github
  • bitbucket
  • generic

These do not actually express what the package is, and I think these types are an anti-pattern.

You could imagine that all other types could be expressed as these, but that does not help to indicate what the type of that thing is, only where it comes from as some opaque binary, or, in the case of "generic", that it's just some named octet-stream. I think these are perversions of the intended nature of this specification and should be removed.

You could consider having a type that is http or https, but then it's really just a URL re-expressed as a PURL, and that doesn't help anything. The point of this spec is, IIUC, to identify packages (which have a specific known type and some agreed-upon coordinates). So these github, bitbucket and generic types are really useless and IMO harmful to the viability of the package-url specification.

jdillon avatar Apr 16 '19 01:04 jdillon

I guess another way to look at this is that "github" and "bitbucket" are where to get something, not what the thing is. I think github and bitbucket are more applicable to an scm-url abstraction, but not to a package-url abstraction. In the latter, they don't express enough normalized information about what the thing is to even know how to deal with it. It may as well just be an http URL. And if you really wanted to provide some additional namespace, name and version on top of that, then generic would work better too, though that still doesn't provide any more details to infer what the thing is, only where you get it plus what to call it, which isn't useful either.

jdillon avatar Apr 16 '19 01:04 jdillon

+1 for removing these

grv87 avatar Jun 06 '19 17:06 grv87

I'm good with removing 'github' and 'bitbucket', but I'm on the fence about 'generic'. We may want to leave that in the spec as a fall-back.

stevespringett avatar Jun 06 '19 21:06 stevespringett

  • gitlab and sourceforge

grv87 avatar Jun 07 '19 13:06 grv87

The rationale for github, bitbucket, gitlab, sourceforge and similar as types is explained in these discussions:

  • https://github.com/package-url/purl-spec/pull/1#discussion_r152653257
  • https://github.com/package-url/purl-spec/pull/1#discussion_r151955703
  • https://github.com/package-url/purl-spec/pull/1#discussion_r151820754

As for the need for something "generic" see this:

  • https://github.com/package-url/purl-spec/pull/1#discussion_r152008188

The overall rationale there is that there is a need to identify things that are important and not in a package registry. This would be typical for things such as upstream OpenSSL, upstream GCC, upstream nginx and so on. These would typically be important and yet not federated in some package repository.

pombredanne avatar Nov 25 '19 21:11 pombredanne

-1 for removing these unless there is a concise alternative

If sub-paths are allowed as discussed in #63, then one could have e.g. pkg:git/github.com/package-url/purl-spec or pkg:git/gitlab.denx.de/u-boot/u-boot, which is nicely readable.

On the contrary, something like pkg:generic/bitwarderl?vcs_url=https://git.fsfe.org/dxtr/bitwarderl is for me very unfriendly.
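To make the comparison concrete, here is a minimal Python sketch of how the two spellings break down into purl components. `Purl` and `parse_purl` are hypothetical names, not taken from any purl library. The sketch assumes the general `pkg:type/namespace/name@version?qualifiers#subpath` layout and skips validation and edge cases such as an unencoded `@` inside a name. Note that `git` is not a type in the spec; the first form depends on the sub-path proposal in #63.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class Purl(NamedTuple):
    type: str
    namespace: Optional[str]
    name: str
    version: Optional[str]
    qualifiers: dict
    subpath: Optional[str]


def _split_right(text, sep):
    """Split once from the right; return (text, None) if sep is absent."""
    head, found, tail = text.rpartition(sep)
    return (head, tail) if found else (text, None)


def parse_purl(purl: str) -> Purl:
    """Hypothetical, simplified parser: no validation beyond the scheme."""
    rest, subpath = _split_right(purl, "#")
    rest, query = _split_right(rest, "?")
    scheme, found, rest = rest.partition(":")
    if not found or scheme.lower() != "pkg":
        raise ValueError("not a purl: %r" % purl)
    ptype, found, rest = rest.strip("/").partition("/")
    if not found or not rest:
        raise ValueError("missing name: %r" % purl)
    rest, version = _split_right(rest, "@")
    namespace, _, name = rest.rpartition("/")
    qualifiers = {}
    for pair in (query or "").split("&"):
        key, _, value = pair.partition("=")
        if key and value:
            qualifiers[key.lower()] = unquote(value)
    return Purl(ptype.lower(), namespace or None, name, version, qualifiers, subpath)


readable = parse_purl("pkg:git/gitlab.denx.de/u-boot/u-boot")
generic = parse_purl("pkg:generic/bitwarderl?vcs_url=https://git.fsfe.org/dxtr/bitwarderl")
print(readable.namespace, readable.name)
print(generic.name, generic.qualifiers)
```

In the first form the host and owner land in the namespace; in the second, the location is only available through the `vcs_url` qualifier.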

gotthardp avatar Aug 02 '20 12:08 gotthardp

-1 for removing all of these.

A vast amount of software is not in language-specific repositories and never will be, but they still need to be referred to. The Linux kernel, CPython, and many other things are in that camp.

Having specific names for common repositories like GitHub & Bitbucket is useful because once you know it's on GitHub or BitBucket there are a number of specific things you can know how to handle. You also need a "generic" one to handle the many programs that aren't on GitHub, Bitbucket, or other generic repos; some, for example, have their own.

If the "generic" value could be made nicer-looking I'm all for that, but there needs to be a specific suggested alternative.

david-a-wheeler avatar Oct 01 '20 01:10 david-a-wheeler

What does github: actually mean? Does it mean a git checkout from the specified ref? (If so, depth 1 or full clone?) Does it mean a download of release from GitHub? (If so, which file, since there is often more than one?) This seems under-specified currently.

MarkLodato avatar Feb 08 '21 22:02 MarkLodato

@david-a-wheeler, examples that you provide — Linux kernel, CPython — aren't consumable packages. Of course you could download source code — but it doesn't tell you how to process it.

I think these packages should have make type. The same approach as with golang. Also, some others would require cmake.

Of course, this doesn't cover Windows, but I don't see what purl could do here until all cross-platform projects start to use CMake or Conan. For CPython under Windows, maybe pkg:generic/cpython@version?vcs_url=git+https://github.com/python/cpython#PCbuild/build.bat?

grv87 avatar May 02 '21 14:05 grv87

@MarkLodato re:

What does github: actually mean? Does it mean a git checkout from the specified ref? (If so, depth 1 or full clone?) Does it mean a download of release from GitHub? (If so, which file, since there is often more than one?)

It means either a checkout or a download at some commitish which could be a tag or a commit for the whole repo. If you want to identify a subset as a single path use the #subpath for this. If you want to identify a specific download asset that would have been added to a "release", then I suggest using the "download_url" qualifier for this.
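As a rough illustration of the three cases described above (whole repo at a commitish, a single path via the subpath, a release asset via `download_url`), here is a small Python sketch. `build_purl` is a hypothetical helper, not an API of any purl library; the commit is the one quoted elsewhere in this thread, and the `example.com` asset URL is invented for the example. Leaving `:` and `/` unescaped in qualifier values is a readability choice of this sketch, not something this thread specifies.

```python
from urllib.parse import quote


def build_purl(ptype, namespace, name, version=None, qualifiers=None, subpath=None):
    """Hypothetical helper: assemble a purl string from its components."""
    parts = ["pkg:", ptype.lower(), "/"]
    if namespace:
        parts += [quote(namespace, safe="/"), "/"]
    parts.append(quote(name, safe=""))
    if version:
        parts += ["@", quote(version, safe="")]
    if qualifiers:
        # Sort keys so the same input always yields the same string.
        pairs = [
            "%s=%s" % (key.lower(), quote(value, safe=":/"))
            for key, value in sorted(qualifiers.items())
            if value
        ]
        if pairs:
            parts += ["?", "&".join(pairs)]
    if subpath:
        parts += ["#", quote(subpath.strip("/"), safe="/")]
    return "".join(parts)


commit = "244fd47e07d1004"

# Whole repository at a commit.
whole_repo = build_purl("github", "package-url", "purl-spec", version=commit)

# A single path inside the repository at that commit.
one_file = build_purl("github", "package-url", "purl-spec", version=commit,
                      subpath="README.rst")

# A specific release asset (the URL is a made-up placeholder).
asset = build_purl("github", "package-url", "purl-spec", version=commit,
                   qualifiers={"download_url": "https://example.com/purl-spec.tar.gz"})

for purl in (whole_repo, one_file, asset):
    print(purl)
```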

This seems under-specified currently.

Can you elaborate? Let's make it better in any case!

pombredanne avatar May 03 '21 14:05 pombredanne

@grv87 re:

@david-a-wheeler, examples that you provide — Linux kernel, CPython — aren't consumable packages. Of course you could download source code — but it doesn't tell you how to process it. I think these packages should have make type. The same approach as with golang. Also, some others would require cmake. Of course, this doesn't cover Windows, — but I don't see what purl could do here, until all cross-platform projects start to use CMake or Conan. For CPython under Windows, maybe, pkg:generic/cpython@version?vcs_url=git+https://github.com/python/cpython#PCbuild/build.bat?

I do not think I would want to have purl specify how to process some code. Identifying a package yes, but what to do with it would likely be out of scope.

That said, I could see how an "autotools" or "autoconf" package type could make sense in some cases, but only weakly: you can have a name and version defined there in a quasi-manifest-like file, with a proper configure.ac input such as the one linked below. But this would be pretty close to "generic" in all other respects, since it (type/ns/name@version) would not be enough to locate the package short of a qualifier for a download or VCS URL.

See https://github.com/jeremylong/DependencyCheck/blob/3d097c4838d21d861753a0ea77cde92af19fac40/core/src/test/resources/autoconf/readable-code/configure.ac#L46 for an example of something that resembles a name and version in an autoconf file.

Can we get that from cmake too? Not sure.

See also @david-a-wheeler's excellent tutorials on the topic at https://web.archive.org/web/20141029232210/http://www.dwheeler.com/autotools

pombredanne avatar May 03 '21 15:05 pombredanne

@pombredanne

What does github: actually mean? Does it mean a git checkout from the specified ref? (If so, depth 1 or full clone?) Does it mean a download of release from GitHub? (If so, which file, since there is often more than one?)

It means either a checkout or a download at some commitish which could be a tag or a commit for the whole repo. If you want to identify a subset as a single path use the #subpath for this. If you want to identify a specific download asset that would have been added to a "release", then I suggest using the "download_url" qualifier for this.

IMO it would be best to remove all VCS URLs because SPDX Download Location already serves the same purpose. The downside is that it gives a particular protocol (https vs git vs ssh), but IMO the simple solution is for implementations to understand the mapping and always prefer the https one.

If you end up keeping it, here are my suggestions:

  • There should be a "git" type, not "GitHub". You can just as easily use "github.com" as a namespace. This is both better documenting and also more easily extensible to other hosts. You'd have to restrict this to https, I guess.
  • The documentation says "github for Github-based packages", but it's not GitHub packages, it's plain old git.
  • The documentation does not explain that this is a checkout. Think of it this way: if a compliant implementation fetches a bare repo, or a shallow clone, or throws away the .git directory, does that comply with the spec? Right now that is not explained. My recommendation is to say "checkout, MAY be shallow, MUST have the .git directory".

MarkLodato avatar May 03 '21 19:05 MarkLodato

SPDX is not the only SBOM format, nor was it the first SBOM format to support purl. Most purl implementations are outside of SBOM formats. The goal of purl is to identify and locate. Let's not bring SBOM formats into this discussion.

stevespringett avatar May 03 '21 20:05 stevespringett

@stevespringett This has nothing to do with SBOM or SPDX itself. Within the SPDX spec, there is a URI format that describes a download of a version from a VCS such as git, hg, or svn. It is actually very similar to PURL. Example: git+https://github.com/package-url/purl-spec@master.
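For readers unfamiliar with that format, the following Python sketch splits such a VCS URL into the tool, the transport URL and the ref. `split_vcs_url` is a hypothetical helper written for this illustration; it only handles the `tool+transport://host/path@ref` shape shown in the example and ignores fragments and other variants.

```python
from urllib.parse import urlsplit, urlunsplit


def split_vcs_url(vcs_url):
    """Hypothetical helper: return (tool, repo_url, ref) for 'tool+transport://...@ref'."""
    parts = urlsplit(vcs_url)
    tool, plus, transport = parts.scheme.partition("+")
    if not plus:
        raise ValueError("expected a 'tool+transport' scheme: %r" % vcs_url)
    # Look for the ref in the path only, so 'user@host' in the netloc is untouched.
    path, at, ref = parts.path.rpartition("@")
    if not at:
        path, ref = parts.path, None
    repo_url = urlunsplit((transport, parts.netloc, path, parts.query, ""))
    return tool, repo_url, ref


print(split_vcs_url("git+https://github.com/package-url/purl-spec@master"))
```

The tool and transport are explicit here, which is the main visible difference from a `pkg:github/...` purl that names a hosting service instead.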

MarkLodato avatar May 03 '21 20:05 MarkLodato

Dear @MarkLodato , you wrote:

IMO it would be best to remove all VCS URLs because SPDX Download Location already serves the same purpose. The downside is that it gives a particular protocol (https vs git vs ssh), but IMO the simple solution is for implementations to understand the mapping and always prefer the https one.

I am sorry if this feels a bit confusing at first, but github, gitlab or bitbucket package types are not meant to be VCS URLs. FWIW, I happened to have contributed the spec at https://spdx.github.io/spdx-spec/3-package-information/#37-package-download-location which I stole from the Python pip specs; they are complementary to and usable with purls, but they are not purls. VCS URLs are strictly focused on how to access and collect files over some version control system using a certain transport.

In contrast, the Package URL types for github, gitlab or bitbucket capture something larger. Many things are covered by these beyond just a git repo: there are APIs, issues, releases, wikis, etc., all of which can be determined from the type and which together form something that is package-like and goes beyond a "mere" VCS URL.

If you end up keeping it, here are my suggestions:

  • There should be a "git" type, not "GitHub". You can just as easily use "github.com" as a namespace. This is both better documenting and also more easily extensible to other hosts. You'd have to restrict this to https, I guess.

I get your point, but in light of my explanation above, a git type would be strictly limited to VCS operations and therefore would not be a purl to me; it would be better described by a VCS URL.

  • The documentation says "github for Github-based packages", but it's not GitHub packages, it's plain old git.

You have a good point! Github added a package repo feature later. This creates an interesting twist alright. The way I would handle this would be to have not one but multiple purls. Say there is a GitHub repo that exposes also GitHub packages for instance this example: https://github.com/Codertocat/hello-world-npm

I can spot these two purls:

  1. pkg:github/Codertocat/[email protected]
  2. pkg:npm/@codertocat/[email protected]?repository_url=https://npm.pkg.github.com

And this VCS URL (and likely more like this):

  1. git+https://github.com/Codertocat/[email protected]

You also wrote:

  • The documentation does not explain that this is a checkout. Think of it this way: if a compliant implementation fetches a bare repo, or a shallow clone, or throws away the .git directory, does that comply with the spec? Right now that is not explained. My recommendation is to say "checkout, MAY be shallow, MUST have the .git directory".

That's a good point, and maybe this would apply to a VCS URL then? Or maybe there is something I do not understand: to me, when I consider some checkout or archive at some commitish or tag, I do not see much difference between having a .git directory or not, or between a shallow and a deep clone. The code is the same to me in all cases, and the presence or absence of VCS metadata may not matter? If there is something to specify wrt. VCS URLs or a github/gitlab purl type, it would therefore rather be:

"checkout, export or archive, ignoring the version control meta files and directories such as .git, .svn or .hg"

Maybe you can elaborate a little on your use case? Is this to support the spec at https://github.com/in-toto/attestation ?

And you further wrote:

This has nothing to do with SBOM or SPDX itself. Within the SPDX spec, there is a URI format that describes a download of a version from a VCS such as git, hg, or svn. It is actually very similar to PURL. Example: git+https://github.com/package-url/purl-spec@master.

You might have noticed that these are referenced in the spec here: https://github.com/package-url/purl-spec/blame/b6f01891a7dca9e81973a119f96080724fba1c9f/README.rst#L172

  • Version control system (VCS) URLs such git://, svn://, hg:// or as defined in Python pip or SPDX download locations are NOT valid purl types. They are valid URL or URI schemes but they are not purl. They are a closely related, compact and uniform way to reference vcs URLs. They may be used as references in separate attributes outside of a purl or in a purl qualifier.

And here are links to some of the original discussions on the topic of (github, bitbucket, gitlab, sourceforge) that may be of interest:

  • https://github.com/package-url/purl-spec/pull/1#discussion_r152653257 "bitbucket isn't really a consumer though, is it? Like, what kind of package does that specifier refer to?"

  • https://github.com/package-url/purl-spec/pull/1#discussion_r151955703 "But by including github, you make the implicit claim that a repository on Github can be seen as a software package."

  • https://github.com/package-url/purl-spec/pull/1#discussion_r151820754 "I'm not sure why github: and bitbucket: would be valid but git: would not be (the same applies to the others)? "

pombredanne avatar May 06 '21 14:05 pombredanne

@pombredanne Thanks for the detailed and thoughtful response!

In contrast, the Package URL types for github, gitlab or bitbucket capture something larger. Many things are covered by these beyond just a git repo: there are APIs, issues, releases, wikis, etc., all of which can be determined from the type and which together form something that is package-like and goes beyond a "mere" VCS URL.

That makes sense, but it doesn't appear to be documented anywhere. The docs say that purl's goal is to "reliably identify and locate software packages" but never define the term "package." I interpreted "package" to be some software artifact (one or more blobs of data), and thus to locate it means to know how to download it. I suggest documenting this somewhere, and documenting what github actually means.

Does this same definition apply to other package types? For example, if you use a different repository_url with docker, doesn't that affect other metadata about the package?

The way I would handle this would be to have not one but multiple purls.

Agreed, though I still think you'll need a type for "a file downloaded from GitHub releases". Not all releases can fit into the seven package registries that GitHub supports. Example: https://github.com/curl/curl/releases.

That's a good point, and maybe this would apply to a VCS URL then? Or maybe there is something I do not understand: to me, when I consider some checkout or archive at some commitish or tag, I do not see much difference between having a .git directory or not, or between a shallow and a deep clone. The code is the same to me in all cases, and the presence or absence of VCS metadata may not matter? If there is something to specify wrt. VCS URLs or a github/gitlab purl type, it would therefore rather be:

"checkout, export or archive, ignoring the version control meta files and directories such as .git, .svn or .hg"

Maybe you can elaborate a little on your use case? Is this to support the spec at https://github.com/in-toto/attestation ?

Sorry for the confusion. My thinking was that a purl would unambiguously identify a software artifact. When you "fetch" a purl, you always get the same bit-for-bit identical output regardless of implementation (assuming that the server serves the same data each time). For example, when asked to fetch pkg:deb/debian/dpkg@1.19.0.4?arch=amd64, the implementation would download the file located at https://deb.debian.org/debian/pool/main/d/dpkg/dpkg_1.19.0.4_amd64.deb. But with pkg:github/package-url/purl-spec@244fd47e07d1004, an implementation could fetch that and check it out as a bare repo, or as a directory with .git, or without .git, all still being compliant with the spec. That seems wrong. But maybe I'm using purl incorrectly.
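The deb case can be sketched in a few lines of Python, which shows why it feels deterministic: the name, version and arch map onto exactly one file. `deb_pool_url` is a hypothetical helper, and the sketch rests on several assumptions that do not hold in general: the package lives in the `main` component, the source package name equals the binary package name, and the default mirror is the one from the example above. The version `1.19.0.4` is read off the download URL in that example.

```python
def deb_pool_url(name, version, arch,
                 mirror="https://deb.debian.org/debian", component="main"):
    """Hypothetical helper: guess the pool URL of a Debian binary package.

    Assumes source package name == binary package name, which is not
    true for every package.
    """
    # Pool directories are keyed by first letter, or 'lib' plus one letter.
    prefix = name[:4] if name.startswith("lib") else name[0]
    # An epoch ('1:2.3-4') is not part of the file name.
    file_version = version.split(":", 1)[-1]
    return "%s/pool/%s/%s/%s/%s_%s_%s.deb" % (
        mirror, component, prefix, name, name, file_version, arch)


# Components of pkg:deb/debian/dpkg@1.19.0.4?arch=amd64
print(deb_pool_url("dpkg", "1.19.0.4", "amd64"))
```

No comparable one-line mapping exists for the github case, since the same purl could be satisfied by a clone, an export or a release archive.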

Yes, my interest is with attestations, particularly provenance. My thinking is that, if the provenance records all transitive dependencies as purls, then a build system could fetch all of those artifacts up front then run the build hermetically (no network access). But to do that, the system needs to know how to resolve a purl into files on disk. (This is still very hand wavy right now - I hope to flesh out a more concrete proposal in the coming months.)

MarkLodato avatar May 07 '21 17:05 MarkLodato