How are golang sub-modules supposed to be expressed by purl?

Open andrewstein opened this issue 4 years ago • 26 comments

I am confused reading the spec for purl in relation to golang sub-modules. For example, looking at the submodule expressed in this go.mod file: https://github.com/go-modules-by-example/submodules/blob/master/a/go.mod, released by the a/v1.0.0 tag: https://github.com/go-modules-by-example/submodules/releases

Is the purl:

  1. pkg:golang/github.com/go-modules-by-example/submodules/[email protected]
  2. pkg:golang/github.com/go-modules-by-example%2Fsubmodules%[email protected]
  3. pkg:golang/github.com/[email protected]#submodule/a
  4. pkg:golang/github.com/go-modules-by-example/[email protected]#a
  5. pkg:golang/github.com%2Fgo-modules-by-example%2Fsubmodules%[email protected]

It basically comes down to what is the namespace (if any), what is the name and what is the sub-path (if any) for this submodule.

andrewstein avatar Aug 01 '19 21:08 andrewstein

A followup note: even setting golang sub-modules aside, I am not sure the spec reflects how golang works. https://github.com/package-url/purl-spec#known-purl-types gives the example

pkg:golang/github.com/gorilla/context@234fd47e07d1004f0aed9c

This implies that the namespace is github.com/gorilla and the name is context. This does not seem right to me. I would expect the name to be gorilla/context in the github.com namespace, or preferably github.com/gorilla/context without a namespace. Leading to one of the following purls:

  1. pkg:golang/github.com/gorilla%2Fcontext@234fd47e07d1004f0aed9c
  2. pkg:golang/github.com%2Fgorilla%2Fcontext@234fd47e07d1004f0aed9c

But maybe I am just being argumentative here.
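
To make the three readings concrete, here is a minimal sketch of how a purl path splits into namespace and name. `split_golang_purl` is a hypothetical helper, not part of any purl library. It assumes the last `/`-separated segment is the name, decodes `%2F` only after splitting, and ignores qualifiers and subpath.

```python
from urllib.parse import unquote


def split_golang_purl(purl):
    """Return (namespace, name, version) for a pkg:golang purl.

    Simplified illustration only: qualifiers and subpath are dropped,
    and no case normalization is applied.
    """
    prefix = "pkg:golang/"
    if not purl.startswith(prefix):
        raise ValueError("not a golang purl: " + purl)
    rest = purl[len(prefix):]
    # Drop subpath (#...) and qualifiers (?...) for this sketch.
    rest = rest.split("#", 1)[0].split("?", 1)[0]
    if "@" in rest:
        path, version = rest.rsplit("@", 1)
    else:
        path, version = rest, None
    # Split on literal slashes first, then percent-decode each segment,
    # so an encoded %2F stays inside its segment.
    segments = [unquote(s) for s in path.split("/")]
    name = segments[-1]
    namespace = "/".join(segments[:-1]) or None
    return namespace, name, version


for purl in (
    "pkg:golang/github.com/gorilla/context@234fd47e07d1004f0aed9c",
    "pkg:golang/github.com/gorilla%2Fcontext@234fd47e07d1004f0aed9c",
    "pkg:golang/github.com%2Fgorilla%2Fcontext@234fd47e07d1004f0aed9c",
):
    print(split_golang_purl(purl))
```

The same import path ends up with three different namespace/name pairs depending only on which slashes were percent-encoded.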

andrewstein avatar Aug 15 '19 13:08 andrewstein

I don't believe use of subpath here is appropriate, as IIUC subpath is used to point to something inside of a package. From the spec: "subpath: extra subpath within a package, relative to the package root."

It's certainly a bit wrinkly with golang modules' definition of repository and module though, and maybe subpath should be expanded for that use-case? Though I think that, just as HTML anchors and URL fragments point to something inside of a page, the purl subpath points to something inside of a specific package.

jdillon avatar Aug 15 '19 18:08 jdillon

Regarding the github org/user and repository bits, IIUC golang's module stuff doesn't require a module be a github url or a git repository (though it may most commonly be such).

It does not appear that the coordinates used for golang's modules really care about that. After a very brief scan of the docs I didn't see that the value for require was even defined (I could have missed it), but it looks like it is generally just a "host:path version"?

For a git submodule package, it looks like the only wrinkle is finding the tag: you need to know the root repository location so you can then figure out what the path to the sub-module is?

It may also depend on what one would do with a golang purl. It seems like no matter how you spin it some translation would have to be done, but I think that's probably fine. For example, a maven purl with dot notation in groupId would have to get translated to slash notation for resolving a file on disk or a remote repository location.

So my guess is that avoiding any front-loaded assumptions on the golang package url is probably simplest, and that your first example:

pkg:golang/github.com/go-modules-by-example/submodules/[email protected]

... is probably reasonable.

Just my 0.02 though... I'm not a golang module expert by far ;-)

jdillon avatar Aug 15 '19 19:08 jdillon

from https://github.com/golang/go/wiki/Modules:

Modules must be semantically versioned according to semver, usually in the form v(major).(minor).(patch), such as v0.1.0, v1.2.3, or v1.5.0-rc.1. The leading v is required. If using Git, tag released commits with their versions. Public and private module repositories and proxies are becoming available (see FAQ below).

If the "leading v is required" then maybe the purl form is:

pkg:golang/github.com/go-modules-by-example/submodules/[email protected]

... though it's not really clear if that's a hard requirement or not.

jdillon avatar Aug 29 '19 01:08 jdillon

I think this is an actual problem.

Here's some real life examples:

| Go module name | namespace | name | |
|---|---|---|---|
| github.com/gorilla/context | github.com/gorilla | context | :+1: |
| github.com/Azure/go-autorest/logger | github.com/Azure/go-autorest | logger | :woman_shrugging: |
| rsc.io/quote/v3 | rsc.io/quote | v3 | :-1: |

1st one makes sense to me. 2nd one could go either way: some might consider it correct, others might say it should be namespace = github.com/Azure, name = go-autorest/logger. 3rd one is a problem. Go treats major version numbers as a separate module. So rsc.io/quote v1.0.0 is different than rsc.io/quote/v3 v3.0.0 (and it's illegal to say rsc.io/quote v3.0.0 without the /v3).

To ensure consistency we should document how to handle submodules.

Some low effort options I can think of:

  1. Continue splitting things up the way they are now, where name = v3: rsc.io/quote/v3 and github.com/Azure/go-autorest/logger Action: nothing
  2. Say all submodules and/or major versions are part of the name field: rsc.io/quote%2Fv3 and github.com/Azure/go-autorest%2Flogger Action: add some README examples.
  3. Say Go only cares about the name, not namespace, so all slashes need percent encoding: rsc.io%2Fquote%2Fv3 and github.com%2FAzure%2Fgo-autorest%2Flogger Action: change README examples.
  4. Say the repository (rsc.io, github.com) is the namespace, everything else is the name: rsc.io/quote%2Fv3 and github.com/Azure%2Fgo-autorest%2Flogger Action: change README examples.
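
As a rough sketch, the options that need nothing but the module path string (1, 3 and 4) can be generated mechanically. Option 2 is left out because it needs to know where the submodule or major-version suffix begins, which the string alone does not say. `option_purls` is a hypothetical helper for illustration; it emits no version and no subpath.

```python
from urllib.parse import quote


def encode(segment):
    # Percent-encode everything that is not unreserved, including "/".
    return quote(segment, safe="")


def option_purls(module):
    """Build the purl strings for options 1, 3 and 4 from a Go module path."""
    segments = module.split("/")
    host, _, rest = module.partition("/")
    return {
        # Option 1: every slash stays a separator, so the last segment is the name.
        1: "pkg:golang/" + "/".join(encode(s) for s in segments),
        # Option 3: no namespace, the whole module path is the name.
        3: "pkg:golang/" + encode(module),
        # Option 4: the host is the namespace, everything after it is the name.
        4: "pkg:golang/" + encode(host) + ("/" + encode(rest) if rest else ""),
    }


for module in ("rsc.io/quote/v3", "github.com/Azure/go-autorest/logger"):
    for number, purl in option_purls(module).items():
        print(number, purl)
```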

bradcupit avatar Nov 22 '19 20:11 bradcupit

The original intent has been to use subpath for Go, but this pre-dates the rise of modules. Actually, AFAICR subpath was added specifically to support Go "packages".

@andrewstein with your examples:

  • module github.com/go-modules-by-example/submodules/a should be : pkg:golang/github.com/go-modules-by-example#submodules/a

  • require github.com/go-modules-by-example/submodules/b v0.1.1 should be : pkg:golang/github.com/[email protected]#submodules/b

@bradcupit with your examples:

  • pkg:golang/github.com/gorilla/context
  • pkg:golang/github.com/Azure/go-autorest#logger
  • pkg:golang/rsc.io/quote#v3 (NB see also #67 for a discussion on go vs. golang)

My personal preference would be to avoid overloading the namespace and name and continue to use the subpath if this can make sense generally for the Go community and experienced Go folks ( @robpike ping! ).

The rationale is that in practice a good number if not a majority of public Go modules do end up fitting this approach: there is some repo or web site (Github, Gitlab, Bitbucket) that has mostly a two-level structure: "org or owner or user"/"name of project" and that level is typically what has a common set of attributes (ownership, team, release process, licensing, etc.) and there are "subpath" that extend inside this which are things effectively imported in Go.

To the best of my knowledge this ("org or owner or user"/"name of project") is also what the Go toolchain would fetch in a workspace: the whole namespace/name would be fetched and specific subpaths would be selectively imported (I may be wrong there as I did not dive deep inside go get and Go modules code.)

Side note: IMHO there would not be many Package URL use cases to reference a specific deeply nested piece of Go code (e.g. using a subpath as suggested here) as opposed to the whole ns/name at once. What would be yours?

pombredanne avatar Nov 25 '19 16:11 pombredanne

@pombredanne thank you so much for responding!

tl;dr: Though the existing purl spec works, I think we've accidentally made something impossible for our users.

  • module github.com/go-modules-by-example/submodules/a should be: pkg:golang/github.com/go-modules-by-example#submodules/a

That proposal works with all the existing code and examples. Users can take pkg:golang/github.com/go-modules-by-example#submodules/a and one of the purl libraries can split it to the various namespace, name, version, etc. parts. From there if a user wants to determine the Go module name, they can do so easily. We don't have to change anything in the spec or libraries.

Having said that, users (including the team I'm on) will write code that converts a Go module name and version to a purl string. This is easy for well known repos like github and bitbucket, but difficult for custom module names. Here's a real-world example:

v.io/x/ref/lib/flags/sitedefaults

Where does the parent module end and the submodule begin? What's the namespace and what's the name? We can't tell the answer to either without analyzing the Go module's git repo.

If users write their own code to do this should they set namespace = v.io, name = x, and subpath = ref/lib/flags/sitedefaults? In this particular case we can look at the Go module's git repo and see the parent module name is v.io so there is no namespace. That means our users would've chosen the incorrect namespace, and got a different purl string as the final output: pkg:golang/v.io/x#ref/lib/flags/sitedefaults vs pkg:golang/v.io#x/ref/lib/flags/sitedefaults (the # appears in a different spot).

Ultimately we can only guide our users and the onus is on them to split things up correctly. But I can't see a reliable way to split go modules into namespace and name without analyzing the module's git repo. And I'd assume most code converting a module name to a purl string will just have the module name string, not the entire git repo, as is the case for my company.
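
A small sketch of the mismatch described above: the same module path, assembled with two different guesses at the split, produces two different purl strings. `build_golang_purl` is a hypothetical helper (not a purl library API) that omits version and qualifiers.

```python
from urllib.parse import quote


def build_golang_purl(namespace, name, subpath=None):
    """Assemble a pkg:golang purl from already-split components."""
    parts = []
    if namespace:
        # Namespace segments are encoded one by one; slashes between them stay.
        parts.extend(quote(s, safe="") for s in namespace.split("/"))
    parts.append(quote(name, safe=""))
    purl = "pkg:golang/" + "/".join(parts)
    if subpath:
        purl += "#" + "/".join(quote(s, safe="") for s in subpath.split("/"))
    return purl


# A guess made from the string alone: host as namespace, next segment as name.
guessed = build_golang_purl("v.io", "x", "ref/lib/flags/sitedefaults")
# The split found by inspecting the repository: the parent module is v.io.
inspected = build_golang_purl(None, "v.io", "x/ref/lib/flags/sitedefaults")

print(guessed)
print(inspected)
print(guessed == inspected)
```

Two tools that disagree on the split would therefore fail to match the same module by string comparison.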

Idea

tl;dr just README changes, no code changes, but we percent-encode a lot more

Perhaps we should consider changing the README examples so they don't use namespace and instead only use name? And since we can't always tell where the parent module ends and the submodule begins we could also treat submodules the same as module names, instead of like subpaths. These two suggestions make it much easier for users to set the right values for namespace (which would always be blank now) and name, and then get consistent purl strings as the output. The downside: names are percent encoded, so the README purl strings would change. Examples:

| Go module/submodule | before | after |
|---|---|---|
| github.com/gorilla/context | pkg:golang/github.com/gorilla/context | pkg:golang/github.com%2Fgorilla%2Fcontext |
| rsc.io/quote/v3 | pkg:golang/rsc.io/[email protected]#v3 | pkg:golang/rsc.io%2Fquote%[email protected] |
| v.io/x/ref/lib/flags/sitedefaults | pkg:golang/v.io#x/ref/lib/flags/sitedefaults | pkg:golang/v.io%2Fx%2Fref%2Flib%2Fflags%2Fsitedefaults |

I can't think of any other way to make these two problems easier on users. Thoughts?
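
For illustration, the "after" column can be produced from the module path alone, with no knowledge of the repository layout. This is a minimal sketch of the idea, not spec text: `module_to_purl` is a hypothetical name, and passing the version through with its leading `v` is an assumption.

```python
from urllib.parse import quote


def module_to_purl(module, version=None):
    """Encode a whole Go module path as the purl name, with no namespace."""
    purl = "pkg:golang/" + quote(module, safe="")
    if version:
        # Assumption: the version is kept as given, including any leading "v".
        purl += "@" + quote(version, safe="")
    return purl


print(module_to_purl("github.com/gorilla/context"))
print(module_to_purl("v.io/x/ref/lib/flags/sitedefaults"))
print(module_to_purl("rsc.io/quote/v3", "v3.0.0"))
```

Because there is no splitting decision to make, two independent implementations cannot disagree on the output.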

bradcupit avatar Dec 20 '19 14:12 bradcupit

@bradcupit I agree with your proposal: for go, there is no “namespace/name” concept. And if one is to drag submodules into the mix, there is no way to know, just looking at the import path, where the module ends and the submodule begins. Treating the whole thing as a single name is the only way as far as I can see.

andrewstein avatar Dec 20 '19 16:12 andrewstein

In @bradcupit's proposal, would using subpath to point to subpackages (not declared as submodules) still make sense?

For instance, if a purl should point to v.io/x/ref, would it make any difference to assemble the purl as pkg:golang/v.io%2Fx%2Fref or as pkg:golang/v.io#x/ref? It seems like it would still make sense to use the first option and not use the subpath here since we could suffer from the same issue of not knowing where to split the components. However, would the second form still be valid?

In other words, should the approach be valid for both subpackages and submodules?

athos-ribeiro avatar Mar 26 '20 10:03 athos-ribeiro

Please consider also readability and auditability of the PURL. From the usability perspective, pkg:golang/v.io#x/ref or even pkg:golang/v.io/x/ref (because that is the actual package name) is more easily readable and auditable. The pkg:golang/v.io%2Fx%2Fref is perhaps easier to process for machines, but I prefer usability even if the implementation is a bit harder.

gotthardp avatar Aug 02 '20 12:08 gotthardp

@athos-ribeiro said

would using subpath to point to subpackages (not declared as submodules) still make sense? ... would the second form still be valid?

Sorry for the late reply! It would make sense to me, assuming you need to know the subpath. I don't have a use case for that myself, but if you wanted to point to a particular file inside a go repo using the #subpath would still be valid.

The only reason we're percent encoding the / in the name is because we have to according to the purl spec. If there are no slashes in the name (because they've moved to the subpath and you're trying to point to a subpath instead of identifying a submodule) then there's nothing to percent encode.


@gotthardp said:

Please consider also readability and auditability of the PURL

Yeah, I personally hated what's in my suggestion. I very much prefer the version that's easier to read, meaning, the one without percent encoding, but I don't think the pretty version is realistic.

The pkg:golang/v.io%2Fx%2Fref is perhaps easier to process for machines, but I prefer usability even if the implementation is a bit harder.

That makes sense, and from the perspective of writing the purl-spec it makes sense too, but I think we have to consider how people are going to use the purl-spec. People will have the 'coordinates' of a package and want to convert that into a purl string.

For maven the coordinates are the groupId, artifactId, and version, which is enough to compute a purl string. For Go you can't just have the module name to generate the purl string: you'd need the whole url. So if you instead have a git repo URL as your coordinates you may not have enough info to generate a purl string with the current spec. It works for normal cases, like github repos, but it fails for odd cases like v.io/x/ref. So either you have to require both the VCS repo and the go module name, or you require just the VCS repo, then programmatically clone it and parse the go module name. Now you'd have enough info to generate the purl string.

Or, we just do the simple thing: require only the repo URL and stuff it all in the name field and percent encode it.
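
For contrast, here is a sketch of the Maven case mentioned above, where the coordinates already arrive split into fields, so no guessing is involved. `maven_purl` is a hypothetical helper; the coordinates used are the Maven example from the purl-spec README.

```python
from urllib.parse import quote


def maven_purl(group_id, artifact_id, version):
    """Map Maven coordinates directly onto namespace, name and version."""
    return "pkg:maven/{}/{}@{}".format(
        quote(group_id, safe=""),
        quote(artifact_id, safe=""),
        quote(version, safe=""),
    )


print(maven_purl("org.apache.xmlgraphics", "batik-anim", "1.9.1"))
```

A Go module path offers no such field boundaries, which is why the namespace/name split has to be guessed or looked up.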

bradcupit avatar Aug 03 '20 20:08 bradcupit

~I think the proposal we put forth violates a part of the purl spec:~

~namespace:~ ~...~ ~* When percent-decoded, a segment:~ ~* must not contain a '/'~

~So if we go with the solution proposed here we'd have to change the above part of the spec too, or make an exception for Go.~

bradcupit avatar Sep 23 '20 20:09 bradcupit

@jdillon told me how the namespace encoding works (namespaces can contain slashes and we only encode what's between the slashes) -- plus I was totally wrong, we're proposing ditching the namespace for Go, so please ignore the previous comment.

bradcupit avatar Sep 24 '20 13:09 bradcupit

He also mentioned it wasn't clear what this issue is proposing, so here's the shorter version of what @andrewstein proposed (and what I echo):

Problem

For some repos (not github, not gitlab, but others) it's impossible to convert a repo URL + submodule paths or Go module name + submodule paths to a purl string.

Proposal

  1. stop using namespace in Go purl strings
  2. put the entire Go module name in the name, and percent encode it
  3. make minor updates to the spec, no code changes required

Example

pkg:golang/github.com%2Fgorilla%2Fcontext
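
Going the other way is simple under this proposal. The sketch below recovers the Go module path from a purl; `purl_to_module` is a hypothetical helper that ignores version, qualifiers and subpath. As written it also accepts the older namespace/name form, since decoding the whole path yields the same string either way.

```python
from urllib.parse import unquote


def purl_to_module(purl):
    """Recover the Go module path from a pkg:golang purl (sketch only)."""
    prefix = "pkg:golang/"
    if not purl.startswith(prefix):
        raise ValueError("not a golang purl: " + purl)
    rest = purl[len(prefix):]
    # Drop subpath and qualifiers, then the version if there is one.
    rest = rest.split("#", 1)[0].split("?", 1)[0]
    path = rest.rsplit("@", 1)[0] if "@" in rest else rest
    # Literal and percent-encoded slashes both end up as "/".
    return unquote(path)


print(purl_to_module("pkg:golang/github.com%2Fgorilla%2Fcontext"))
print(purl_to_module("pkg:golang/github.com/gorilla/context"))
```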

bradcupit avatar Sep 24 '20 13:09 bradcupit

Another consideration could be to remove entirely the notion of namespace and merge ns and name in a name component where you can have as many segments as you like. It could be made such that this is backward compatible for every package type. I should say that Go's notion of a package, which is really a subdirectory in some repo, is not really amenable to clean identification (and leads to an explosion of the number of imports being tracked if you care to track things this way for software composition analysis).

pombredanne avatar May 03 '21 15:05 pombredanne

Another consideration could be to remove entirely the notion of namespace and merge ns and name in a name component where you can have as many segments as you like

yes @pombredanne ! 💯 👏 🏆

bradcupit avatar Jun 01 '21 15:06 bradcupit

Just as a snapshot how tools handle that today (for the example https://pkg.go.dev/github.com/russross/blackfriday/v2 in version v2.1.0):

  • ORT: pkg:golang/github.com%2Frussross%2Fblackfriday%[email protected]
  • Syft: pkg:golang/github.com/russross/blackfriday/[email protected]
  • SCTK: pkg:golang/github.com/russross/blackfriday/[email protected]
  • component-detection:
    • "Type": "golang",
    • "Namespace": null,
    • "Name": "github.com/russross/blackfriday/v2",
    • "Version": "v2.1.0",

maxhbr avatar Jun 27 '22 09:06 maxhbr

Another consideration could be to remove entirely the notion of namespace and merge ns and name in a name component where you can have as many segments as you like

Just wanted to mention I've written a bunch of special-casing code for golang this week to try to parse the namespace. The difficulty lies in guessing the number of slashes in a namespace, e.g.

  • google.golang.org: no slashes
  • github.com/spf13: one slash
  • gopkg.in: can be either no-slash (e.g. for the module gopkg.in/yaml.v2 ) or one slash (e.g. for the module gopkg.in/urfave/cli.v1) 😭

And there are even examples of go modules that don't have a namespace, e.g. gotest.tools (this is the full name of the module)
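
A sketch of the kind of special-casing described here, to show where it breaks. The `NAMESPACE_SEGMENTS` table and `guess_split` are hypothetical and are not taken from any real tool. The table records, per host, how many leading segments are assumed to form the namespace.

```python
# Hypothetical per-host table: number of leading segments treated as namespace.
NAMESPACE_SEGMENTS = {
    "github.com": 2,         # host + owner
    "google.golang.org": 1,  # host only
    "gopkg.in": 1,           # wrong for modules like gopkg.in/urfave/cli.v1
}


def guess_split(module):
    """Guess (namespace, name, subpath) from a module path with a fixed-depth rule."""
    segments = module.split("/")
    depth = NAMESPACE_SEGMENTS.get(segments[0])
    if depth is None or len(segments) <= depth:
        # Unknown host, or nothing left over for a name: treat it all as the name.
        return None, module, None
    namespace = "/".join(segments[:depth])
    name = segments[depth]
    subpath = "/".join(segments[depth + 1:]) or None
    return namespace, name, subpath


for module in ("gopkg.in/yaml.v2", "gopkg.in/urfave/cli.v1", "gotest.tools"):
    print(module, "->", guess_split(module))
```

No single depth for gopkg.in handles both of its module shapes: with depth 1 the second module is misread as name `urfave` with subpath `cli.v1`.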

Knowing the number of slashes is important so you can split on them, and guess which part is the namespace, name, or subpath, but it's nearly impossible to do for go. For instance, it's not clear which of these cases is git.host/foo/bar/baz:

  • namespace: "git.host/foo", name: "bar", subpath: "baz"
  • namespace: "git.host", name: "foo", subpath: "bar/baz"
  • namespace: nil, name: "git.host/foo", subpath: "bar/baz"

So splitting on slashes or even having prior knowledge of a VCS host is not really enough to make out the namespace vs the name. Given that, I agree with @bradcupit to squash the idea of a namespace for golang.
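
To put a number on the ambiguity, this sketch enumerates every way a module path can be cut into namespace, name and subpath when the name may span several segments. `candidate_splits` is a hypothetical helper, and git.host/foo/bar/baz is the made-up path from the list above.

```python
def candidate_splits(module):
    """List every (namespace, name, subpath) split of a slash-separated path."""
    segments = module.split("/")
    count = len(segments)
    splits = []
    for start in range(count):                 # segments before start: namespace
        for end in range(start + 1, count + 1):  # segments from end on: subpath
            namespace = "/".join(segments[:start]) or None
            name = "/".join(segments[start:end])
            subpath = "/".join(segments[end:]) or None
            splits.append((namespace, name, subpath))
    return splits


splits = candidate_splits("git.host/foo/bar/baz")
print(len(splits))
for split in splits:
    print(split)
```

The three readings listed above are all among the candidates, and nothing in the string itself favours one of them.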

tiegz avatar Oct 13 '22 01:10 tiegz