purl-spec
A purl spec for the C/C++ package manager vcpkg
This would address Issue #217. It is also being tracked on the vcpkg side in this discussion. https://github.com/microsoft/vcpkg/discussions/32732
I'm contemplating whether or not we can simplify the spec further and whether it's still missing anything vital.
The principle that the components of the purl string should only be what is necessary to disambiguate different packages from each other runs into a problem of having the same package used in different ways by the same consumer (say, it depends on a static build of the package on one platform and a dynamic build of the package on another platform).
Where I'm headed with this is to distinguish between 2 kinds of data.
1. Data that identifies a package.
2. Data that describes how a package is utilized in a particular context.
Data under 1. should be specified in the purl specification and the specification should allow for the flexibility to record data under 2. Properties for that 2. data could be also documented in a subparagraph, but it would be non-normative with regards to the purl spec.
With that approach, I think it means that the existing abi, triplet, and features properties would fall under category 2. And in order to properly distinguish, I think we'd need an indicator whether or not a package is a port or an artifact. That information could be encoded in the namespace field or as a property; I'm leaning towards a property for now.
I'm going to make those changes soon, and want this comment to serve as a record of the thought process behind the decision.
After discussing it some with the vcpkg team, I don't think we necessarily need to distinguish ports from artifacts at this time, so I've made the changes mentioned above, but have not added a package_type qualifier yet (and it seems that I likely will not).
Alright. I'm moving this out of draft status. I'm looking for reviews from stakeholders in addition to maintainers of the purl-spec repository. Pinging @dan-shaw, @ras0219-msft, @jhutchings1, @adriandiglio.
I think the purl spec is somewhat unclear on what guarantees users are to expect of PURLs, which makes it hard to make a judgement call on whether this represents vcpkg effectively. A couple of questions:
- Is it expected that a PURL could in principle be given to a package manager, and attempt to produce the same package?
- If two different PURLs can refer to the same content, is that OK?
- If the same PURL can refer to totally different content, is that OK?
Examples like conan already here seem to say that different URIs to the same content as well as the same URI potentially identifying different content are OK but that seems to make PURLs almost meaningless.
Given a PURL, what is a user expected to be able to do with it? I look at what SPDX and what GitHub dependencies and similar want, and they want such vastly different things and want different guarantees on what that means.
Potential examples. It isn't clear to me which of these apply. I'm sure I missed some.
- [ ] Uniquely identify the exact content that is executed by an end user at a given time
- [ ] Uniquely identify the exact content that is executed by an end user for all time (e.g. cryptographic SHA)
- [ ] Uniquely identify the source code and/or package recipe that the package manager executes in order to produce something
- [ ] Partially identify the source code and/or package recipe that the package manager executes in order to produce something
- [ ] Identify something enough such that likely software vulnerability information would be applied
I think the purl spec is somewhat unclear on what guarantees users are to expect of PURLs, which makes it hard to make a judgement call on whether this represents vcpkg effectively.
PURLs should be deterministic. If there are things which could affect what dependency you resolve, they should be available as properties in the PURL schema for a type. The registry is a common example; most types allow you to provide a registry, but have a default option as well.
Reviewing your list, I believe every one of those is a goal. Sometimes purls will have maximum specificity (eg, a runtime might report that it ran a very specific piece of software), and other times, they'll have less (eg, a CVE may specify something like pkg:npm/[email protected] to specify a large range of affected products, at least if the version range spec is added #93 ).
- [x] Uniquely identify the exact content that is executed by an end user at a given time
- [x] Uniquely identify the exact content that is executed by an end user for all time (e.g. cryptographic SHA)
- [x] Uniquely identify the source code and/or package recipe that the package manager executes in order to produce something
- [x] Partially identify the source code and/or package recipe that the package manager executes in order to produce something
- [x] Identify something enough such that likely software vulnerability information would be applied
@jhutchings1 How do we reconcile that with many existing examples that fail most of these tests?
Examples:
pkg:conan/[email protected]
- [ ] Uniquely identify the exact content that is executed by an end user at a given time Nope, not built yet
- [ ] Uniquely identify the exact content that is executed by an end user for all time (e.g. cryptographic SHA) Nope, different registries can say what openssl means is totally different
- [ ] Uniquely identify the source code and/or package recipe that the package manager executes in order to produce something Nope, the recipe can change what happens depending on the machine it runs on
- [x] Partially identify the source code and/or package recipe that the package manager executes in order to produce something
- [x] Identify something enough such that likely software vulnerability information would be applied Asterisk: No way of identifying backports
pkg:cargo/[email protected]
- [ ] Uniquely identify the exact content that is executed by an end user at a given time Nope, not built yet
- [ ] Uniquely identify the exact content that is executed by an end user for all time (e.g. cryptographic SHA) Nope, not built yet
- [x] Uniquely identify the source code and/or package recipe that the package manager executes in order to produce something At least, I think ?
- [x] Partially identify the source code and/or package recipe that the package manager executes in order to produce something
- [x] Identify something enough such that likely software vulnerability information would be applied
pkg:nuget/[email protected]
- [ ] Uniquely identify the exact content that is executed by an end user at a given time Nope, depends on target configuration
- [ ] Uniquely identify the exact content that is executed by an end user for all time (e.g. cryptographic SHA)
- [ ] Uniquely identify the source code and/or package recipe that the package manager executes in order to produce something
- [x] Partially identify the source code and/or package recipe that the package manager executes in order to produce something
- [x] Identify something enough such that likely software vulnerability information would be applied
@jhutchings1 (To clarify, I'm trying to make sure vcpkg's support for this is consistent with the spec's design goals but the front matter seems to be missing these details and the examples seem to not be consistent with what design goals are listed there so I don't know how I feel about it)
Producers should provide as much information as they have, but most properties should be optional so that producers like CVE issuers can issue CVEs that target a broader set of packages with just one purl. There's not a one size fits all approach here, so design for flexibility.
but most properties should be optional so that producers like CVE issuers can issue CVEs that target a broader set of packages with just one purl. There's not a one size fits all approach here, so design for flexibility.
It sounds like the PURL spec should then be viewed as serving two separate purposes:
1. As a descriptor for a unique, specific "package"
2. As a query language over those descriptors, specifically for CVE matching
Solving (2) is much more complicated than identifying individual packages; there's a certain policy decision of applicability. For example, if I get [email protected] from a different registry, has a fix for CVE 100000 been backported to that variant? Is [email protected] expected to be the same project when it comes from different registries?
More realistically, does [email protected]?package_revision=1 suffer from all the same CVEs as [email protected] or was the entire point of the packaging update to apply patches to fix said CVEs? Are these expected to be tracked in the same way as 1.0.1 vs 1.0.0: every minor packaging revision is a totally unique source version which (upon initial minting) has no CVEs?
How does this currently work for PURLs into Linux distributions -- especially Debian Stable?
CVE matching seems like it is always messy. For PURL I think typically [email protected] is expected to always be [email protected], not zlib1g@1:1.2.11.dfsg-1+deb10u2 or 1:1.2.11.dfsg-2+deb11u2 depending on what version of Debian the software is installed on. CVEs usually are matched using CPEs like cpe:2.3:a:zlib:zlib:1.2.11:*:*:*:*:*:*:*, which works for things that aren't on a package registry or aren't even standalone software products, but even then a CVE scanner can't know that Debian Buster's zlib1g 1:1.2.11.dfsg-1+deb10u2 (pkg:deb/debian/zlib1g@1:1.2.11.dfsg-1+deb10u2?distro=buster) isn't vulnerable to CVE-2022-37434 without consulting Debian's vulnerability database to find that it was patched in that version by DLA-3103-1.
For PURLs into Debian, the spec says you should have a PURL like pkg:deb/debian/zlib1g@1:1.2.11.dfsg-1+deb10u2?distro=buster which refers to a specific file¹. From there, it looks like you need to translate pkg:deb/debian/zlib1g@1:1.2.11.dfsg-1+deb10u2?distro=buster into pkg:deb/debian/zlib@1:1.2.11.dfsg-1+deb10u2?arch=source&distro=buster (the source package has a different name), which is built from zlib 1.2.11. Then you can either just look up what Debian has in the Debian security tracker, or you can use the Debian security tracker data to map zlib to cpe:/a:gnu:zlib, which is a deprecated alias of cpe:2.3:a:zlib:zlib, and then search for vulnerabilities matching cpe:2.3:a:zlib:zlib:1.2.11:*:*:*:*:*:*:*. Searching for the CPE will return a list of vulnerabilities containing both vulnerabilities that have been patched and vulnerabilities that aren't even in the Debian security tracker data yet, so then you would need to overwrite the global CVE information about zlib:1.2.11 with the matching Debian CVE information about zlib:1.2.11.dfsg-1+deb10u2 to get the final list. It's complicated, but it's probably unavoidable when Debian is shipping multiple packages based on its own fork of zlib.
At least I'm pretty sure that's how it works for tools like Trivy and debscan.
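The last step described above (overlaying distribution-specific status on top of the global CPE results) can be sketched roughly as follows. This is only an illustration: the function name, the dictionary layout, and the placeholder identifier CVE-EXAMPLE-0001 are assumptions made up for the sketch, and a real tool would query the NVD and the Debian security tracker rather than hand-written tables.

```python
# Hypothetical, hand-written stand-ins for the two data sources.
# CVE-EXAMPLE-0001 is a placeholder, not a real CVE identifier.
GLOBAL_CVES_BY_CPE = {
    "cpe:2.3:a:zlib:zlib:1.2.11:*:*:*:*:*:*:*": {
        "CVE-2022-37434",
        "CVE-EXAMPLE-0001",
    },
}

# Keyed by (Debian source package, Debian version); values map CVE -> status.
DEBIAN_STATUS = {
    ("zlib", "1:1.2.11.dfsg-1+deb10u2"): {"CVE-2022-37434": "fixed"},
}


def effective_cves(cpe, source_package, debian_version):
    """Start from the global CPE matches, then drop anything the
    distribution data marks as fixed for this exact package version."""
    upstream = GLOBAL_CVES_BY_CPE.get(cpe, set())
    overrides = DEBIAN_STATUS.get((source_package, debian_version), {})
    return {cve for cve in upstream if overrides.get(cve) != "fixed"}


remaining = effective_cves(
    "cpe:2.3:a:zlib:zlib:1.2.11:*:*:*:*:*:*:*",
    "zlib",
    "1:1.2.11.dfsg-1+deb10u2",
)
print(sorted(remaining))
```

The point of the sketch is only that the distribution's data has to win over the upstream CPE match; without it, the patched package would still be reported as vulnerable.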
For software library packages being incorporated into a product via bundling or static linking, it's much simpler because the packages are (usually) specific, immutable files in specific repositories, so the question of whether that package is vulnerable or not depends on only the package, of which there is only a single instance, published by the package author (ie CVE-2022-37434 is resolved by upgrading to pkg:cargo/[email protected] which contains zlib 1.2.13, not by making a custom 1.1.11 that uses a patched zlib 1.2.12).
¹ I'm not sure this is useful. Does Debian keep every version of every package forever? I know for Alpine this is not the case, so pkg:apk/ is only going to be useful for describing what you have, not what you want.
This discussion seems to be a bit stale, but I think it is still very relevant, because at the moment vcpkg does not use any CPE or Package URL in the SBOMs it produces. That prevents it from being used easily for automated analysis.
So I would like to revive it and add my 2 cents:
- Specificity is good, but should not prevent us from starting with a minimal purl first. Vcpkg has the notion of overlay ports, overlay triplets, and many more build parameters. But there is a high probability that when a vulnerability is in the original vcpkg registry port, it will also be there in the overlay port and in multiple build situations. So I think a purl like pkg:vcpkg/[email protected] is best for security purposes. This can match against most vulnerabilities found in the upstream libraries.
- To be one step more specific, the port file revision, tracking changes in the packaging files but not in the upstream library, should be supplied as well. Note that it is questionable if the packaging has much influence on security, so this is more to reliably identify the package for other purposes. For examples of port file revisions, see e.g. https://vcpkg.link/ports/zlib/versions in which the port file revision is the last digit in e.g. v1.2.1.2#2. It can't be added with # because that is against the purl standard. It could be done with a different separator like zlib@2:1.0 for port file revision 2 of zlib 1.0, but the best way to ensure multiple port file revisions can match vulnerabilities in the upstream library easily is to add it as a qualifier, i.e. [email protected]?port_revision=2. Adding the registry revision (e.g. 143bc76cc7 that is specified as subtree revision for https://vcpkg.link/ports/zlib/v/1.2.12/2) seems to be counterproductive because many different revisions will still have the same port file revision of zlib.
- Vcpkg has the option to use another registry or other registries than the default https://github.com/microsoft/vcpkg. However, I think this feature is not used much at the moment, except when using local filesystem overlay ports. I agree that adding it as a qualifier, e.g. [email protected]?repository_url=file:///home/user/project/port-overlays/zlib, makes most sense.
- Other parameters such as specifying the triplet and all other build parameters etc. can be done. For example, Conan has a documented example pkg:conan/openssl.org/[email protected]?arch=x86_64&build_type=Debug&compiler=Visual%20Studio&compiler.runtime=MDd&compiler.version=16&os=Windows&shared=True&rrev=93a82349c31917d2d674d22065c7a9ef9f380c8e&prev=b429db8a0e324114c25ec387bfd8281f330d7c5c.
To sum up, I would argue for a simple purl like pkg:vcpkg/[email protected] for now with the qualifiers port_revision and repository_url, only to be added if the repository_url is not the default https://github.com/microsoft/vcpkg and the port_revision is not the default 0.
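A minimal sketch of that proposal, assuming the qualifier names port_revision and repository_url from this comment and the default registry URL given above. The helper name vcpkg_purl and the example package and version are made up for illustration, and pkg:vcpkg is only the type proposed in this PR, not an accepted purl type.

```python
from urllib.parse import quote

# Default registry as named in the proposal above.
DEFAULT_REGISTRY = "https://github.com/microsoft/vcpkg"


def vcpkg_purl(name, version, port_revision=0, repository_url=DEFAULT_REGISTRY):
    """Build a purl string for the proposed (not yet accepted) vcpkg type.

    Qualifiers are emitted only when they differ from their defaults.
    """
    qualifiers = {}
    if port_revision != 0:
        qualifiers["port_revision"] = str(port_revision)
    if repository_url != DEFAULT_REGISTRY:
        # Percent-encode the URL so it is safe inside the qualifier string.
        qualifiers["repository_url"] = quote(repository_url, safe="")

    purl = f"pkg:vcpkg/{quote(name, safe='')}@{quote(version, safe='')}"
    if qualifiers:
        # Sort qualifier keys so the same inputs always give the same string.
        purl += "?" + "&".join(f"{k}={v}" for k, v in sorted(qualifiers.items()))
    return purl


print(vcpkg_purl("zlib", "1.2.13"))
print(vcpkg_purl("zlib", "1.2.13", port_revision=2))
print(vcpkg_purl("zlib", "1.2.13", repository_url="file:///home/user/overlays/zlib"))
```

Omitting default-valued qualifiers keeps the common case down to name and version, which is the form most likely to match upstream vulnerability data.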
@pombredanne - What steps still remain in order to merge this spec for a vcpkg PURL type?
@pombredanne @jkowalleck @BillyONeal @stevespringett @jhutchings1 @matt-phylum This PR has been open in an unchanged state for about 2 months already. As not having purls for vcpkg is a security risk, could you please contribute to reviewing and merging this PR?
Update: Fixed some potentially confusing wording.
I would like to add a few thoughts to the comment by @matt-phylum.
For certain communities like Debian, there is a designated security team and process that ultimately allows you to link a specific PURL to vulnerabilities. Correct me if I am wrong, but I do not see this heavyweight approach for the vcpkg community in the foreseeable future. What you can expect to happen then is that aggregators like the Github Advisory Database try to build a somewhat heuristic mapping from a vcpkg PURL to CPEs. This mapping is difficult and messy enough without details such as port revisions. Hence in practice, for vulnerability searches at least, I would expect that only the package name and version are relevant.
Furthermore, similar to Matt's comment, I am not sure how well-suited the PURL concept is for non-binary packages in the first place. With, say Newtonsoft.Json version x.y.z, the situation is simple: There is just one package, it has a specific content, and you do not care how it was built, because you only ever consume the build artifact (neglecting nasty edge cases like build extensions). For a vcpkg port like libcurl or ffmpeg, this concept is questionable; depending on the chosen features, you have completely different software. You can partially overcome this problem by adding (lots of) qualifiers. My only problem here is: What for?
- The PURL spec itself is not very specific what a PURL is good for. In particular, what you need a unique identifier including the detailed build configuration for.
- Relevant regulation that I have seen always refers to PURLs as identifiers into vulnerability databases. As mentioned above, too much detail is probably a liability, not a feature.
- It might be interesting to note that a CPE is actually not an identifier, but a set description language. Which makes sense then: A vulnerability may affect a specific piece of software (small set), while for a vulnerability search, you may query all vulnerabilities in a larger set (e.g., for build variant "*"). PURLs are only identifiers, although it should not be hard to build a set query language on top of them.
I think my personal recommendation from these thoughts would be not to spend too much effort on trying to construct a good unique identifier at the current state of affairs. Instead, reserve the ability to extend the PURL specification, and focus on those attributes that are likely to be relevant.
For certain communities like Debian, there is a designated security team and process that ultimately allows you to link a specific PURL to vulnerabilities. Correct me if I am wrong, but I do not see this heavyweight approach for the vcpkg community in the foreseeable future. What you can expect to happen then is that aggregators like the Github Advisory Database try to build a somewhat heuristic mapping from a vcpkg PURL to CPEs. This mapping is difficult and messy enough without details such as port revisions.
Is PURL entirely about linking to likely vulnerabilities? The stated goals of the project are about uniquely identifying particular components, and the concerns raised here are about whether vcpkg can actually meet those requirements with any given single string when so much is outside of our direct control.
Hence in practice, for vulnerability searches at least, I would expect that only the package name and version are relevant.
But the same package name and version from different registries or built in different environments are entirely different packages. If someone wants to answer questions like "was this package built with /guard:cf" they might expect to be able to do that by looking at a PURL.
With, say Newtonsoft.Json version x.y.z, the situation is simple: There is just one package, it has a specific content, and you do not care how it was built, because you only ever consume the build artifact (neglecting nasty edge cases like build extensions).
Even e.g. NuGet's spec here doesn't meet the PURL spec's goals because it doesn't identify the package's source. I can make up my own 'Newtonsoft.Json' .nupkg x.y.z that actually contains something entirely different.
That existing less problematic providers like NuGet aren't in keeping with the documented goals of PURL suggests that those aren't the actual goals...
Relevant regulation that I have seen always refers to PURLs as identifiers into vulnerability databases. As mentioned above, too much detail is probably a liability, not a feature.
If 'likely key for vulnerability matching' is the actual goal here, many of the concerns go away. Features probably still matter as one could easily imagine 'vulnerability in the ffmpeg[some-file-format] feature.' For example, there are often vulns in codecs, and ffmpeg can be built bundled with every codec under the sun, but 99% of real customers care about a relatively small number of formats.
What do you mean 'regulation'?
I think my personal recommendation from these thoughts would be not to spend too much effort on trying to construct a good unique identifier at the current state of affairs. Instead, reserve the ability to extend the PURL specification, and focus on those attributes that are likely to be relevant.
Can we get a statement from PURL's actual maintainers that that is, in fact, the goal? I understand that vulnerability databases might want something that looks like PURL to uniquely identify customers who might be affected by a vulnerability, but that doesn't mean that PURL's maintainers believe that to be its purpose.
a CPE is actually not an identifier, but a set description language
CPE?
@BillyONeal: I second the point that I am not quite sure about the specific purpose of PURLs. In particular, the goal of specifying the exact recipe under which a package is built seems ambitious. And I do have trouble finding a good use case for that where you do not have access to the vcpkg install folder (and where vcpkg could dump this information with less rigid structure if requested).
If you drop this goal or consider it optional, a unique identification could be provided by "hashing the package".
Regulation == "state of the art security practices" that appears in, say, the Cyber Resilience Act or a similar executive order with a long name.
Regarding CPEs, these are the "identifiers" used by the National Vulnerability Database. And while not all is great, their basic principle looks sound: A CPE is not an identifier, but a set description language. For example, you could choose the vendor:product:version triple part as "python:python:3.11.9" to denote a specific instance (or rather subset) of Python, or use "python:python:*" to denote all versions of Python. That allows you to associate data, for example CVEs or licenses, with an arbitrarily precise set of software. However, doing this well is not easy, and while PURLs seem to head a bit into this direction (e.g., version ranges), they are currently not set up for such an approach.
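To make the "set description" idea concrete, here is a deliberately simplified sketch that treats "*" in a vendor:product:version triple as "any value". The function name is made up for illustration, and this is not the real CPE name-matching algorithm, which covers more fields and more wildcard forms.

```python
def cpe_triple_matches(pattern, candidate):
    """Return True if a concrete vendor:product:version triple falls
    inside the set described by the pattern ("*" matches any value)."""
    pattern_parts = pattern.split(":")
    candidate_parts = candidate.split(":")
    if len(pattern_parts) != 3 or len(candidate_parts) != 3:
        raise ValueError("expected vendor:product:version")
    return all(p == "*" or p == c for p, c in zip(pattern_parts, candidate_parts))


# A pattern with a wildcard describes a set of versions ...
print(cpe_triple_matches("python:python:*", "python:python:3.11.9"))
# ... while a fully specified pattern describes a single element.
print(cpe_triple_matches("python:python:3.11.9", "python:python:3.12.0"))
```

A purl without qualifiers can be read in a similar spirit, as standing for every build variant of a name and version, but that reading is an interpretation by the consumer rather than something the identifier itself expresses.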
Just throwing in my $0.02... the purpose of this PURL is 100% to aid in matching vcpkg packages to vulnerabilities. vcpkg already provides all the specifics in different formats covering everything you may want to know about the package, except the vulnerabilities associated with it. I don't know why this was overlooked, but anyways this is what people are waiting for. Product security teams triage findings anyways, it doesn't have to be perfect, but please do make the basic functionality work.
@BillyONeal: I second the point that I am not quite sure about the specific purpose of PURLs. In particular, the goal of specifying the exact recipe under which a package is built seems ambitious. And I do have trouble finding a good use case for that where you do not have access to the vcpkg install folder (and where vcpkg could dump this information with less rigid structure if requested).
To clarify, the requirement to identify a particular package comes from the README, not from me:
https://github.com/package-url/purl-spec/blob/8040ff0be50f0c5b1986b1a0947bd539f5405fc4/README.rst?plain=1#L62-L63
and the litmus tests I described above in September 2023 are attempting to explore the contours of what the README says. I'm nervous as a vcpkg maintainer to claim that we adopt this spec in a manner which seems to contradict the spec's stated purpose.
If you drop this goal or consider it optional, a unique identification could be provided by "hashing the package".
To be clear, I believe that goal is the goal of PURL. I'm trying to reconcile what I understand PURL's goal to be with how our system works.
Regulation == "state of the art security practices" that appears in, say, the Cyber Resilience Act or a similar executive order with a long name.
I don't see anything about PURL in that Act, so unfortunately it isn't very helpful in determining our compliance, or lack thereof, with the Act.
The closest I've seen like this in e.g. the Biden CyberEO or that Act is 'there will be SBOMs' but no requirements whatsoever as to what the SBOM actually contains. I know we emit SPDX SBOMs but don't have a PURL; I'm not sure what the impact of that is.
Regarding CPEs, these are the "identifiers" used by the National Vulnerability Database. And while not all is great, their basic principle looks sound: A CPE is not an identifier, but a set description language. For example, you could choose the vendor:product:version triple part as "python:python:3.11.9" to denote a specific instance (or rather subset) of Python, or use "python:python:*" to denote all versions of Python. That allows you to associate data, for example CVEs or licenses, with an arbitrarily precise set of software. However, doing this well is not easy, and while PURLs seem to head a bit into this direction (e.g., version ranges), they are currently not set up for such an approach.
If PURL is not set up for that approach, it isn't clear to me how vcpkg adopting PURL achieves that stated goal?
Just throwing in my $0.02... the purpose of this PURL is 100% to aid in matching vcpkg packages to vulnerabilities. vcpkg already provides all the specifics in different formats covering everything you may want to know about the package, except the vulnerabilities associated with it. I don't know why this was overlooked, but anyways this is what people are waiting for. Product security teams triage findings anyways, it doesn't have to be perfect, but please do make the basic functionality work.
It seems unacceptable that the same PURL can identify totally different packages in different contexts but that's exactly the situation they would get with anything close to that proposed here, as registries and overlays by design let users entirely replace what the meaning of a given port-name is.
It is likely that a given overlay of a given name claiming to be a given version is at least quasi-interchangeable with a non-overlayed one or with one from another registry, in a way which might be sufficient or useful for vulnerability matching. But the requirement in the spec is "reliably reference the same software package", and cramming everything that goes into how particular bits are chosen to be reliable is hard.
Of course, it isn't like I'm a maintainer of this repo. If PURL's owners want to add a thing, I don't think we're strenuously objecting or anything like that. It's just that, speaking personally, I don't want to endorse something that makes it look like vcpkg is claiming something it doesn't actually deliver.
Just throwing in my $0.02... the purpose of this PURL is 100% to aid in matching vcpkg packages to vulnerabilities. vcpkg already provides all the specifics in different formats covering everything you may want to know about the package, except the vulnerabilities associated with it. I don't know why this was overlooked, but anyways this is what people are waiting for. Product security teams triage findings anyways, it doesn't have to be perfect, but please do make the basic functionality work.
It seems unacceptable that the same PURL can identify totally different packages in different contexts but that's exactly the situation they would get with anything close to that proposed here, as registries and overlays by design let users entirely replace what the meaning of a given port-name is.
From a vcpkg perspective, what we need is a way to specify a package and a port. Using openssl as an example, the latest listing in vcpkg is: 3.4.1#0 from which a suitable but fake PURL would be:
pkg:vcpkg/[email protected]?port=3.4.1#0
IMO we are looking for a way to trace the package to its origin in vcpkg. Linking to this version is all that is needed, and the ability to then retrieve two things: 1. dependencies from the vcpkg site, and 2. CPEs to match with the NVD.
Conan lists the PURL like this: pkg:conan/[email protected]?repository_url=https%3A%2F%2Fcenter2.conan.io
It is likely that a given overlay of a given name claiming to be a given version is at least quasi-interchangeable with a non-overlayed one or with one from another registry, in a way which might be sufficient or useful for vulnerability matching. But the requirement in the spec is "reliably reference the same software package", and cramming everything that goes into how particular bits are chosen to be reliable is hard.
The PURL is helpful in that it lists the package manager, so in this example: pkg:vcpkg/[email protected]?port=3.4.1#0 it effectively references the source repo as vcpkg, the version, 3.4.1, and the port 3.4.1#0, so someone will reliably reference the same package.
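One syntactic caveat about a string of this shape: the purl format reserves # to introduce the subpath component, so a "#0" at the end would not stay inside the port qualifier. A minimal sketch using Python's standard urllib.parse shows how a generic URL parser splits it; the package name and version in the example string are illustrative assumptions.

```python
from urllib.parse import urlsplit


def split_purl_like(text):
    """Split a purl-shaped string with a generic URL parser to show
    where the '?' and '#' delimiters cut it."""
    parts = urlsplit(text)
    return {
        "scheme": parts.scheme,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# Illustrative string; the trailing "#0" is meant to be part of the port value.
result = split_purl_like("pkg:vcpkg/openssl@3.4.1?port=3.4.1#0")
for key, value in result.items():
    print(f"{key}: {value}")
```

The "0" ends up in the fragment position rather than in the qualifier value, which is why a separate qualifier such as port_revision, or percent-encoding the # as %23, would be needed to carry the port revision.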
@BillyONeal The way a PURL sneaks into the security acts is a bit roundabout:
- Act / Presidential order establishes that an SBOM must be created
- Standardization committee creates a concept for the SBOM. A Purl is part of the MUST or top of the SHOULD clauses. Example NTIA: https://www.ntia.gov/report/2021/minimum-elements-software-bill-materials-sbom Example BSI: https://www.bsi.bund.de/SharedDocs/Downloads/EN/BSI/Publications/TechGuidelines/TR03183/BSI-TR-03183-2-2_0_0.pdf?__blob=publicationFile&v=3
- Customers require, e.g. in a tender, that you supply an SBOM according to the concept.
- Vendors react, concept becomes de-facto state of the art.
From a vcpkg perspective, what we need is a way to specify a package and a port. Using openssl as an example, the latest listing in vcpkg is: 3.4.1#0 from which a suitable but fake PURL would be:
pkg:vcpkg/[email protected]?port=3.4.1#0
The problem is that if the user did vcpkg install --overlay-ports=some/directory/openssl, or in their manifest said {"overlay-ports": "some/directory"} then that "openssl" has no relationship whatsoever to anything maintained in our registry. It is just whatever some/directory/openssl/portfile.cmake said to do at the time.
This is why the SBOM we currently generate has the SHAs of all the files in the port rather than trying to describe some canonical 'registry' structure.
Is it likely that the person who made that overlay-port did so by copy-pasta-ing the one from our registry and changing something, and therefore there is likely to be some relationship between that and OpenSSL's sources? Sure. But can we assure that for purposes of the kind of audits described under these EOs and stuff? No. The user could put whatever they want there.
The PURL is helpful in that it lists the package manager, so in this example: pkg:vcpkg/[email protected]?port=3.4.1#0 it effectively references the source repo as vcpkg, the version, 3.4.1, and the port 3.4.1#0, so someone will reliably reference the same package.
It does not do that. That is, in fact, the problem I'm talking about here.
Sorry, but it seems to me you are in search of perfection, which doesn't exist. What we need is progress.
IMO we don't need vcpkg to cover every case. Conan doesn't cover every case. NPM doesn't, NuGet doesn't. No package manager does. We need vcpkg to support PURLs to automate vulnerability tracking.
What I am saying is, the spec request in the README here says that PURLs are expected to provide that level of perfection.
https://github.com/package-url/purl-spec/blob/8040ff0be50f0c5b1986b1a0947bd539f5405fc4/README.rst?plain=1#L62-L63
And I don't see obvious ways to provide that given the way we operate.
If some standard less than that level of perfection is acceptable, then sure, this might be fine, but I can't really speak to what is acceptable as a vcpkg maintainer in that position. That is more about the PURL spec itself and what it intends to provide to its users.
If someone who actually owns the PURL spec says "vcpkg, go emit this thing in your SBOMs" we can absolutely do that. I just can't endorse that we are meeting the claims in the README here.
==============================
As an aside, I would argue that that level of perfection is in fact what these vulnerability databases depend on. They want to be able to say "yes, we know you have this vulnerability that exists against you because you installed that package." It seems like people would be extremely angry if they get an SBOM claiming that someone installed "OpenSSL 3.0.x" when in fact we installed something from an overlay-port that built sources for OpenSSL 1.0.0a.
It seems like people would be extremely angry if they get an SBOM claiming that someone installed "OpenSSL 3.0.x" when in fact we installed something from an overlay-port that built sources for OpenSSL 1.0.0a.
Or replaced it with an empty overlay pointing either to the system library or an alternative like boringssl. I think PURL really only works for more or less closed off ecosystems.
What I am saying is, the spec request in the README here says that PURLs are expected to provide that level of perfection.
https://github.com/package-url/purl-spec/blob/8040ff0be50f0c5b1986b1a0947bd539f5405fc4/README.rst?plain=1#L62-L63
And I don't see obvious ways to provide that given the way we operate.
That's the issue. I think part of the problem may be that vcpkg maintainers (of which I understand you are one) haven't addressed where it would go. It's a bit of a chicken and egg, and really they both need to be considered together. So referencing vcpkg explicitly, it would make the most sense if the PURL ultimately ended up in the vcpkg.spdx.json output file. But this file must be generated during compilation, so there needs to be a place to maintain it as an input. Then we are back to the discussion above about the vcpkg.json file that is a part of the port (as well as likely a part of any overlay ports as well, from https://learn.microsoft.com/en-us/vcpkg/concepts/overlay-ports). I have never seen a CONTROL file so I can't say how they are used.
If some standard less than that level of perfection is acceptable, then sure, this might be fine, but I can't really speak to what is acceptable as a vcpkg maintainer in that position. That is more about the PURL spec itself and what it intends to provide to its users.
The full section you quoted earlier doesn't say perfection:
`Solution`
========
A `purl` or package URL is an attempt to standardize existing approaches to
reliably identify and locate software packages.
A `purl` is a URL string used to identify and locate a software package in a
mostly universal and uniform way across programming languages, package managers,
packaging conventions, tools, APIs and databases.
Such a package URL is useful to reliably reference the same software package
using a simple and expressive syntax and conventions based on familiar URLs.
An attempt to standardize ... and locate a software package in a mostly universal and uniform way across programming languages, package managers... to reliably reference the same software package.
And the readme referenced is just a readme, the https://github.com/package-url/purl-spec/blob/main/PURL-SPECIFICATION.rst and https://github.com/package-url/purl-spec/blob/main/PURL-TYPES.rst are really more definitive.
If someone who actually owns the PURL spec says "vcpkg, go emit this thing in your SBOMs" we can absolutely do that. I just can't endorse that we are meeting the claims in the README here.
Please refer to Conan in PURL-TYPES.rst as it sets the convention for C++ package managers. As that was already accepted and incorporated, I think it's fair to say that sets the de facto standard for the level of perfection expected from vcpkg.
==============================
As an aside, I would argue that that level of perfection is in fact what these vulnerability databases depend on. They want to be able to say "yes, we know you have this vulnerability that exists against you because you installed that package." It seems like people would be extremely angry if they get an SBOM claiming that someone installed "OpenSSL 3.0.x" when in fact we installed something from an overlay-port that built sources for OpenSSL 1.0.0a.
Today, the alternative is manually searching on NVD or other sources using keywords to search for vulnerabilities, so the PURLs will be infinitely better.
To get a usable SBOM for my C++ vcpkg components I have to assemble it by hand (either from scratch or merging multiple SPDX files that vcpkg provides), but SBOMs are by their nature meant to be automated.
Regarding the transparency benefits of SBOMs, anyone can provide false or misleading data and you are 100% correct that neither SBOMs nor PURLs will address that. Even with increased transparency there still needs to be some trust. However, trust increases when something can be automated. And SBOMs, IMO, are really an effort to make sure there is transparency about package sources for the intended purpose of tracking vulnerabilities.
That's the issue. I think part of the problem may be that vcpkg maintainers (of which I understand you are one) haven't addressed where it would go. It's a bit of a chicken and egg, and really they both need to be considered together. So referencing vcpkg explicitly, it would make the most sense if the PURL ultimately ended up in the vcpkg.spdx.json output file.
I don't really see a chicken and egg problem here. If there's a place in SPDX for PURL to go, and there's agreement here as to the form of a PURL, we would put that there.
But this file must be generated during compilation, so there needs to be a place to maintain it as an input.
What do you mean as an input? If we're generating it, that sounds like an output to me.
Then we are back to the discussion above about the vcpkg.json file that is a part of the port (as well as likely a part of any overlay ports as well, from https://learn.microsoft.com/en-us/vcpkg/concepts/overlay-ports). I have never seen a CONTROL file so I can't say how they are used.
CONTROL is just the old file format for vcpkg.json:
Name: abc
Version: vista
Description: The abc library does the blah blah.
is the same as
{
"name": "abc",
"version-string": "vista",
"description": "The abc library does the blah blah."
}
Ports and overlay-ports are the same thing; if an overlay is configured those names just become what that name means. Consider the following:
security-library/vcpkg.json
{
"name": "security-library",
"version": "1.0.0",
"dependencies": [
{
"name": "openssl",
"version>=": "1.1.1n"
}
]
}
security-library/portfile.cmake has instructions to download sources for and build security-library.
my-fancy-overlays-directory/vcpkg.json
{
"name": "openssl",
"version-string": "1.0.0a"
}
my-fancy-overlays-directory/portfile.cmake
message(STATUS "This port does absolutely nothing, hope your system has an openssl in /usr already 🙃")
and a vcpkg instance at https://github.com/microsoft/vcpkg/commit/23b33f5a010e3d67132fa3c34ab6cd0009bb9296
A classic mode install like vcpkg install security-library would (in a roundabout way) invoke https://github.com/microsoft/vcpkg/blob/23b33f5a010e3d67132fa3c34ab6cd0009bb9296/ports/openssl/portfile.cmake thus downloading the sources for openssl 3.4.1 before attempting to install security-library.
As proposed here, I believe that would result in pkg:vcpkg/openssl@3.4.1?port_revision=0&repository_url=https://github.com/microsoft/vcpkg&repository_revision=23b33f5a010e3d67132fa3c34ab6cd0009bb9296. Maybe it would be more accurate to refer to the git_tree rather than the commit SHA, but this will work. This is probably acceptable / close enough for some of the NVD use cases described here.
A classic mode install like vcpkg install security-library --overlay-ports my-fancy-overlays-directory would not name anything in vcpkg's registry at all, and would only invoke my-fancy-overlays-directory/portfile.cmake to "install" OpenSSL. There is nothing security-library can do to ask for something else, it does not matter that it asked for "version>=": "1.1.1n" despite the overlay claiming to be (though not actually installing) 1.0.0a. Overlays are an absolute "this name means that thing, I will take no questions or comments about it, versioning or attempting to find that name elsewhere be damned". Overlays do not exist in any form of registry structure. They're 'whatever happened to be on the build machine in that directory at the time'.
As proposed here, I believe that would result in pkg:vcpkg/openssl@1.0.0a?port_revision=0. Is that acceptable given that we didn't actually install anything? I'm not bringing this up as a pathological case; overlaying something with "empty" for dependencies one wants to get from one's system package manager rather than from us is fairly standard procedure (though in such cases it would be more common to declare such an empty overlay with "version-string": "the system" or something like that).
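To make the two cases above concrete, here is a minimal sketch of how such strings could be assembled. The helper name `vcpkg_purl` is hypothetical and is not part of vcpkg or any purl library; the qualifier names are the ones discussed in this thread, and the encoding is deliberately loose so the output matches the strings as written here rather than a canonical purl form.

```python
from urllib.parse import quote


def vcpkg_purl(name, version, port_revision=0,
               repository_url=None, repository_revision=None):
    """Assemble a vcpkg purl string as discussed in this thread.

    Hypothetical helper: qualifiers are emitted in insertion order and
    ':' and '/' are left unescaped so the result matches the examples
    written in the discussion, not necessarily a canonical purl.
    """
    qualifiers = {"port_revision": str(port_revision)}
    if repository_url is not None:
        qualifiers["repository_url"] = repository_url
    if repository_revision is not None:
        qualifiers["repository_revision"] = repository_revision
    query = "&".join(f"{key}={quote(value, safe=':/')}"
                     for key, value in qualifiers.items())
    return f"pkg:vcpkg/{quote(name)}@{quote(version)}?{query}"


# Registry case: the port comes from a known git registry at a known commit.
registry_purl = vcpkg_purl(
    "openssl", "3.4.1", 0,
    repository_url="https://github.com/microsoft/vcpkg",
    repository_revision="23b33f5a010e3d67132fa3c34ab6cd0009bb9296",
)

# Overlay case: only the name and version that the overlay manifest claims
# are known, regardless of what the overlay's portfile actually installs.
overlay_purl = vcpkg_purl("openssl", "1.0.0a")

print(registry_purl)
print(overlay_purl)
```

Note that the overlay string is built purely from the manifest's claims, which is exactly the concern raised above: nothing in it reflects that the portfile installed nothing.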
An attempt to standardize ... and locate a software package in a mostly universal and uniform way across programming languages, package managers... to reliably reference the same software package.
To me, 'reliably' is what asks for perfection here, and 'attempt' and 'mostly' refer to the fact that the fields that exist in a PURL won't be universal, given that different package management systems use different language to describe the relationships contemplated here.
It seems clear based on feedback here that less than that level of reliability is acceptable, but I'm not comfortable "guessing" at what is acceptable. This is why I tried to explore what is really intended with the litmus tests I proposed on 2023-09-05: https://github.com/package-url/purl-spec/pull/245#issuecomment-1707421218
And the readme referenced is just a readme, the https://github.com/package-url/purl-spec/blob/main/PURL-SPECIFICATION.rst and https://github.com/package-url/purl-spec/blob/main/PURL-TYPES.rst are really more definitive.
Unfortunately, both of these discuss only the 'form' of PURLs, not the requirements expected of them.
Please refer to Conan in PURL-TYPES.rst as it sets the convention for C++ package managers. As that was already accepted and incorporated, I think it's fair to say that sets the de facto standard for the level of perfection expected from vcpkg.
I somewhat disagree. While vcpkg and conan both speak to C++ customers, the mechanism we use as 'sources of truth' of what a given package name means differs significantly. Even if it didn't, "the previous engineer signed off on the plans for this bridge" is not grounds to sign off on the plans for the bridge.
Regarding the transparency benefits of SBOMs, anyone can provide false or misleading data and you are 100% correct that neither SBOMs nor PURLs will address that. Even with increased transparency there still needs to be some trust. However, trust increases when something can be automated. And SBOMs, IMO, are really an effort to make sure there is transparency about package sources for the intended purpose of tracking vulnerabilities.
My concern is that I don't know how, within the bounds of how vcpkg works and the constraint that this thing has to look like a URL at the end, to generate something that provides a meaningful level of transparency. Maybe if certain qualifiers are added like "we know the port came from a git registry" we can at least describe where the build script came from, but that same build script run on two different computers produces two different packages and it would be quite easy for one to be vulnerable and one to not be, for example if those two machines had compilers with different mitigations.
Rather than continuing to argue about what reading means what in various discussions here, maybe a more productive path forward would be for one of the PURL maintainers to join the discussion while we, the vcpkg maintainers, describe the contours of what works and what does not work given the constraints of what can fit into a URL and what our system 'knows'. If they want to make the call that that's acceptable, then great! I just feel like endorsing anything when I know reasonable and concrete ways it falls apart would be putting words in the PURL maintainers' mouths, and I am unwilling to do that.
(I'm trying to be careful to describe my thoughts vs. 'the maintainers'' thoughts, I apologize for any confusion here. The vcpkg team/maintainers overall haven't really had much of a discussion about this)
@BillyONeal you have given some thoughtful responses, thank you.
I don't really see a chicken and egg problem here. If there's a place in SPDX for PURL to go, and there's agreement here as to the form of a PURL, we would put that there.
Forgive my ignorance as a security practitioner not a developer. My use case is to find executables in the build output directory, then scan the share folders (vcpkg_installed/xyz-windows/xyz-windows/share) and collect the spdx files to assemble the SBOM from. I can do this today but it doesn't help me with identifying vulnerabilities, which is why the PURL is important.
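The collection step described above could look roughly like the following sketch. It assumes each installed port leaves a `vcpkg.spdx.json` under a `share/<port>` directory somewhere below `vcpkg_installed`, and that each file has a top-level `packages` array with `name` and `versionInfo` fields as in SPDX 2.x JSON; the function name `collect_spdx_packages` is hypothetical.

```python
import json
from pathlib import Path


def collect_spdx_packages(installed_root):
    """Gather package entries from every vcpkg.spdx.json under installed_root.

    Hypothetical helper: it only reads the fields needed to list what was
    installed and does not merge the documents into a single SBOM.
    """
    found = []
    pattern = "**/share/*/vcpkg.spdx.json"
    for spdx_path in sorted(Path(installed_root).glob(pattern)):
        with spdx_path.open(encoding="utf-8") as handle:
            document = json.load(handle)
        for package in document.get("packages", []):
            found.append({
                # Directory name under share/, i.e. the port name on disk.
                "port": spdx_path.parent.name,
                "name": package.get("name"),
                "version": package.get("versionInfo"),
            })
    return found
```

Calling `collect_spdx_packages("vcpkg_installed")` returns one dictionary per SPDX package entry. Without a PURL in each entry, the result lists names and versions but gives nothing that can be matched against a vulnerability database automatically.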
I also expected the PURL to be in the output vcpkg.spdx.json. I am certainly oversimplifying, but given the above overlay examples, would this really create a folder vcpkg_installed/x86-windows/x86-windows/share/openssl (egg) with the openssl SPDX containing the PURL string (chicken) in it?
Excerpt from a theoretical vcpkg_installed/x86-windows/x86-windows/share/openssl/vcpkg.spdx.json I would expect to hopefully see someday:
"packages": [
{
"name": "openssl",
"SPDXID": "SPDXRef-port",
"versionInfo": "3.4.1",
"downloadLocation": ...
"homepage": "https://www.openssl.org",
"licenseConcluded": "Apache-2.0",
"licenseDeclared": "NOASSERTION",
"copyrightText": "NOASSERTION",
"description": "OpenSSL is an open source project...",
"comment": "This is the port (recipe) consumed by vcpkg.",
"packageUrl": "pkg:vcpkg/openssl@3.4.1?..."
}
If the executable isn't in the build output and the share folder and spdx are not in the vcpkg_installed folder, I don't think the overlay is a concern.
Again, thank you for the input and hearing me out.
I also expected the PURL to be in the output vcpkg.spdx.json. I am certainly oversimplifying, but given the above overlay examples, would this really create a folder vcpkg_installed/x86-windows/x86-windows/share/openssl (egg) with the openssl SPDX containing the PURL string (chicken) in it?
Those are both chickens or both eggs, as vcpkg writes both of them. We would just add the PURL somewhere around here: https://github.com/microsoft/vcpkg-tool/blob/c99f8ca02a1199abd269cfa04c9e79443cd8ee2e/src/vcpkg/spdx.cpp#L295
If the executable isn't in the build output and the share folder and spdx are not in the vcpkg_installed folder, I don't think the overlay is a concern.
What if it is in the vcpkg_installed folder? I showed an empty overlay here but a non-empty overlay is just as possible.
A classic mode install like vcpkg install security-library --overlay-ports my-fancy-overlays-directory would not name anything in vcpkg's registry at all, and would only invoke my-fancy-overlays-directory/portfile.cmake to "install" OpenSSL. There is nothing security-library can do to ask for something else, it does not matter that it asked for "version>=": "1.1.1n" despite the overlay claiming to be (though not actually installing) 1.0.0a. Overlays are an absolute "this name means that thing, I will take no questions or comments about it, versioning or attempting to find that name elsewhere be damned". Overlays do not exist in any form of registry structure. They're 'whatever happened to be on the build machine in that directory at the time'.
As proposed here, I believe that would result in pkg:vcpkg/openssl@1.0.0a?port_revision=0. Is that acceptable given that we didn't actually install anything?
The proposal here is that it results in pkg:vcpkg/openssl@1.0.0a?repository_url=file://my-fancy-overlays-directory&port_revision=0. This has the following properties:
- Basic matching of purls will still match openssl version 1.0.0a. I think that is a feature we want, because more often than not, this port overlay will have something to do with openssl version 1.0.0a.
- Exact matching of purls will say: "I don't know, we have nothing to claim for repositories starting with file://".
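The difference between the two kinds of matching can be sketched as follows. This is only an illustration under stated assumptions: `parse_purl`, `basic_match`, and `exact_match` are hypothetical names, the parser handles only the simple `pkg:type/name@version?qualifiers` shape used in this thread (no namespace or subpath), and treating "exact" as "same type, name, version, and qualifiers" is an assumption rather than something the purl spec defines.

```python
from urllib.parse import parse_qsl


def parse_purl(purl):
    """Split a simple purl into type, name, version, and qualifiers."""
    body, _, query = purl.partition("?")
    scheme_and_type, _, name_and_version = body.partition("/")
    name, _, version = name_and_version.partition("@")
    return {
        "type": scheme_and_type.split(":", 1)[1],
        "name": name,
        "version": version,
        "qualifiers": dict(parse_qsl(query)),
    }


def basic_match(observed, known):
    """Match on type, name, and version only, ignoring qualifiers."""
    a, b = parse_purl(observed), parse_purl(known)
    return all(a[key] == b[key] for key in ("type", "name", "version"))


def exact_match(observed, known):
    """Return True or False, or None when no claim can be made.

    A repository_url starting with file:// marks a local overlay, for
    which there is nothing to claim, so the answer is "I don't know".
    """
    a, b = parse_purl(observed), parse_purl(known)
    if a["qualifiers"].get("repository_url", "").startswith("file://"):
        return None
    return basic_match(observed, known) and a["qualifiers"] == b["qualifiers"]


overlay = ("pkg:vcpkg/openssl@1.0.0a"
           "?repository_url=file://my-fancy-overlays-directory&port_revision=0")
advisory = "pkg:vcpkg/openssl@1.0.0a?port_revision=0"

print(basic_match(overlay, advisory))  # True
print(exact_match(overlay, advisory))  # None
```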
@BillyONeal Does that address your concerns around overlays? Maybe the purl spec for vcpkg should say that the repository_url qualifier is required in case of port overlays and different repositories?