
Version range

Open iamwillbar opened this issue 4 years ago • 29 comments

Is there desire for PURL to support version ranges or is that out of scope? For example, to describe vulnerable versions of a package.

iamwillbar avatar Oct 16 '19 18:10 iamwillbar

If nothing else, capturing the various ecosystem version sorting is really key. Then you could at least use purl ids as upper and lower bounds and still evaluate what's in between.

brianf avatar Oct 16 '19 19:10 brianf

@iamwillbar there is a need alright to have a common way to express version ranges... but I wonder if this is possible, because there is no universal way to express versions. Semver comes close (but cannot handle some epochs or Debian "or" AFAIK)

@brianf There is no universal way to compare versions either (which is why RPM- and Debian-based distros had to adopt a concept of epoch) so documenting the way things are compared for each package type would be great as a start.
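To make the comparison problem concrete, here is a minimal sketch of three simplified orderings applied to version strings. The function names and the sample versions are invented for illustration, and none of these is a complete implementation of any real ecosystem's rules.

```python
# Illustrative only: three simplified ways to order version strings.

def naive_key(version: str) -> str:
    # Plain string comparison: "1.10" sorts before "1.9".
    return version

def dotted_int_key(version: str) -> tuple:
    # Numeric comparison of each dot-separated segment.
    return tuple(int(part) for part in version.split("."))

def epoch_key(version: str) -> tuple:
    # Debian/RPM-like "epoch:version"; a missing epoch counts as 0.
    epoch, rest = version.split(":", 1) if ":" in version else ("0", version)
    return (int(epoch), dotted_int_key(rest))

versions = ["1.9", "1.10", "1.2"]
by_string = sorted(versions, key=naive_key)
by_number = sorted(versions, key=dotted_int_key)
with_epoch = sorted(["2.0", "1:1.0", "1.5"], key=epoch_key)
```

The same strings end up in a different order depending on the key, and an epoch makes "1:1.0" sort after "2.0", which is why a range only means something once the comparison rule is fixed.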

@iamwillbar if you were to provide some unified specification for version ranges what would it look like? (leaving aside for now if this could be stuffed in a PURL or not)

pombredanne avatar Nov 25 '19 15:11 pombredanne

@pombredanne I suppose nothing is preventing purl users from specifying the versioning scheme in a qualifier, e.g. pkg:pypi/[email protected]?version_scheme=semver. Given a set of these, you could order them and determine vulnerable versions by comparing against a known fixed version. This however falls apart as soon as you have multiple supported version streams that are updated independently (as is the case for Django). There are a ton of "standards" on version range specifications:

  • https://maven.apache.org/enforcer/enforcer-rules/versionRanges.html
  • https://semver.npmjs.com/
  • https://www.python.org/dev/peps/pep-0440/#compatible-release

If we were to use the Python one, and use this security release as an example, the purls for the released versions could be:

Mandatory qualifiers are not a thing in the spec however, so this would solely depend on maintainers of said projects to use them.

mprpic avatar Sep 28 '20 18:09 mprpic

@mprpic

I suppose nothing is preventing purl users from specifying the versioning scheme in a qualifier, e.g. pkg:pypi/[email protected]?version_scheme=semver.

Sure thing nothing prevents this, and we could even make it part of the spec too. Yet as you mentioned this falls apart as soon as you have multiple supported version streams that are updated independently. If there were anything that would be practical, that would be to adopt the many semantics of how each package type handles versions constraints for dependencies which is often more complex than just a scheme.

That said, in the context of vulnerability reporting, is there really a need, value and correctness in using a version range? I started wondering about this when @sbs2001 mentioned it in https://gitter.im/aboutcode-org/vulnerablecode?at=5f70857f5a56b467a5f2a835

At a point in time I can state that a list of concrete and discrete versions (not a range but a list) are subject to a certain vulnerability, and that there is a list of concrete and discrete versions (not a range but a list) in which that vulnerability has been patched/resolved/fixed.

This could be a list of Package URLs or a list of versions, not ranges. Anything that uses a range or some wildcard is potentially incorrect, misleading, or both, which to me makes ranges both low-value and dangerous. And this is likely even more so when looking at distro packages such as RPM or Debian packages, which add patch numbers to the upstream version scheme along with a release/build number and/or epoch: the affected versions would rarely resolve to a proper range but would always be correct when using a list.
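The difference between a list and a range can be sketched as follows. The advisory data, the version numbers and the helper names are all hypothetical; the comparison is a simplified dotted-integer ordering.

```python
# Hypothetical advisory data, invented for illustration.

def key(version: str) -> tuple:
    # Simplified ordering: numeric comparison of dot-separated segments.
    return tuple(int(part) for part in version.split("."))

# Versions that were actually assessed when the advisory was published.
affected_list = {"1.0.0", "1.0.1", "1.1.0"}
# The same advisory written as ">= 1.0.0, < 2.0.0".
affected_range = ("1.0.0", "2.0.0")

def in_list(version: str) -> bool:
    return version in affected_list

def in_range(version: str) -> bool:
    low, high = affected_range
    return key(low) <= key(version) < key(high)

# A release published after the advisory, never assessed by anyone.
later_release = "1.2.0"
list_says = in_list(later_release)
range_says = in_range(later_release)
```

The list makes no statement about 1.2.0, while the range silently claims it is affected, whether or not that is true.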

Does this make some sense? What would be the benefits of a range?

pombredanne avatar Sep 28 '20 20:09 pombredanne

@mprpic this thread has a good argument from @copernico for the need of version ranges https://gitter.im/aboutcode-org/vulnerablecode?at=5f7231fa6e85e0058c5f4aaf

pombredanne avatar Sep 29 '20 06:09 pombredanne

FYI I compiled a list of many version specs here https://github.com/nexB/vulnerablecode/issues/140#issuecomment-707678291

  • Rubygems https://guides.rubygems.org/patterns/#semantic-versioning
  • node-semver as used for npms https://github.com/npm/node-semver#ranges
  • Python https://www.python.org/dev/peps/pep-0440/
  • Debian and Ubuntu https://www.debian.org/doc/debian-policy/ch-relationships.html
  • RPM distros https://rpm.org/user_doc/dependencies.html#versioning and https://fedoraproject.org/wiki/Archive:Tools/RPM/VersionComparison
  • Perl https://perlmaven.com/how-to-compare-version-numbers-in-perl-and-for-cpan-modules
  • of course NVD CPEs https://nvd.nist.gov/General/News/CPE-Range-Notification
  • Apache maven http://maven.apache.org/enforcer/enforcer-rules/versionRanges.html
  • NuGet https://docs.microsoft.com/en-us/nuget/concepts/package-versioning
  • Apache and Nuget following more or less math intervals https://en.wikipedia.org/wiki/Interval_(mathematics)
  • Gentoo https://wiki.gentoo.org/wiki/Version_specifier
  • Alpine linux https://gitlab.alpinelinux.org/alpine/apk-tools/-/blob/master/src/version.c (which might be using Gentoo conventions)
  • Go https://golang.org/ref/mod#versions which uses semver with some twists

And we are going to run some concrete experiment with a "universal" version range syntax in https://github.com/nexB/vulnerablecode/issues/140#issuecomment-712115527 and will report back here

pombredanne avatar Oct 19 '20 14:10 pombredanne

Now here are some aesthetic considerations:

A purl pkg:npm/foo with a (complex) version_range ~= 0.9, >= 1.0, != 1.3.4.*, < 2.0 defined using the PEP-440 syntax and used as qualifier would come out once encoded as:

pkg:npm/foo?version_range=%7E%3D%200.9%2C%20%3E%3D%201.0%2C%20%21%3D%201.3.4.%2A%2C%20%3C%202.0

Hum :unamused:
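For reference, a minimal sketch of how such an encoded qualifier can be produced and decoded with Python's standard library. One caveat: recent Python versions leave "~" unescaped, so the output differs from the string above in that one character, while decoding gives back the same range either way.

```python
from urllib.parse import quote, unquote

version_range = "~= 0.9, >= 1.0, != 1.3.4.*, < 2.0"

# safe="" forces "/" and every other reserved character to be escaped.
encoded = quote(version_range, safe="")
purl = f"pkg:npm/foo?version_range={encoded}"

# Decoding the qualifier value from the example above recovers the range.
from_thread = "%7E%3D%200.9%2C%20%3E%3D%201.0%2C%20%21%3D%201.3.4.%2A%2C%20%3C%202.0"
decoded = unquote(from_thread)
```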

pombredanne avatar Oct 19 '20 14:10 pombredanne

Ugh, I figured it would make purl version ranges unreadable. If ranges will be included, I don't see a way to eliminate the aesthetic problems it creates. The only thing I can think of is maybe to break down the version clauses into individual purl qualifiers rather than a single version_range string. That would likely make the purl readable without as much encoding.

stevespringett avatar Oct 19 '20 15:10 stevespringett

Hey folks, looks like this issue has gone stale, but I'd love to restart the conversation. @pombredanne 's suggestion seems entirely reasonable, even if the URL encoding of the characters makes it less human readable.

jhutchings1 avatar Oct 28 '21 23:10 jhutchings1

@jhutchings1 the issue has not gone stale at all ... it is just that we are making practical experiments with a version range separately in another repo!

There is a draft spec (it would need to be extracted and brought here) in there: https://github.com/nexB/univers/blob/386eb32468c75ecac25ec872ea004b3257962946/VERSION-RANGE-SPEC.rst

ATM the draft starts to have some legs ... but I need to play with actual real working to validate that this can work at scale. The WIP experimental code in https://github.com/nexB/univers needs to be beefed up to match the spec and tested with vulnerablecode.... but another implementation would be welcomed too!

pombredanne avatar Oct 29 '21 14:10 pombredanne

@jhutchings1 actually I separated a draft spec in a clean branch here https://github.com/nexB/univers/pull/11

pombredanne avatar Oct 29 '21 16:10 pombredanne

/me takes deep breath

i'm gonna break a personal and OSS best practice rule and spread unsupported FUD. Sorry. I'm only doing it because i see concrete progress being made here, and i think not saying something may be more harmful.

i've come to believe that version ranges are, in general, harmful. i do have an alternative that i've been working on for a while - it's not public because it's unfinished, but the relevant bits are plausibly finished enough for purl and the purposes of this discussion. @pombredanne, it's been quite a while since we've talked (FOSDEM 2018, right?), but if you want to catch up about it, i'd be happy to find an hour sometime - DM me on twitter?

Totally understood that my unspecified, general concern should not block actual progress, though.

sdboyer avatar Nov 30 '21 11:11 sdboyer

Maybe playing MisterObvious, but IMHO the issue is not version range or version list: the root cause of our headaches is version computability (operators ==, <, >, <=, >=), and while it is more or less OK for Semver or CalVer, except for some wildcards and attributes corner cases, it is indeed a tough one.

When the NVD introduced the new way of defining ranges with

  • versionStartIncluding
  • versionStartExcluding
  • versionEndIncluding
  • versionEndExcluding

...in JSON CVE Schema 0.1_beta - 2017-11-01 in a very discreet way, some tools that relied on a full list of impacted versions for a CVE broke, and to make matters worse, broke silently. The one I used at the time took 2 years to fix it (I found a better one in the meantime).
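A rough sketch of how the four bounds listed above can be evaluated. The field names are the ones from the list; the function names, the default comparison key and the sample bounds are assumptions made for illustration, and the dotted-integer ordering is a simplification that will not fit every product.

```python
from typing import Callable

def dotted_int_key(version: str) -> tuple:
    # Simplified ordering; real products may need a different rule.
    return tuple(int(part) for part in version.split("."))

def in_nvd_range(version: str, bounds: dict,
                 key: Callable[[str], tuple] = dotted_int_key) -> bool:
    v = key(version)
    start_inc = bounds.get("versionStartIncluding")
    start_exc = bounds.get("versionStartExcluding")
    end_inc = bounds.get("versionEndIncluding")
    end_exc = bounds.get("versionEndExcluding")
    if start_inc is not None and v < key(start_inc):
        return False
    if start_exc is not None and v <= key(start_exc):
        return False
    if end_inc is not None and v > key(end_inc):
        return False
    if end_exc is not None and v >= key(end_exc):
        return False
    return True

# Hypothetical bounds: one entry with a start, one with only an end.
bounded = {"versionStartIncluding": "5.0", "versionEndExcluding": "5.14.8"}
open_start = {"versionEndExcluding": "5.14.8"}
```

With only an end bound, every older version matches, including branches that were never analysed. The result also depends entirely on the `key` function, which the four fields themselves do not specify.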

I'd suggest extreme care: the NVD people have been working on software inventory for 20 years, are not stupid, and yet kind of failed (at least for the use cases we now have).

There has been a very long discussion on the topic in the upcoming CVE JSON Schema development, now in v5.0.0 release candidate 5. I can't find the exact discussion reference again, but as of today on the CVE / CPE side the outcomes are there: schema/v5.0: introduce computable version ranges Merge pull request #100 from rsc/computable-versions

jbmaillet avatar Nov 30 '21 13:11 jbmaillet

@jbmaillet re:

Maybe playing MisterObvious, but IMHO the issue is not version range or version list: the root cause of our headaches is version computability (operators ==, <, >, <=, >=),

yes it is! and a range in any notation demands to be informed by how two versions are compared.

In the experimental spec at https://github.com/nexB/univers/pull/11 for a compact range notation and in the WIP companion working implementation at https://github.com/nexB/univers/tree/main/src/univers by @sbs2001

  • the range syntax or notation is unified and shared by any package type and versioning scheme
  • the version comparison semantics are specific to each versioning scheme (which practically equals a package type)
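The split described in these two points can be sketched like this. The notation here (`scheme/constraint|constraint`), the scheme names and every function are made up for illustration: this is not the draft spec's syntax, nor the univers API, and the constraints are simply ANDed together.

```python
import operator

def dotted_int_key(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))

def epoch_key(version: str) -> tuple:
    # "epoch:version" form; a missing epoch counts as 0.
    epoch, rest = version.split(":", 1) if ":" in version else ("0", version)
    return (int(epoch), dotted_int_key(rest))

# Per-scheme comparison semantics (hypothetical scheme names).
COMPARISON_KEYS = {"dotted": dotted_int_key, "epoch": epoch_key}

# Shared notation: two-character operators are listed first on purpose.
OPERATORS = {">=": operator.ge, "<=": operator.le, "!=": operator.ne,
             ">": operator.gt, "<": operator.lt, "=": operator.eq}

def parse_range(text: str):
    scheme, _, constraints = text.partition("/")
    parsed = []
    for constraint in constraints.split("|"):
        for symbol in OPERATORS:
            if constraint.startswith(symbol):
                parsed.append((symbol, constraint[len(symbol):]))
                break
        else:
            raise ValueError(f"unsupported constraint: {constraint!r}")
    return scheme, parsed

def contains(text: str, version: str) -> bool:
    scheme, constraints = parse_range(text)
    key = COMPARISON_KEYS[scheme]
    return all(OPERATORS[symbol](key(version), key(bound))
               for symbol, bound in constraints)
```

The parser never looks at what a version means; only the key function selected by the scheme does, so supporting another ecosystem amounts to registering another key.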

The WIP spec has extensive research on the topic when used for vulnerable ranges, including the NVD approach, but also when used for package dependencies ranges.

The NVD versionStartIncluding, versionStartExcluding, versionEndIncluding, versionEndExcluding are missing one important data piece which is how to compare two versions as greater or lesser which is something that has been integrated as versionType in https://github.com/CVEProject/cve-schema/commit/e3d43c6df4a571b2f4b469ad06a6f45ca13856c6 and the related https://github.com/ossf/osv-schema spec

The draft "vers" spec tries to address this with a slightly different goal: to have a compact yet obvious notation for version ranges. Practically the companion "univers" Python library at https://github.com/nexB/univers/tree/main/src/univers relies on multiple package-type/ecosystem specific comparison functions: these include for now node-semver (as used in npm), rpms, maven, debian, gentoo, arch, semver, ruby and even one which is specific to a single package, nginx, which has its own peculiar way to define vulnerable ranges in its advisories (see https://github.com/nexB/univers/blob/7a99ab9288ff8e20bcc69b4b383015be6615c2b9/src/univers/version_range.py#L375 )

At this stage it makes sense that I move the draft at https://github.com/nexB/univers/pull/11 to a PR here as the two are closely tied! :)

pombredanne avatar Nov 30 '21 14:11 pombredanne

@jbmaillet and everyone here ... See https://github.com/package-url/purl-spec/pull/139 .... comments are badly needed.

pombredanne avatar Nov 30 '21 15:11 pombredanne

the root cause of our headaches is version computability (operators ==, <, >, <=, >=)

Indeed. But,

while it is more or less OK for Semver or CalVer

i'd say "less," at least for semver, where the tendency is to construct ranges with bounds on versions that may not yet exist - and even if they do, versions may come to exist after the publication of the range.

But, i see this comment https://github.com/CVEProject/cve-schema/issues/87#issuecomment-904398866, particularly:

I came to appreciate that version ranges can only ever be an approximation; and that a complete enumeration of all affected versions is the only correct statement

and suspect that if you're embracing ranges while having accepted this, then the additional things i could add will be of marginal value, which is OK. /me bows out

sdboyer avatar Nov 30 '21 15:11 sdboyer

I'd agree that version ranges as a mechanism for choosing dependencies are generally bad. (Hence why things like LATEST and RELEASE were deprecated in Maven 3, years ago). However for this spec, we still need a way to express ranges, eg this vulnerability applies to versions x to y. IOW ranges are required to be expressive generally, but using them to declare dependencies is a bridge too far...but I don't think that's the point of the spec here.

brianf avatar Nov 30 '21 16:11 brianf

@sdboyer Hey! it has been a while.... great for you to drop by! Let me ping you on twitter. I am pombr there

i've come to believe that version ranges are, in general, harmful.

I agree ++. I am intrigued by what your alternative could be!

Eventually ranges are all leaky and make false promises at some level, and in practice only full enumerations might be correct. Yet ... they do exist in the wild and capturing the wild beasts is what I want somehow.

@brianf re:

However for this spec, we still need a way to express ranges, eg this vulnerability applies to versions x to y. IOW ranges are required to be expressive generally,

Exactly.

but using them to declare dependencies is a bridge too far...but I don't think that's the point of the spec here.

The spec does not take a stand on how ranges would be used and it could be used to depict vulnerable or dependent ranges. I have a question though wrt. Maven: how common would you say using ranges is? https://maven.apache.org/pom.html#Dependency_Version_Requirement_Specification

pombredanne avatar Nov 30 '21 18:11 pombredanne

Ranges in Maven are very rarely used. Going way back to the start of Maven 2 it was understood that it was an anti-pattern with limited use cases and that ultimately was memorialized with the Maven 3 changes I referenced above. Build reproducibility was the key reason to discourage ranges back in the day.

brianf avatar Nov 30 '21 18:11 brianf

@brianf re:

Ranges in Maven are very rarely used.

Thanks... this confirms my impression. As an aside, it's funny that dependency ranges have been mostly abandoned by Maven, yet are fairly prevalent in Python, npm and Ruby package manifests, commonly accompanied by an extra full enumeration of pinned versions a.k.a. a lockfile.

pombredanne avatar Nov 30 '21 19:11 pombredanne

Maven takes the stance that you should be locking by default, with tooling to make updates when you want/need to. Other systems take the opposite approach which is why you see the prevalence of lockfiles to achieve the same thing.
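A small sketch of the two approaches, using invented package names and versions: a manifest declares a range, and a lockfile records the one version that the range resolved to at a given time.

```python
# Hypothetical data throughout; simplified dotted-integer ordering.

def key(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))

def resolve(low: str, high: str, candidates: list) -> str:
    # Pick the highest available version with low <= version < high.
    matching = [v for v in candidates if key(low) <= key(v) < key(high)]
    return max(matching, key=key)

manifest = {"foo": ("1.0.0", "2.0.0")}          # range-style declaration
available = ["1.0.0", "1.2.0", "1.4.1", "2.0.0"]

# The lockfile pins whatever the range resolved to when it was generated.
lockfile = {name: resolve(low, high, available)
            for name, (low, high) in manifest.items()}

# Later, a new release appears: the range now resolves differently,
# while the lockfile still records the earlier choice.
available_later = available + ["1.5.0"]
resolved_later = resolve(*manifest["foo"], available_later)
```

Pinning by default skips the resolution step entirely, which is the reproducibility argument made above.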

brianf avatar Nov 30 '21 19:11 brianf

(This is a long rant, but you can jump to the conclusion.)

To give a bit of context to my comments, and self introducing: I work in the IoT/embedded field, for automotive systems, cybersecurity (plus a bit of OSS licensing). That's Linux, Android, AUTOSAR, FreeRTOS as environments, and a SLOC count of 95% of C/C++ the rest being Java or Kotlin. Everything is built from sources, either from archives + good old autotools or from plain git repositories (think a la AOSP or buildroot or Yocto).

The source code is often a fork from upstream, at least for the Linux kernels (the SoC vendors fork the kernel, and we fork it again for our own customization, or for some CVE or plain bug backports because upgrading is extremely painful in such context). Last time I checked, the Linux kernel on an LTS branch such as those we use has 13000+ Kconfig options: in a typical industrial product, only 20% of the code is actually compiled (and hence only roughly 20% of the CVE are relevant, for example).

An Android source tree (which is more than AOSP, because AOSP does not come with a kernel for your SoC CPU nor bootloader nor hypervisor etc) + our added OSS and our proprietary code is about 100GB before build, with about 800 git repositories (AOSP for Android 12, without kernel nor added customization, consists of 1079 git repositories as of today). And it is only part of a system/product, and of course we have several systems/products.

As a result, a complete system/product will have about 1000 CVEs to track (yes, a thousand), 95% of which are false positives (code not compiled per build configuration, fix backported, but mostly poorly documented CVEs, more on this later) in hundreds of repositories. Plus our suppliers' private advisories etc.

In this context, I use CVE (and other sources), and so I'm stuck with CPE for now, ready to jump to SWID when they are used by the NVD. I do not (yet) use purl, nor SPDX, both for lack of need and for lack of spare time (also, we have our own internal tooling and processes in place). But I consider any effort in software inventory, version computation/matching, dependency tracking, SBOM as important for my security assessments and OSS licensing compliance, and I try to follow the development in these areas. So BTW: thank you all for your work.

This being said:

  • projects like univers are, sadly, not usable for me because my ecosystem is C/C++ bare metal code built from sources

  • @sdboyer "a complete enumeration of all affected versions is the only correct statement", yes, I 100% agree, but this does not work, never did, and I'm afraid never will. Take the Linux kernel. Right now, there are 7 branches developed and maintained (4.4, 4.9, 4.14, 4.19, 5.4, 5.10, 5.15). Plus the old products that are still in use in some cars on the road (product serial life). New releases / tags of these 7 branches are made every week (an RSS feed exists). Now let's pick at random one of my thousand CVEs: CVE-2021-43057. As you can see, it is just documented as "Up to (excluding) 5.14.8". In particular it does not:

    • give the status for each of the other 6 current branches (and this is very rarely done, maybe 1% of the kernel CVEs have this level of detail)
    • say in which version(s) the issue was introduced

And this is not done because it is too much of an analysis workload... so this workload is transferred to auditors/analysts like me.

Considering that the kernel is by far my biggest volume and flow of continuously incoming CVEs, that most kernel maintainers don't care about CVEs (some of them even making it a personal matter), that this situation has always been so (even when versions were, partially, listed in the NVD before the UpToExcluding etc. syntax), and that the kernel organization or Linux Foundation is not and does not want to be a CNA, a full enumeration of versions will never, ever, work. Version range is the "least worst" option.

Google with Android is even worse: they put all their CVEs in a unique CPE o:google:android, and good luck tracking which of its 800-1000 repositories the issue is in if you are not an official Google partner with privileged access to their bulletins, only relying on the public ones.

CONCLUSION:

Don't get me wrong: a full version list could work in theory, it would be suitable and great, but it does not match the industrial reality. And it's in great part a question of people and organizations, not a question of specification. So computable version ranges are hard, but they are a MUST. You can enumerate versions for a libfoobar that has a new CVE once per month or quarter, but this does not matter if you do not address code bases such as the kernel, which as of today has more than 500 CVEs on its 4.14 branch and close to 2500 CVEs in all its history, or Android (close to 3800 CVEs in all its history), with new ones coming every week. These are some of the hard cases I deal with daily. I imagine there are others in other ecosystems/industries: I have seen hundreds of CVEs on Windows/Oracle/Citrix/IT products or Jenkins/Atlassian/tooling passing every week.

Sorry for this long rant. At least, if you never worked in the embedded field, now you know why "the S in IoT is for Security". ;-)

jbmaillet avatar Dec 01 '21 17:12 jbmaillet

@jbmaillet re https://github.com/package-url/purl-spec/issues/66#issuecomment-983847094

This all makes sense, and I agree: saying this is a mess is an understatement!

I am rather familiar with contexts similar to yours and this brings a question (and possibly something we could craft into some project): assuming that you can efficiently determine and trace the subset of kernel code that you use in a given build , what is the minimum you would need to be able to sort CVEs there? Would knowing the fixing commit (and therefore a fixing patch) be enough as a first pass to determine if the built code subset contains the fixable code? I feel that you are likely solving an important problem and that there may be a way to pull and pool energies to fix this together (probably elsewhere, not in purl proper)

(side note: I have somewhat efficiently used strace to trace kernel and full Android devices builds to find out which code subset is baked into a build with https://github.com/nexB/tracecode-toolkit )

pombredanne avatar Dec 09 '21 11:12 pombredanne

@pombredanne , in my experience, on the kernel which is my hard case, there are:

  • 80% of CVEs which are false positives per build configuration: buggy file(s) simply not compiled.
  • Between 20 and 50% of CVEs which are false positives because either you already have the fix in your history, or you (or the upstream) backported the patch, depending on how close/far you are from the tip of a given branch.

These 2 sets of course overlap.

CVE documentation is most of the time terrible, but you can still cross-leverage on it to help the situation.

First, the build configuration: it is very easy and build-system agnostic to generate a compilation database using a tool such as Bear (don't get blocked on the Clang aspect: it works as well with regular or cross gcc; same for the CMake things: it works fine with good old GNU make or totally alien build systems such as Android with its ninja/soong). The only limitation is C/C++. Note that there are tools similar to Bear for other languages / ecosystems.

It is also very easy to get a list of files involved in a CVE, if they are mentioned in the CVE description as is often the case for the kernel, just by using good old regexp.

Then you cross both sources of information and voila: you know if a file was compiled or not, and hence if the CVE is relevant or not. 80% of kernel CVE automatically sorted out as false positives.
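A minimal sketch of that cross-check, assuming a compilation database in the JSON Compilation Database format (a list of entries with "directory" and "file" keys, as written by tools such as Bear). The database content, the CVE descriptions, the regexp and the function names are all invented for illustration.

```python
import json
import posixpath
import re

# Stand-in for the content of a compile_commands.json file.
COMPILATION_DATABASE = json.loads("""[
  {"directory": "/build/linux", "file": "drivers/net/foo.c",
   "command": "gcc -c drivers/net/foo.c"},
  {"directory": "/build/linux", "file": "/build/linux/kernel/sched/core.c",
   "command": "gcc -c kernel/sched/core.c"}
]""")

# Rough pattern for C/C++ source paths in free text (a heuristic).
SOURCE_PATH = re.compile(r"[\w\-./]+\.(?:cpp|cc|c|h)\b")

def compiled_files(database: list, source_root: str) -> set:
    # Normalise every entry to a path relative to the source root.
    files = set()
    for entry in database:
        absolute = posixpath.normpath(
            posixpath.join(entry["directory"], entry["file"]))
        files.add(posixpath.relpath(absolute, source_root))
    return files

def cve_is_relevant(description: str, compiled: set):
    # True/False when the description names files, None when it names
    # none and the CVE has to be reviewed by hand.
    mentioned = set(SOURCE_PATH.findall(description))
    if not mentioned:
        return None
    return bool(mentioned & compiled)

compiled = compiled_files(COMPILATION_DATABASE, "/build/linux")
```

A CVE whose description names only files outside `compiled` can be set aside as not built; one that names no file at all still needs a human.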

Then for the fixes and backports: The kernel does not mandate mentioning a CVE Id in a commit, so this is unusable[*]. But there is an official kernel documentation for a backport formalism in the git commit message. It is easy to spot CVE references which include a full git sha1, again using regexp. So you get the git sha1 of the fix(es), you explore your git history searching for either the sha1 "as is" or the "sha1 as a backport" and voila, 20-50% of CVEs automatically sorted out as false positives.
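This second pass can be sketched as below. The sha1 values, the reference URL and the commit messages are made up, and the history is an in-memory list of (sha1, message) pairs; in practice it could come from something like `git log --format=%H%n%B`. Treating any full-length sha1 quoted in a commit message as a backport marker is a heuristic, not the kernel's exact rule.

```python
import re

FULL_SHA1 = re.compile(r"\b[0-9a-f]{40}\b")

def fix_shas(cve_references: list) -> set:
    # Collect every full sha1 found in the CVE's reference strings.
    return {sha for ref in cve_references for sha in FULL_SHA1.findall(ref)}

def is_fixed(history: list, fixes: set) -> bool:
    for sha, message in history:
        if sha in fixes:
            return True          # the fix commit itself is in our history
        if fixes & set(FULL_SHA1.findall(message)):
            return True          # a backport quoting the upstream sha1
    return False

# Made-up data for illustration.
UPSTREAM_FIX = "1234567890abcdef" * 2 + "12345678"
references = [f"https://git.example.org/linux.git/commit/?id={UPSTREAM_FIX}"]

backported_history = [
    ("f" * 40, f"net: foo: fix overflow\n\ncommit {UPSTREAM_FIX} upstream.\n"),
    ("e" * 40, "net: foo: add driver\n"),
]
unpatched_history = [("e" * 40, "net: foo: add driver\n")]

fixes = fix_shas(references)
```

`is_fixed` returns true for the history carrying the backport and false for the other one, so the first CVE can be closed automatically and the second goes to an analyst.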

There are a few corner cases in both passes, but you get the idea. Also note that this can be used on other pieces of software, either similarly highly configurable, or that use the same backport formalism (for example GNOME GLib).

The information still missing, is when / in which version was the bug first introduced? Some people do it, without giving the full details.

PS:

(side note: I have somewhat efficiently used strace to trace kernel and full Android devices builds to find out which code subset is baked into a build with https://github.com/nexB/tracecode-toolkit )

Some colleagues and I used such an strace strategy circa 2011 for OSS and commercial licensing compliance, which was our concern at the time, with good results. But there are other and better technologies now, we would do it differently (see above for Bear as an example).

*: It makes sense not to mandate a CVE Id in a commit. For example you might not get a CVE Id yet if you are not a CNA (which the kernel should be!) and still would want to do the fix. My metric figures show that since 2002, CVE Ids have never been mentioned in more than 25% of the cases, topping in CVE-2013-NNNN. In August, it was around 15% for CVE-2021-NNNN.

jbmaillet avatar Dec 10 '21 08:12 jbmaillet

@pombredanne What is the current status of https://github.com/package-url/purl-spec/blob/version-range-spec/VERSION-RANGE-SPEC.rst? Ready for use? :thinking:

tschmidtb51 avatar Jan 26 '22 21:01 tschmidtb51

@tschmidtb51 I am pretty satisfied with it at this stage. Unless there are objections I will likely merge it this week.

pombredanne avatar Feb 01 '22 14:02 pombredanne

@tschmidtb51 I am pretty satisfied with it at this stage. Unless there are objections I will likely merge it this week.

@pombredanne In general, I like the approach. I flagged some details, where I think the spec should be improved for clarity and the benefit of simplicity (e.g. prohibit consecutive pipes and empty <version-constraint>).

tschmidtb51 avatar Feb 02 '22 09:02 tschmidtb51