grype-db 🪄 Schema version 6 wishlist

As these are implemented, please edit this field to include the PR that implements it within the wishlist below:

[x] ~Consider switching json column processing to bson or another format that is space-efficient and potentially more performant to parse~ json compresses better than binary data, and we already have a compressed archive, so no benefit of changing this
[ ] Drop the version_constraint and version_type columns and consolidate into the new package_qualifiers column
[ ] Remove the id table
[ ] Remove the RecordSource field from vulnerability metadata (indicates the feed group, which is now fictitious)
[ ] Be able to show disputed state for vulnerability without conflating with fixed state
[ ] Express and distinguish NVD vs CNA CVSS scores (but keep both)
[ ] Allow for different vulnerability formats to be stored in the DB at once (e.g. OSV + original format)
[ ] Consider a table per provider approach
[ ] Capture dates where available (published, updated, withdrawn)
[ ] Capture CISA info from NVD records (NVD should probably have it's own dedicated table)
[ ] Capture the built timestamp per provider workspace (already available from the vunnel metadata, just needs someplace to live in grypedb)
[ ] We want to be able to know what fix in information is available for any record related to a match (even downstream , anchore/grype#1329 and https://github.com/anchore/grype/issues/236)
[ ] Want to be able to correlate / search records by PURL (not only CPEs) https://github.com/anchore/grype/issues/1267#issuecomment-1583571000
[ ] maven sha to groupid/artifactid lookup\
[ ] some way to capture the affected function calls when available from a source (https://github.com/ossf/osv-schema/issues/127)
[ ] retain more details about namespaces so lookups from available information in grype can be more robust when some data is omitted (https://github.com/anchore/grype-db/issues/108#issuecomment-1684428757)
[ ] retain not-affected entries somewhere so they can be used for negative match lookups in grype (https://github.com/anchore/grype-db/issues/108#issuecomment-1668474352)
[ ] GHSA entries should have "title" and "description" fields (right now only "description" column exists, and contains titles of GHSA entries), see https://github.com/anchore/grype/issues/1069
[ ] GHSA entries should have "published at"
[ ] Re-orient vulnerability records for non-os packaging ecosystems around the package type and registry? Currently they are oriented around language, but that can cause problems when a singe package registry is shared by multiple language ecosystems and also can get confusing when a single language has many packaging ecosystems (we can support the second case in the current model, but it is probably clearer to re-orient)

Focuses:

look at space savings for uncompressed DB while keeping matching speed a priority

May 24 '23 13:05 wagoodman

This wouldn't save space, but currently when RHEL and Mariner feeds report a package as "not affected" we just drop the record in vunnel. It would be helpful if this was instead expressed in the database. (Might overlap with the disputed requirement mentioned above.)

Aug 07 '23 19:08 willmurphyscode

v6 should have some way of looking up the correct namespace off of something more than just version. For instance, we're already maintaining mappings in vunnel of codenames to versions for ubuntu and debian and it would be useful to retain that information in the db itself so that grype can still find the correct vuln namespace even if one of the pieces of information might be missing (like in https://github.com/anchore/grype/issues/1446)

Keeping it in the db would mean it could be updated automatically as new namespaces are added and grype could make use of it for lookups immediately whereas maintaining a static mapping in grype means we'd need to remember to maintain that mapping in multiple places (vunnel and grype), and users would need to upgrade to the latest grype for newer namespaces

Aug 18 '23 20:08 westonsteimel

The ability to know when a particular record was added or modified within the grype database came up in a community discussion. Although we are currently always building up the database from scratch, I think it may make sense to include something like an added, and modified timestamp column to each record in the db so that if/when we add diffing/partial updates like in https://github.com/anchore/grype-db/issues/143 we'd already have the necessary columns in place and not necessarily need another schema bump at that time

Sep 13 '23 09:09 westonsteimel

Updated the top comment to point to https://github.com/anchore/grype/issues/1498 as a specific issue for "Capture dates where available".

Sep 18 '23 15:09 willmurphyscode

We need the ability to represent a CVE that affects different packages, but has different severity ratings for each package. As a concrete example, https://security-tracker.debian.org/tracker/CVE-2023-44487 lists a table with multiple severities for different packages. Here's a snippet of the relevant table on that page, in case it changes or moves:

Package	Type	Release	Fixed Version	Urgency	Origin	Debian Bugs
h2o	source	buster	2.2.5+dfsg2-2+deb10u2		DLA-3638-1
h2o	source	(unstable)	2.2.5+dfsg2-8			1054232
haproxy	source	(unstable)	1.8.13-1
jetty9	source	buster	9.4.50-4+deb10u1		DLA-3641-1
jetty9	source	bullseye	9.4.50-4+deb11u1		DSA-5540-1
jetty9	source	bookworm	9.4.50-4+deb12u2		DSA-5540-1
jetty9	source	(unstable)	9.4.53-1
netty	source	(unstable)	(unfixed)			1054234
nghttp2	source	buster	1.36.0-2+deb10u2		DLA-3621-1
nghttp2	source	(unstable)	1.57.0-1			1053769
nginx	source	(unstable)	1.24.0-2	unimportant		1053770
tomcat10	source	bookworm	10.1.6-1+deb12u1		DSA-5521-1

Note the "urgency" is marked as "unimportant" on the row for nginx. This rating translates in Vunnel to "negligible" in grype-db.

Currently, in the database, this CVE is represented like this:

-- VULNERABILITY TABLE
sqlite> select id, package_name from vulnerability where 
id="CVE-2023-44487" and namespace="debian:distro:debian:12";
id              package_name
--------------  -------------
CVE-2023-44487  h2o
CVE-2023-44487  haproxy
CVE-2023-44487  jetty9
CVE-2023-44487  netty
CVE-2023-44487  nghttp2
CVE-2023-44487  nginx
CVE-2023-44487  tomcat10
CVE-2023-44487  tomcat9
CVE-2023-44487  trafficserver
-- VULNERABILITY_METADATA table
sqlite> select id, severity from vulnerability_metadata where 
id="CVE-2023-44487" and namespace="debian:distro:debian:12";
id              severity
--------------  ----------
CVE-2023-44487  Negligible

As you can see, the database has no good way of writing down, "this CVE is more severe if matched against tomcat than against nginx," but Debian's data is clearly trying to tell us that.

I believe this would be fixed by having a proper foreign key from vulnerability to vulnerability_metadata, rather than just relying on ID+Namespace to match. We could also move the severity column to the vulnerability table, but that would probably result in a lot of duplicate values.

Nov 06 '23 20:11 willmurphyscode

We need to better capture the relationships between identifiers between ecosystems. Currently, we have the related_vulnerabilities column, but that is on the package vuln record and requires knowing the grypedb vuln namespace that an id corresponds to, which is difficult when that namespace is something dynamic like with github where the GHSA could cover several package ecosystems. We don't really care about the namespace, only that a particular CVE corresponds to a GHSA or vice versa.

Dec 20 '23 13:12 westonsteimel

Tables per provider/ecosystem pair with a schema specific to the ecosystem So we'd have a table with a lookup on provider name and ecosystem name (likely based on the package url spec, though extended in the case of generic purls so we could have specific schemas tailored to classes of binary packages like java, python interpreter, etc?) that would get the name of the table to load the relevant data from. Should it also be separate tables for affected/not affected or the same table?

Jan 20 '24 07:01 westonsteimel

@wagoodman, I was just wondering in light of all of the CDN issues that have been cropping up if it might make sense to do something like partition v6 databases per provider and then perhaps make grype smarter about what it downloads based on what it needs? Like we know currently the sles data far outnumbers the rest but if someone is never scanning sles containers there would be no need to ever fetch that subset of the data in their ci pipelines running grype

Anyways, just a very rough thought and apologies if this has already been discussed elsewhere. I'm "on vacation" so haven't yet seen all of the discussion that may have occurred on Friday.

Aug 17 '24 06:08 westonsteimel

@westonsteimel we were talking about that yesterday. There might be some big performance gains by splitting the DB by provider.

The main drawback I see is that, right now, grype does 2 relatively slow things at the same time: it generates the SBOM, and it downloads the database. But I don't think it can know what database pieces to download before the SBOM is generated. Maybe we could do something like check the distro early, and then download the database for the distro while making the rest of the SBOM.

We've also talked about trying to make incremental updates to the database available, but I think partitioning by provider would be much simpler to implement.

Aug 17 '24 09:08 willmurphyscode

I'm going to close this since this wishlist has been converted into tangible issues for v6

Sep 25 '24 18:09 wagoodman

grype-db grype-db copied to clipboard

🪄 Schema version 6 wishlist

grype-db
grype-db copied to clipboard