grype-db
grype-db copied to clipboard
🪄 Schema version 6 wishlist
As these are implemented, please edit this field to include the PR that implements it within the wishlist below:
- [x] ~Consider switching json column processing to bson or another format that is space-efficient and potentially more performant to parse~ json compresses better than binary data, and we already have a compressed archive, so no benefit of changing this
- [ ] Drop the
version_constraint
andversion_type
columns and consolidate into the newpackage_qualifiers
column - [ ] Remove the
id
table - [ ] Remove the
RecordSource
field from vulnerability metadata (indicates the feed group, which is now fictitious) - [ ] Be able to show
disputed
state for vulnerability without conflating withfixed state
- [ ] Express and distinguish NVD vs CNA CVSS scores (but keep both)
- [ ] Allow for different vulnerability formats to be stored in the DB at once (e.g. OSV + original format)
- [ ] Consider a table per provider approach
- [ ] Capture dates where available (published, updated, withdrawn)
- [ ] Capture CISA info from NVD records (NVD should probably have it's own dedicated table)
- [ ] Capture the built timestamp per provider workspace (already available from the vunnel metadata, just needs someplace to live in grypedb)
- [ ] We want to be able to know what fix in information is available for any record related to a match (even downstream , anchore/grype#1329 and https://github.com/anchore/grype/issues/236)
- [ ] Want to be able to correlate / search records by PURL (not only CPEs) https://github.com/anchore/grype/issues/1267#issuecomment-1583571000
- [ ] maven sha to groupid/artifactid lookup\
- [ ] some way to capture the affected function calls when available from a source (https://github.com/ossf/osv-schema/issues/127)
- [ ] retain more details about namespaces so lookups from available information in grype can be more robust when some data is omitted (https://github.com/anchore/grype-db/issues/108#issuecomment-1684428757)
- [ ] retain not-affected entries somewhere so they can be used for negative match lookups in grype (https://github.com/anchore/grype-db/issues/108#issuecomment-1668474352)
- [ ] GHSA entries should have "title" and "description" fields (right now only "description" column exists, and contains titles of GHSA entries), see https://github.com/anchore/grype/issues/1069
- [ ] GHSA entries should have "published at"
- [ ] Re-orient vulnerability records for non-os packaging ecosystems around the package type and registry? Currently they are oriented around language, but that can cause problems when a singe package registry is shared by multiple language ecosystems and also can get confusing when a single language has many packaging ecosystems (we can support the second case in the current model, but it is probably clearer to re-orient)
Focuses:
- look at space savings for uncompressed DB while keeping matching speed a priority
This wouldn't save space, but currently when RHEL and Mariner feeds report a package as "not affected" we just drop the record in vunnel. It would be helpful if this was instead expressed in the database. (Might overlap with the disputed
requirement mentioned above.)
v6 should have some way of looking up the correct namespace off of something more than just version
. For instance, we're already maintaining mappings in vunnel of codenames to versions for ubuntu and debian and it would be useful to retain that information in the db itself so that grype can still find the correct vuln namespace even if one of the pieces of information might be missing (like in https://github.com/anchore/grype/issues/1446)
Keeping it in the db would mean it could be updated automatically as new namespaces are added and grype could make use of it for lookups immediately whereas maintaining a static mapping in grype means we'd need to remember to maintain that mapping in multiple places (vunnel and grype), and users would need to upgrade to the latest grype for newer namespaces
The ability to know when a particular record was added or modified within the grype database came up in a community discussion. Although we are currently always building up the database from scratch, I think it may make sense to include something like an added
, and modified
timestamp column to each record in the db so that if/when we add diffing/partial updates like in https://github.com/anchore/grype-db/issues/143 we'd already have the necessary columns in place and not necessarily need another schema bump at that time
Updated the top comment to point to https://github.com/anchore/grype/issues/1498 as a specific issue for "Capture dates where available".
We need the ability to represent a CVE that affects different packages, but has different severity ratings for each package. As a concrete example, https://security-tracker.debian.org/tracker/CVE-2023-44487 lists a table with multiple severities for different packages. Here's a snippet of the relevant table on that page, in case it changes or moves:
Package | Type | Release | Fixed Version | Urgency | Origin | Debian Bugs |
---|---|---|---|---|---|---|
h2o | source | buster | 2.2.5+dfsg2-2+deb10u2 | Â | DLA-3638-1 | Â |
h2o | source | (unstable) | 2.2.5+dfsg2-8 | Â | Â | 1054232 |
haproxy | source | (unstable) | 1.8.13-1 | Â | Â | Â |
jetty9 | source | buster | 9.4.50-4+deb10u1 | Â | DLA-3641-1 | Â |
jetty9 | source | bullseye | 9.4.50-4+deb11u1 | Â | DSA-5540-1 | Â |
jetty9 | source | bookworm | 9.4.50-4+deb12u2 | Â | DSA-5540-1 | Â |
jetty9 | source | (unstable) | 9.4.53-1 | Â | Â | Â |
netty | source | (unstable) | (unfixed) | Â | Â | 1054234 |
nghttp2 | source | buster | 1.36.0-2+deb10u2 | Â | DLA-3621-1 | Â |
nghttp2 | source | (unstable) | 1.57.0-1 | Â | Â | 1053769 |
nginx | source | (unstable) | 1.24.0-2 | unimportant | Â | 1053770 |
tomcat10 | source | bookworm | 10.1.6-1+deb12u1 | Â | DSA-5521-1 | Â |
Note the "urgency" is marked as "unimportant" on the row for nginx
. This rating translates in Vunnel to "negligible" in grype-db.
Currently, in the database, this CVE is represented like this:
-- VULNERABILITY TABLE
sqlite> select id, package_name from vulnerability where
id="CVE-2023-44487" and namespace="debian:distro:debian:12";
id package_name
-------------- -------------
CVE-2023-44487 h2o
CVE-2023-44487 haproxy
CVE-2023-44487 jetty9
CVE-2023-44487 netty
CVE-2023-44487 nghttp2
CVE-2023-44487 nginx
CVE-2023-44487 tomcat10
CVE-2023-44487 tomcat9
CVE-2023-44487 trafficserver
-- VULNERABILITY_METADATA table
sqlite> select id, severity from vulnerability_metadata where
id="CVE-2023-44487" and namespace="debian:distro:debian:12";
id severity
-------------- ----------
CVE-2023-44487 Negligible
As you can see, the database has no good way of writing down, "this CVE is more severe if matched against tomcat than against nginx," but Debian's data is clearly trying to tell us that.
I believe this would be fixed by having a proper foreign key from vulnerability to vulnerability_metadata, rather than just relying on ID+Namespace to match. We could also move the severity column to the vulnerability table, but that would probably result in a lot of duplicate values.
We need to better capture the relationships between identifiers between ecosystems. Currently, we have the related_vulnerabilities
column, but that is on the package vuln record and requires knowing the grypedb vuln namespace that an id corresponds to, which is difficult when that namespace is something dynamic like with github where the GHSA could cover several package ecosystems. We don't really care about the namespace, only that a particular CVE corresponds to a GHSA or vice versa.
Tables per provider/ecosystem pair with a schema specific to the ecosystem So we'd have a table with a lookup on provider name and ecosystem name (likely based on the package url spec, though extended in the case of generic purls so we could have specific schemas tailored to classes of binary packages like java, python interpreter, etc?) that would get the name of the table to load the relevant data from. Should it also be separate tables for affected/not affected or the same table?
@wagoodman, I was just wondering in light of all of the CDN issues that have been cropping up if it might make sense to do something like partition v6 databases per provider and then perhaps make grype smarter about what it downloads based on what it needs? Like we know currently the sles data far outnumbers the rest but if someone is never scanning sles containers there would be no need to ever fetch that subset of the data in their ci pipelines running grype
Anyways, just a very rough thought and apologies if this has already been discussed elsewhere. I'm "on vacation" so haven't yet seen all of the discussion that may have occurred on Friday.
@westonsteimel we were talking about that yesterday. There might be some big performance gains by splitting the DB by provider.
The main drawback I see is that, right now, grype does 2 relatively slow things at the same time: it generates the SBOM, and it downloads the database. But I don't think it can know what database pieces to download before the SBOM is generated. Maybe we could do something like check the distro early, and then download the database for the distro while making the rest of the SBOM.
We've also talked about trying to make incremental updates to the database available, but I think partitioning by provider would be much simpler to implement.
I'm going to close this since this wishlist has been converted into tangible issues for v6