Define vulnerability database schema
Current Behavior
We're aiming to build a mechanism to assemble and distribute a vulnerability database tailored to Dependency-Track's needs.
The goals for our own database are defined here: https://docs.google.com/document/d/1DVV4ik7NGOBc6u-fdPlPVKoplNmSpzDT6iC4FJAFYi0/edit?tab=t.0#heading=h.w22q0gsagz1c
To enable implementation work on this effort, we need a schema that defines how we persist the data in question.
Proposed Behavior
Define a database schema for the vulnerability database itself.
> [!NOTE]
> A SQL schema (e.g. for SQLite) is preferred, but we are open to other solutions.
Quoting some relevant entries of our goals:
- The database MAY leverage existing schemata for vulnerability data, such as CycloneDX, CVE 5.x, and OSV.
- The database MUST combine data from multiple public sources, such as NVD, GitHub Advisories, and OSV.
- The database MUST support addition of more data sources over time, such as Red Hat’s.
- The database SHOULD support regional databases, such as China’s CNNVD. Since not all users will need or even want those, the database SHOULD provide a means to exclude them.
- The database MUST support software and hardware components.
- The database MUST support component identifiers in CPE and Package URL format.
  - The database SHOULD be designed in a way that allows for more identifiers to be added later, for example GS1 GTIN.
- The database SHOULD represent versions and version ranges in vers format.
- The database MAY support additional matching information, such as imports and symbols for reachability analysis. Go’s database provides this data. Cdxgen can produce call stack information in BOMs.
- The database SHOULD be file-based. Distributing static files via CDN is preferable to operating a publicly available service, both from a cost and availability perspective.
- The database SHOULD provide a mechanism to consume only a subset of data. Consumers should not have to re-download tens of gigabytes of files when most of the data hasn’t changed.
- The database MAY provide a mechanism to consume it as a stream, or in chunks. New deployments of DT would become operational more quickly if the latest vulnerabilities could be consumed first, within a short period of time.
- The database SHOULD provide a means for maintainers and community members to fix erroneous data. For example, wrong version ranges, or missing severities.
- The database MAY support enrichment of vulnerability data with volatile information, such as EPSS scores, or presence in CISA KEV.
The schema should focus on data we actually need, not on what data is available. A rough, purely illustrative sketch of such a schema is included after the note below.
> [!NOTE]
> The vulnerability database’s schema does not necessarily have to align with DT’s database schema. However, it would be beneficial if it were easy to consume from. Since we want this to be file-based, we can't rely on a web API to transform and filter data.
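To make the goals above more concrete, here is a minimal, purely illustrative sketch of what a SQLite schema along these lines could look like. All table and column names (`vuln`, `vuln_affects`, and so on) are hypothetical and only meant to show the source namespacing, the CPE / Package URL identifiers, and the vers ranges called for above; this is not a proposal for the final schema.

```sql
-- Hypothetical sketch only; names and columns are placeholders.
CREATE TABLE vuln (
  id          TEXT NOT NULL,  -- e.g. 'CVE-2024-12345' or 'GHSA-xxxx-xxxx-xxxx'
  source      TEXT NOT NULL,  -- e.g. 'NVD', 'GITHUB', 'OSV'; records are namespaced by source
  description TEXT,
  severity    TEXT,
  modified_at TEXT NOT NULL,  -- lets consumers fetch only records changed since their last sync
  PRIMARY KEY (id, source)
);

-- Affected software and hardware components, identified by Package URL and/or CPE,
-- with version ranges expressed in vers notation.
CREATE TABLE vuln_affects (
  vuln_id TEXT NOT NULL,
  source  TEXT NOT NULL,
  purl    TEXT,               -- e.g. 'pkg:maven/com.acme/acme-lib'
  cpe     TEXT,               -- e.g. 'cpe:2.3:a:acme:acme_lib:*:*:*:*:*:*:*:*'
  vers    TEXT,               -- e.g. 'vers:maven/>=1.0.0|<1.2.3'
  FOREIGN KEY (vuln_id, source) REFERENCES vuln (id, source)
);
```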
Checklist
- [x] I have read and understand the contributing guidelines
- [x] I have checked the existing issues for whether this enhancement was already requested
My initial plan is to download the data from each source, normalize it, and write it to separate SQLite databases.
Handling each source separately allows for parallelization, despite SQLite only supporting a single writer at a time. The data model accounts for the fact that datasets are "namespaced" by source, since the data can differ across sources. This approach also keeps the import process very simple to implement and extend. Importers can leverage their own databases to keep state, which enables incremental imports.
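As a rough illustration, and assuming the hypothetical `vuln` table from the sketch above, each per-source database could carry its own import state so that subsequent runs only fetch what changed:

```sql
-- Hypothetical per-source database (e.g. nvd.sqlite); one file per enabled source.
-- The importer persists its own sync state, enabling incremental imports.
CREATE TABLE IF NOT EXISTS import_state (
  source        TEXT PRIMARY KEY,  -- e.g. 'NVD'
  last_modified TEXT NOT NULL,     -- newest record timestamp seen so far
  last_run_at   TEXT NOT NULL
);

-- Imported records are upserted within the source's own namespace.
INSERT INTO vuln (id, source, description, severity, modified_at)
VALUES ('CVE-2024-12345', 'NVD', '...', 'HIGH', '2024-06-01T00:00:00Z')
ON CONFLICT (id, source) DO UPDATE SET
  description = excluded.description,
  severity    = excluded.severity,
  modified_at = excluded.modified_at;
```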
Once the databases for all (enabled) sources are populated, they are merged into a single database.
This is easy to do because multiple SQLite databases can be attached, which enables cross-database joins.
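A sketch of what that merge step could look like, assuming per-source files named `nvd.sqlite`, `github.sqlite`, and `osv.sqlite`, each containing the hypothetical `vuln` table:

```sql
-- Attach the per-source databases to an (initially empty) merged database.
ATTACH DATABASE 'nvd.sqlite'    AS nvd;
ATTACH DATABASE 'github.sqlite' AS github;
ATTACH DATABASE 'osv.sqlite'    AS osv;

-- Copy everything over; the (id, source) primary key keeps namespaces apart.
INSERT INTO main.vuln SELECT * FROM nvd.vuln;
INSERT INTO main.vuln SELECT * FROM github.vuln;
INSERT INTO main.vuln SELECT * FROM osv.vuln;

-- Cross-database joins work too, e.g. to find records NVD has but OSV doesn't:
SELECT n.id
  FROM nvd.vuln AS n
  LEFT JOIN osv.vuln AS o ON o.id = n.id
 WHERE o.id IS NULL;
```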
The merged database is used to cherry-pick / procure data. Anything from hardcoded preferences to automated, data-driven decisions is possible here. If analytical queries are needed to gain a better understanding of the data (e.g. which source provides the best or most complete data set), we can leverage DuckDB.
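For example, a simple, hardcoded source preference could be expressed directly in SQL against the merged database (again using the hypothetical `vuln` table; real procurement logic would likely be more nuanced):

```sql
-- For each vulnerability ID, keep the record from the most-preferred source that has one.
WITH ranked AS (
  SELECT v.*,
         ROW_NUMBER() OVER (
           PARTITION BY v.id
           ORDER BY CASE v.source
                      WHEN 'NVD'    THEN 1
                      WHEN 'GITHUB' THEN 2
                      WHEN 'OSV'    THEN 3
                      ELSE 99
                    END
         ) AS rn
    FROM vuln AS v
)
SELECT * FROM ranked WHERE rn = 1;
```

For heavier analytics, DuckDB can read SQLite files through its SQLite extension, so the same merged database could be inspected without an extra export step.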
The procured data is written to a new SQLite database. At this point, it can be further enriched with dynamic data such as EPSS and KEV. This database can be distributed as-is, or broken down further into smaller databases based on the update timestamps of records. It can also be exported into other formats to serve more use cases.
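A sketch of how the volatile enrichment data could sit alongside the procured records (table names hypothetical; in the procured database a vulnerability is assumed to be identified by its ID alone):

```sql
-- EPSS scores and KEV membership change on their own schedule,
-- so they live in separate tables that can be refreshed independently.
CREATE TABLE epss (
  vuln_id    TEXT PRIMARY KEY,  -- e.g. 'CVE-2024-12345'
  score      REAL NOT NULL,
  percentile REAL NOT NULL
);

CREATE TABLE kev (
  vuln_id  TEXT PRIMARY KEY,
  added_at TEXT NOT NULL        -- date the CVE was added to CISA KEV
);

-- Splitting the distribution by record age keeps incremental downloads small, e.g.:
-- SELECT * FROM vuln WHERE modified_at >= '2025-01-01T00:00:00Z';
```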
Trying my hand at building a PoC for this here: https://github.com/nscuro/vuln-db/tree/main
The database schema for imports is here: https://github.com/nscuro/vuln-db/blob/main/src/main/resources/schema.sql
This sounds remarkably similar to what https://github.com/savoirfairelinux/vulnscout, https://github.com/anchore/grype-db and https://github.com/guacsec/guac are doing. Please check whether you can integrate with or improve upon their work before reinventing the wheel. Thank you for your work.
Also, adding support for CISA ADP / Vulnrichment to the list of goals would be much appreciated.
Please keep in mind that OSV covers multiple input databases, and since 2025-10-01 uses complete namespace separation (à la DEBIAN-CVE-(.*) or CURL-CVE-\1, with CVE-\1 for the source record) to separate vulnerability scores and other source-related information. See https://github.com/google/osv.dev/issues/2465 for details. Since the OSV folks have opted against integrating that complexity into single data records, either this database takes on that task, or everyone using DT would need to solve the problem themselves.
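If this database does take that task on, one purely illustrative way to handle it during procurement would be an alias mapping (hypothetical `vuln_alias` table, populated from the sources' own alias / upstream fields) that folds the namespaced OSV records back onto their upstream CVE:

```sql
-- Hypothetical alias table; one row per (namespaced record, upstream identifier) pair.
CREATE TABLE vuln_alias (
  vuln_id  TEXT NOT NULL,  -- e.g. 'DEBIAN-CVE-2024-12345'
  alias_id TEXT NOT NULL,  -- e.g. 'CVE-2024-12345'
  PRIMARY KEY (vuln_id, alias_id)
);

-- All records describing the same upstream CVE, across namespaces and sources:
SELECT v.*
  FROM vuln AS v
  JOIN vuln_alias AS a ON a.vuln_id = v.id
 WHERE a.alias_id = 'CVE-2024-12345';
```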