vulnerablecode Include commits and patches that fix a vulnerability

The commit that fixed the vulnerability should also be included in the information provided. Anything that can lead to a diff is valuable. This includes links to commits, pull requests and issues.

As suggested by @pombredanne we can use the specification described here, which supports referencing locations in Git, Mercurial, Subversion and Bazaar. A new field named vcs_url can be included for each vulnerability.

The following are some examples of links found on NVD, usually reported with the Patch tag:

Commits: lead to a diff

https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=3890db36665dbff4c415b0b0dc5c8d53b2850870
https://github.com/python/cpython/commit/fbf648ebba32bbc5aa571a4b09e2062a65fd2492
https://www.mercurial-scm.org/repo/hg/rev/1acfc35d478c

Pull requests: lead to a merge commit, then a diff

https://github.com/mumble-voip/mumble/pull/4032

Issues: lead to a PR, then a merge commit, then a diff

https://github.com/proftpd/proftpd/issues/861

Others: extract the diff if present

http://subversion.apache.org/security/CVE-2018-11782-advisory.txt
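
As a rough illustration of the grouping above, here is a minimal sketch that sorts a reference URL into "commit", "pull_request", "issue" or "other" based on its shape alone. The function name classify_reference and the category labels are hypothetical, and the patterns only cover the URL layouts shown in the examples; other hosts (GitLab, Bugzilla, JIRA and so on) would need their own rules.

```python
import re
from urllib.parse import parse_qs, urlparse


def classify_reference(url: str) -> str:
    """Guess what kind of reference a URL is, using only its path and query.

    Hypothetical helper: the patterns cover the example links in this thread.
    """
    parsed = urlparse(url)
    path = parsed.path.rstrip("/")
    query = parse_qs(parsed.query)

    # GitHub-style /commit/<sha> and Mercurial-style /rev/<id>
    if re.search(r"/(commit|rev)/[0-9a-f]{7,40}$", path):
        return "commit"
    # cgit-style /commit/?id=<sha>, as used on git.kernel.org
    if path.endswith("/commit") and "id" in query:
        return "commit"
    if re.search(r"/pull/\d+$", path):
        return "pull_request"
    if re.search(r"/issues/\d+$", path):
        return "issue"
    # Advisories, mailing list posts and anything else unrecognized
    return "other"
```

Anything classified as "other" would still need a separate step to look for an embedded diff.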

Sources of commit links

NVD JSON Feed
https://github.com/google/vulncode-db
https://github.com/SAP/project-kb

Jun 20 '20 08:06 elanzini

Thanks for linking project "KB", I guess we should talk (again :-) ) soon and present a demo of our respective work.

Jun 26 '20 14:06 copernico

@elanzini are you sure https://github.com/google/vulncode-db has commit links other than the ones provided by NVD? I've looked at a couple of entries at vulncode-db and all seem to have the same data as provided by NVD.

FYI I am working on importing project KB. Eventually we want to tag references as you suggested.

Sep 24 '20 09:09 sbs2001

Re: importing from project KB: the kaybee tool can be easily configured to export to whatever (textual) format, and I can assist with that. Also note that the idea of project KB is that there exist an arbitrary number of repositories that share vulnerability information, and not a central repository: instead of replicating the logic of selecting sources and aggregating them, you could consider using kaybee itself.

Heads-up: in the coming days we will release a few hundred vulnerability statements (700 or more); we are currently making a quality-assurance check on the vulnerability data we have (1600+ vulnerabilities at this time).

Sep 24 '20 14:09 copernico

@sbs2001 That only holds for a handful of cases where they were manually curated. Most importantly, the biggest problem is to find links that lead to diffs, which indicate what was changed to fix the vulnerability.

Most of those links are GitHub commits, issues and PRs, but you also have to take into account GitLab, Bugzilla, JIRA tickets, SVN and a bunch of others (e.g. mailing lists). The landscape is quite fragmented on this front. I am currently working on addressing this problem, trying to extract as many diffs and patches as possible from the links that are gathered.

Are you planning to store just the link to the patches or the diff information as well? (e.g. filename, line numbers)

Sep 24 '20 15:09 elanzini

@copernico

Heads-up: in the coming days we will release a few hundred vulnerability statements (700 or more); we are currently making a quality-assurance check on the vulnerability data we have (1600+ vulnerabilities at this time).

That's awesome you guys rock, can't wait.

As for

the kaybee tool can be easily configured to export to whatever (textual) format, and I can assist with that. Also note that the idea of project KB is that there exist an arbitrary number of repositories that share vulnerability information, and not a central repository: instead of replicating the logic of selecting sources and aggregating them, you could consider using kaybee itself.

I never thought of it this way. Thinking about the kaybee tool, I see it as very valuable for VulnerableCode and a perfect tool to aid in https://github.com/nexB/vulnerablecode/issues/232 since we eventually (very soon) want to share a knowledge base.

I definitely need to learn more about kaybee but https://sap.github.io/project-kb/ doesn't have much there.

Sep 24 '20 15:09 sbs2001

@elanzini

Are you planning to store just the link to the patches or the diff information as well?

Atm just the links. IMHO vulncode-db does a great job at showing the diff information when it is fed gh commit links.

I am currently working on addressing this problem, trying to extract as many diffs and patches from the links that are gathered.

That's interesting, is there a repo I can check ? Where do you get these links from ?

Sep 24 '20 15:09 sbs2001

... but https://sap.github.io/project-kb/ doesn't have much there.

True (we have a tool that is not very useful without some data, indeed ;-) ) We were supposed to publish the first batch of statements (that's how in project KB we call the files that contain data about a vulnerability) this week; we could still make it, but not sure because we are conducting an extra round of QA to be sure we publish high quality information (prioritizing repositories that are popular). This is taking a bit longer than planned, but worst case it will be early next week, stay tuned ;-)

Sep 24 '20 16:09 copernico

@sbs2001

Atm just the links. IMHO vulncode-db does a great job at showing the diff information when it is fed gh commit links.

It's not really about showing the diffs but gathering diff information, not only from gh commits, so that they can be used for research and to pinpoint vulnerabilities at a more fine-grained detail. So, ideally, once you gather patches (that look like this) you can show them and use it in other useful ways.

is there a repo I can check ?

This is the repo but the core of the logic regarding the handling of links and extraction of diffs is done here.

Where do you get these links from ?

This is a list of the sources of information I am pulling from. I am also waiting on vulnerablecode to be deployed so I can include it as a source 🚀

Sep 24 '20 17:09 elanzini

Here is the design I suggest:

Issues and PRs should be treated as vulnerability references. They may lead to a commit, but the way to get there is not structured or explicit; they are still references for our purposes and are stored in VulnerabilityReference.
The commit(s) that fix a vulnerability should be tracked in their own field in PackageRelatedVulnerability. Since there can be more than one commit, let's use for now a text field with one commit per line. Each commit will be encoded as a VCS URL, ordered from the oldest to the newest commit.

In plain English this means that one or more commits are fixing a vulnerability resolved in a certain package version.

Name-wise the field could be named patched_by (kudos to @sbs2001 for this great name) and its description be:

Optional VCS URL(s) for the commits that patch this vulnerability. The VCS URL syntax is specified by the SPDX specification 2.1, section "package download location field". There is one URL per line, ordered from the oldest to the newest commit or revision. These commits must be included in the code of the referenced package version.
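
To make the field format concrete, here is a minimal sketch of reading such a patched_by text field. It assumes the SPDX VCS URL shape `<vcs_tool>+<transport>://<host>[/<path>][@<revision>][#<sub_path>]`; the names VcsUrl, parse_vcs_url and parse_patched_by are hypothetical, and this is not a full SPDX parser (for instance, it does not handle user@host forms).

```python
from typing import List, NamedTuple, Optional

# VCS tools named earlier in this thread: Git, Mercurial, Subversion, Bazaar
SUPPORTED_TOOLS = {"git", "hg", "svn", "bzr"}


class VcsUrl(NamedTuple):
    tool: str
    repo_url: str
    revision: Optional[str]
    sub_path: Optional[str]


def parse_vcs_url(url: str) -> VcsUrl:
    """Split one VCS URL into tool, repository URL, revision and sub-path."""
    tool, plus, rest = url.partition("+")
    if not plus or tool not in SUPPORTED_TOOLS:
        raise ValueError(f"missing or unsupported VCS tool prefix: {url!r}")
    rest, _, sub_path = rest.partition("#")
    scheme, sep, location = rest.partition("://")
    if not sep:
        raise ValueError(f"missing transport in VCS URL: {url!r}")
    host, slash, path = location.partition("/")
    # The revision, if any, follows the last "@" in the path part
    path, at, revision = path.rpartition("@")
    if not at:
        path, revision = revision, ""
    repo_url = f"{scheme}://{host}{slash}{path}"
    return VcsUrl(tool, repo_url, revision or None, sub_path or None)


def parse_patched_by(text: str) -> List[VcsUrl]:
    """Parse a patched_by value: one VCS URL per line, oldest commit first."""
    return [parse_vcs_url(line.strip()) for line in text.splitlines() if line.strip()]
```

For example, `parse_patched_by("git+https://github.com/python/cpython.git@fbf648ebba32bbc5aa571a4b09e2062a65fd2492")` yields one entry whose tool is "git" and whose revision is the commit id. That VCS URL is constructed here from one of the commit links above, purely for illustration.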

Feb 22 '21 11:02 pombredanne

Here is an updated take on the design and implementation, which is going to happen anytime now!

A design for fix commits and fix patch tracking to support vulnerable code reachability analysis

Context

When I use a package version affected by a vulnerability, a key question is:

Is my code really affected?

This is important to triage the volume of vulnerabilities and apply scarce resources to remediate the most critical issues first.

To answer this question, "reachability analysis" can help by determining if the vulnerable code is reachable or reached when used in my product or app.

To analyze my code for vulnerable code reachability, I first need to know what is the vulnerable code.

Problem

This vulnerable code is a subset of the package's code content that we track, such as diffs or commits. This applies both to the code that introduced the security bug and to the code that fixed it.

In practice, the code that introduced the security bug is rarely documented or available from our data sources. We more often have access to the commits and patches that fix the bug.

Since fixing the bug essentially patches the buggy code, knowing the bug-fixing code amounts to knowing the code that introduced the bug too, in most cases. There are some exceptions, where a fix may be a workaround to mitigate a bug without actually fixing the code proper.

So to recap: reachability is important to triage vulnerabilities in their usage context. We need to collect and store the code that fixes a bug to do reachability analysis. We also need to support the review and validation of this code, and in many cases the buggy code can only be discovered accurately by an expert human review.

Dec 20 '24 10:12 pombredanne

Solution: How to store fix commits?

We need to create new data models to efficiently store the security bug fixes. These can be:

1. A diff between the last affected version and the first fixing version (in a very coarse way), or
2. One or more commits (i.e., essentially a patch), or
3. A patch or diff that may not align exactly with commits, or
4. A list of affected symbols and identifiers.

In practice, a full diff between two versions is too big to be practically usable, so we will focus on commits, patches and symbols.

We may also need to track specific instructions and remediation guidance, but this is something for the future.

We could track also the code that introduced a vulnerability, but this is rarely available directly, and is the complement of a fixing patch. We can abstract a common model in a later step.

We could have a base CodeChange abstract model that would look something like this:

(we can use a JSON field for the list fields)

commits: an optional field listing VCS URL with a commit id
pulls: an optional field listing PR URLs (though this should be best resolved to many commits above)
downloads: an optional field listing download URLs to the patched code
patch: an optional text blob field with the code change patch, in a format TBD (unified diff or git diff)
notes: optional notes, description and instructions
references: an optional field listing reference URLs for this patched code, like articles, etc.
status info: some indication of the review status of the patch. For now this could be a "reviewed" boolean field
creation/modified and other standard log fields

We may also need:

base version: an optional field with the base package version that the fix or bug codechange applies to
base commit: an optional field with the base package VCS URL with a commit that the fix or bug codechange applies to (without this a patch would not be applicable)

... though we may also track this using relationships instead.

Then we would have two concrete subclasses:

CodeBug that introduces a bug
CodeFix that fixes a bug
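
The fields and subclasses listed above could be sketched roughly as follows. This uses plain dataclasses rather than Django models so that it runs on its own; the attribute names follow the list above, while is_reviewed, base_version, base_commit and the timestamp names are assumptions about naming, not the project's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class CodeChange:
    """Base record for a code change; meant to be subclassed, not used directly."""

    commits: List[str] = field(default_factory=list)     # VCS URLs, each with a commit id
    pulls: List[str] = field(default_factory=list)       # PR URLs
    downloads: List[str] = field(default_factory=list)   # download URLs of the patched code
    patch: Optional[str] = None                          # patch text, format still to be decided
    notes: Optional[str] = None
    references: List[str] = field(default_factory=list)  # articles and other reference URLs
    is_reviewed: bool = False                            # review status (assumed field name)
    base_version: Optional[str] = None                   # version the change applies to
    base_commit: Optional[str] = None                    # VCS URL of the commit the change applies to
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)

    def __post_init__(self) -> None:
        # Emulate an abstract model: only the concrete subclasses may be created.
        if type(self) is CodeChange:
            raise TypeError("CodeChange is abstract; use CodeBug or CodeFix")


@dataclass
class CodeBug(CodeChange):
    """A code change that introduces a bug."""


@dataclass
class CodeFix(CodeChange):
    """A code change that fixes a bug."""
```

In a Django implementation, the list fields would map to JSON fields as suggested above, and the abstract base would typically be declared with `abstract = True` in the model's Meta class.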

The CodeBug and CodeFix would then be related to package and vulnerabilities:

Since a fix may exist before a fixing version is released, we cannot relate the fix to a fixing version
We should relate a fix to the base affected version it applies to

An affected version is also related to vulnerability (and soon advisory)

This also means that a CodeChange could be related to many affected versions, though in practice this may end up being different patches.

An important consideration is: how do we get to a CodeFix?

I use a package version in my app that is found to have a vulnerability. This affected version may not be the first affected version; it could be any affected version.

If we attach the CodeFix to the first affected version, it requires an extra query to navigate to the version that has a CodeFix attached. An alternative design would store the CodeFix relation with every affected version it applies to, and also track the base version it is based on.

Let's use an example:

CVE-123 affects these packages: [email protected] and is fixed in [email protected] and also affects all intermediate versions: 9.1, 9.2, 9.3, 9.4

We have a patch and commit(s) that were applied somewhere between 9.4 and 9.5

Therefore the CodeFix

has a base version of 9.4
has a base commit, which is the commit immediately before the fix was committed
is for 9.0, 9.1, 9.2, 9.3, 9.4
may not really work well with versions prior to 9.4: e.g., 9.0, 9.1, 9.2, 9.3

We could have multiple CodeFix for older versions 9.0, 9.1, 9.2, 9.3, but in the most common case, patching would require an update to 9.4 then applying the patch, or an update to 9.5

For now, we are making the assumption that the CodeFix relates to all affected versions. This is not exact, but it is a tolerable approximation, as we care about the segment of code that fixes the bug for analysis purposes. We do not apply the patch, and if we do, we still have all the correct details.

So the design question is then to relate the CodeFix:

- A. only to 9.4
- B. only to 9.0
- C. to 9.0, 9.1, 9.2, 9.3, 9.4

For popular vulnerabilities in Django, Log4J or OpenSSL, approach C would create a lot of relationships that may be of dubious value. The CodeFix is less likely to apply strictly to 9.0, 9.1, 9.2 and 9.3, and the likelihood that the code paths and symbols exist is much lower in 9.0 than in 9.3. That is, older versions are less likely to have buggy code that would be directly patched by the CodeFix, because refactoring, renaming and so on could have occurred between 9.0 and 9.4.

Therefore I suggest we only track the base version of a CodeFix for now. This base version should (always?) be the version immediately before the fixing version.
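
Here is a small sketch of what tracking only the base version implies for lookups, using the 9.0 to 9.5 example. The helper names and the dictionary keyed by (vulnerability, base version) are hypothetical, and versions are plain strings in a list that is assumed to be sorted already; real version ordering would need a proper version comparison.

```python
from typing import Dict, List, Optional, Tuple


def base_version_of_fix(released_versions: List[str], fixing_version: str) -> str:
    """Return the release immediately before the fixing version.

    released_versions must already be sorted from oldest to newest.
    """
    position = released_versions.index(fixing_version)
    if position == 0:
        raise ValueError("no release precedes the fixing version")
    return released_versions[position - 1]


def code_fix_for(
    vulnerability_id: str,
    used_version: str,
    affected_versions: List[str],
    base_version: str,
    code_fixes: Dict[Tuple[str, str], str],
) -> Optional[str]:
    """Find the CodeFix for a version in use, going through the base version."""
    if used_version not in affected_versions:
        return None
    # Extra hop: the fix is stored once, under its base version only.
    return code_fixes.get((vulnerability_id, base_version))


released = ["9.0", "9.1", "9.2", "9.3", "9.4", "9.5"]
affected = released[:-1]                     # 9.0 through 9.4
base = base_version_of_fix(released, "9.5")  # the version just before the fix
code_fixes = {("CVE-123", base): "hypothetical CodeFix record"}
```

A user on 9.1 still reaches the fix, at the cost of one extra lookup through the base version, and only one relationship is stored instead of five.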

Dec 20 '24 10:12 pombredanne

Another take on the design and how to relate a "CodeFix" to affected and fixing packages:

We have a vulnerable Package version PV1 and a Vulnerability CVE1 that affects this Package.

We store this relationship in the model:
  AffectedByPackageRelatedVulnerability (This is really a vulnerable package)

We also have a Package version PV2 that has a fix applied and that is no longer vulnerable to this Vulnerability CVE1.

We store this relationship in the model:
  FixingPackageRelatedVulnerability (This is really a fixed package)

Somewhere between PV1 and PV2, there is a patch/commit code change that was applied and that is fixing the CVE1 bug.
This is CodeFix1.

The CodeFix model tracks the code change details.

Question: how do we relate a CodeFix to CVE1/PV1/PV2?

In our approach, a CodeFix explains how to go from a vulnerable package affected by CVE1 to a fixed package no longer affected by CVE1



The timeline of the Package with its versions and commits looks like this:


Commits          :   c0     c1      c2     c3       c4      c5       c6        c7     
Versions         :          PV0            PV1                                 PV2
Affected Versions:                         PV1
Fixing Versions  :                                                             PV2
Codefix          :                                          CodeFix1


So, CodeFix1 can exist before PV2, and therefore we may not be able to relate it to PV2 at all times.

Question: Can we have CodeFix1 without knowing that the affected PV1 exists?
Answer: No, in this case we cannot create a CodeFix just yet.
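
The constraint discussed here (an affected package relation is required, a fixing package relation may come later) could be sketched like this. The two relation classes are simplified stand-ins that only borrow the model names mentioned above; their fields, and the check that both relations point to the same vulnerability, are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class AffectedByPackageRelatedVulnerability:
    package: str        # e.g. "PV1"
    vulnerability: str  # e.g. "CVE1"


@dataclass(frozen=True)
class FixingPackageRelatedVulnerability:
    package: str        # e.g. "PV2"
    vulnerability: str


@dataclass
class CodeFix:
    # Required: we cannot create a CodeFix without a known affected package.
    affected: AffectedByPackageRelatedVulnerability
    commits: List[str] = field(default_factory=list)
    # Optional: the fix may exist before any fixed version is released.
    fixing: Optional[FixingPackageRelatedVulnerability] = None

    def __post_init__(self) -> None:
        if self.affected is None:
            raise ValueError("a CodeFix needs an affected package relation")
        if self.fixing is not None and self.fixing.vulnerability != self.affected.vulnerability:
            raise ValueError("affected and fixing relations must share a vulnerability")


# CodeFix1 is known at commit c5, before PV2 is released
code_fix_1 = CodeFix(
    affected=AffectedByPackageRelatedVulnerability("PV1", "CVE1"),
    commits=["c5"],
)
# Once PV2 is out, the fixing relation can be attached
code_fix_1.fixing = FixingPackageRelatedVulnerability("PV2", "CVE1")
```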

Dec 27 '24 11:12 pombredanne

Related issues:

https://github.com/aboutcode-org/vulnerablecode/issues/1699

Jan 07 '25 11:01 pombredanne

[x] Add API for codeFixes and show codefixes for vulnerabilities affecting a package on /api/v2/package endpoint

Jan 07 '25 14:01 TG1999

This is done now.

We have completed this issue in https://github.com/aboutcode-org/vulnerablecode/pull/1704. We added a model in models.py (https://github.com/aboutcode-org/vulnerablecode/pull/1704/files#diff-7f9f2c92e7163b06d21fa139369d75caed6561d0368b60ea2516cece0220eb5b) to track fix commits for a vulnerability.

To test this, set up VulnerableCode locally by following the README. After setting up VCIO, run any importer, for example ./manage.py import npm_importer, then run ./manage.py improve collect_fix_commits. Finally, start the server with make run and go to /api/v2/packages. You can find the list of "code_fixes" for a package under "affected_by_vulnerabilities", as in the example below.

For example: https://public.vulnerablecode.io/api/v2/packages?purl=pkg:alpm/archlinux/[email protected]

"packages": [
            {
                "purl": "pkg:alpm/archlinux/[email protected]",
                "affected_by_vulnerabilities": {
                    "VCID-ask5-nj67-aaaa": {
                        "vulnerability_id": "VCID-ask5-nj67-aaaa",
                        "fixed_by_packages": "pkg:alpm/archlinux/[email protected]",
                        "code_fixes": [
                            "http://public.vulnerablecode.io/api/v2/codefixes/1",
                            "http://public.vulnerablecode.io/api/v2/codefixes/2"
                        ]
                    }
                },
                "fixing_vulnerabilities": [],
                "next_non_vulnerable_version": "2.0.2-1",
                "latest_non_vulnerable_version": "2.0.7-1",
                "risk_score": 3.1
            }
        ]

The codefix structure, for example at http://public.vulnerablecode.io/api/v2/codefixes/1, looks like this:

{
    "id": 2,
    "commits": [
        "https://github.com/389ds/389-ds-base/commit/cc0f69283abc082488824702dae485b8eae938bc"
    ],
    "pulls": [],
    "downloads": [],
    "patch": null,
    "affected_vulnerability_id": "VCID-ask5-nj67-aaaa",
    "affected_package_purl": "pkg:alpm/archlinux/[email protected]",
    "notes": null,
    "references": [],
    "is_reviewed": false,
    "created_at": "2025-01-16T14:22:45Z",
    "updated_at": "2025-01-16T14:22:45Z"
}
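
For consumers of these two payloads, a minimal sketch of pulling the fix commits out of already-decoded JSON might look like this. The function names are hypothetical, fetching the URLs over HTTP is deliberately left out, and the sample data mirrors the responses above with a placeholder purl.

```python
from typing import Any, Dict, List


def code_fix_urls(package: Dict[str, Any]) -> List[str]:
    """Collect the code fix URLs listed under each vulnerability of a package entry."""
    urls: List[str] = []
    for vulnerability in package.get("affected_by_vulnerabilities", {}).values():
        urls.extend(vulnerability.get("code_fixes", []))
    return urls


def fix_commits(code_fixes: List[Dict[str, Any]]) -> List[str]:
    """Collect unique commit URLs from codefix payloads, keeping first-seen order."""
    commits: List[str] = []
    for code_fix in code_fixes:
        for commit in code_fix.get("commits") or []:
            if commit not in commits:
                commits.append(commit)
    return commits


# Sample data shaped like the responses above; the purl is a placeholder.
sample_package = {
    "purl": "pkg:generic/example@1.0",
    "affected_by_vulnerabilities": {
        "VCID-ask5-nj67-aaaa": {
            "vulnerability_id": "VCID-ask5-nj67-aaaa",
            "code_fixes": [
                "http://public.vulnerablecode.io/api/v2/codefixes/1",
                "http://public.vulnerablecode.io/api/v2/codefixes/2",
            ],
        }
    },
}
sample_code_fix = {
    "id": 2,
    "commits": [
        "https://github.com/389ds/389-ds-base/commit/cc0f69283abc082488824702dae485b8eae938bc"
    ],
    "patch": None,
    "is_reviewed": False,
}
```

A client would call code_fix_urls on each entry of "packages", fetch every returned URL, and pass the decoded responses to fix_commits.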

Jan 10 '25 11:01 TG1999