cve-bin-tool

Improve product vendor matching for component list scanning

Open anthonyharrison opened this issue 3 years ago • 3 comments

Each of the checkers identifies a product/vendor pair to be used if a particular component is detected in a binary file. This allows, for instance, an item detected as either libc or libc6 to be mapped to the glibc product.

However, if a component list is used (e.g. from an SBOM or a Linux distro), the product name searched for will be libc or libc6. As neither is found in the database, no vulnerabilities will be reported.

One option would be to try several approaches to determine whether there is a potential match, although there is a risk of an increase in false positives. Approaches to try could include the use of a wildcard (e.g. search for product like "%libc%" in a query), searching for both A-B and A_B, always searching for lowercase names, etc.
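To make the trade-off concrete, here is a minimal sketch of those heuristics. It assumes a hypothetical `products(vendor, product)` table rather than cve-bin-tool's real database schema, and the function names `candidate_names` and `find_pairs` are made up for illustration.

```python
import sqlite3


def candidate_names(name):
    """Return lowercase spelling variants of a component name (A-B and A_B)."""
    base = name.strip().lower()
    variants = [base, base.replace("-", "_"), base.replace("_", "-")]
    return list(dict.fromkeys(variants))  # drop duplicates, keep order


def find_pairs(conn, name, wildcard=False):
    """Return (vendor, product) rows matching any variant of `name`."""
    pairs = []
    for candidate in candidate_names(name):
        if wildcard:
            # Escape LIKE metacharacters so "_" and "%" in names match literally.
            escaped = (
                candidate.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
            )
            sql = (
                "SELECT vendor, product FROM products "
                "WHERE product LIKE ? ESCAPE '\\' ORDER BY vendor, product"
            )
            rows = conn.execute(sql, (f"%{escaped}%",))
        else:
            sql = (
                "SELECT vendor, product FROM products "
                "WHERE product = ? ORDER BY vendor, product"
            )
            rows = conn.execute(sql, (candidate,))
        for row in rows:
            if row not in pairs:
                pairs.append(row)
    return pairs


# Hypothetical table and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (vendor TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("gnu", "glibc"), ("haxx", "libcurl"), ("apache", "commons_io")],
)

print(find_pairs(conn, "Commons-IO"))           # exact match on a variant
print(find_pairs(conn, "libc"))                 # exact match finds nothing
print(find_pairs(conn, "libc", wildcard=True))  # wildcard finds more
```

With this sample data the wildcard search for libc returns glibc but also libcurl, which is the kind of false positive the wildcard approach risks.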

anthonyharrison • Jan 05 '22 22:01

I like the heuristic approach. I'd like to suggest that we make sure to log the mappings in a way that users can tell that we "guessed" and give them some way to fine-tune the guesses and save the data so they don't have to re-run the heuristic on subsequent runs.

I think we talked about this with the known package lists, but it turned out the mappings were obvious often enough that it wasn't absolutely needed to get the scripts working. That said, I still think it's quite reasonable for cve-bin-tool to maintain some lookup tables to improve mappings as we continue to improve our detection capabilities. I think a lookup table to supplement improved heuristics has a few advantages:

  1. Make it possible to map multiple {vendor, product} pairs to a known product string (e.g. I think kerberos has multiple match options).
  2. Possibly speed up runs where the heuristic would be running frequently against similar components (e.g. libc6 on full container scans)
  3. Help us build data about what mappings look like so we can fine tune the heuristic.
  4. Help us build data about mappings as a community service.
  5. Allow us to do fancy "if you got this string from pypi it means one thing but in a .jar file it means something else" matching if we wanted.

We'd want to make it easy for users to contribute knowledge back to us -- maybe prompt them to open a github issue with the data, with the carrot that updating the mappings would make the warnings go away in future versions? (Or even a direct pull request, but I think issues are probably easier.)

I'm not sure about the best format here. For the running of the tool, we'd probably want to use a sqlite db the way we do with nvd, but for pull requests and analysis I think we might want something more text-based and diff-able to improve pull requests and make it easier for people to view the data directly. JSON maybe? And then have the tool consume it into sqlite and update if a new json is provided? Something else?
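As a rough sketch of the "JSON consumed into sqlite" idea: the JSON layout, the table names (`mappings`, `mapping_meta`) and the function `load_mappings` below are all assumptions for illustration, not an existing cve-bin-tool format. The loader stores a digest of the JSON so the table is only rebuilt when a new file is provided.

```python
import hashlib
import json
import sqlite3

# Hypothetical diff-able mapping file: one object per known mapping.
MAPPINGS_JSON = """
[
  {"name": "libc", "vendor": "gnu", "product": "glibc"},
  {"name": "libc6", "vendor": "gnu", "product": "glibc"}
]
"""


def load_mappings(conn, text):
    """Load the JSON mappings into sqlite; return False if already up to date."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mappings ("
        "name TEXT, vendor TEXT, product TEXT, "
        "PRIMARY KEY (name, vendor, product))"
    )
    conn.execute("CREATE TABLE IF NOT EXISTS mapping_meta (digest TEXT)")
    row = conn.execute("SELECT digest FROM mapping_meta").fetchone()
    if row is not None and row[0] == digest:
        return False  # same JSON as last time, nothing to do
    entries = json.loads(text)
    with conn:  # one transaction: replace the old mappings wholesale
        conn.execute("DELETE FROM mappings")
        conn.execute("DELETE FROM mapping_meta")
        conn.executemany(
            "INSERT INTO mappings VALUES (:name, :vendor, :product)", entries
        )
        conn.execute("INSERT INTO mapping_meta VALUES (?)", (digest,))
    return True


conn = sqlite3.connect(":memory:")
print(load_mappings(conn, MAPPINGS_JSON))  # first load populates the table
print(load_mappings(conn, MAPPINGS_JSON))  # unchanged JSON is skipped
```

Keeping the source of truth in JSON means a contributed mapping shows up as a one-line diff in a pull request, while lookups at scan time still go through sqlite.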

We should probably spend a bit of time figuring out how the mappings are likely to work and come up with a reasonable data structure.

At a guess, we'll have at least two types of mapping:

  1. [common guess string] -> [NVD {vendor, product} pair] -- These would be mostly 1:1 but potentially also n:n where you could have multiple strings map to multiple NVD pairs. (e.g. kerberos and krb5 -> {mit, kerberos} and {mit, kerberos_5} as we currently see in the checker.)
  2. [common guess string] + [metadata] -> [NVD {vendor, product} pair] -- where the metadata would probably be something about where we found the string, so 'cryptography' in a python requirements.txt wouldn't have to map to the same thing as 'cryptography' when found in a java .jar file.
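One possible shape for these two mapping types, as a sketch only: a table keyed by (guess string, source), where a source of `None` means "applies wherever the string was found". The `lookup` function, the source labels (`pypi`, `jar`) and the two `placeholder_*` vendor names are invented for illustration; only the kerberos pairs come from the example above.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # (vendor, product)

MAPPINGS: Dict[Tuple[str, Optional[str]], List[Pair]] = {
    # Type 1: n:n mappings that do not depend on where the string was found.
    ("kerberos", None): [("mit", "kerberos"), ("mit", "kerberos_5")],
    ("krb5", None): [("mit", "kerberos"), ("mit", "kerberos_5")],
    # Type 2: the same string maps differently depending on its source.
    # The vendor names here are placeholders, not real NVD vendors.
    ("cryptography", "pypi"): [("placeholder_python_vendor", "cryptography")],
    ("cryptography", "jar"): [("placeholder_java_vendor", "cryptography")],
}


def lookup(name: str, source: Optional[str] = None) -> List[Pair]:
    """Prefer a source-specific mapping, then fall back to the generic one."""
    key = name.lower()
    if source is not None and (key, source) in MAPPINGS:
        return MAPPINGS[(key, source)]
    return MAPPINGS.get((key, None), [])


print(lookup("krb5"))
print(lookup("cryptography", "pypi"))
print(lookup("cryptography", "jar"))
```

A string with only source-specific entries returns nothing when the source is unknown, which seems like a reasonable default if the goal is to avoid guessing across ecosystems.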

terriko • Jan 06 '22 20:01

Terri

Looks like I have started something that could be a step change in improving the detection capability of the tool.

I agree it needs some more thinking, and some design/architecture work would probably be worthwhile before we launch into implementation. No idea if this would be a suitable GSOC project or not, because I can't work out at this stage how hard it will become.

I was also thinking that when we find multiple vendors for a product we should flag this differently; we currently just put a * to say we guessed the vendor (this should be relatively easy to do). Alternatively we could just provide all product/vendor mappings and find all of the potential vulnerabilities.

I think we can get some automation from the checkers to pre-populate a look up table although it might also be good to allow a user to add new mappings (essentially a form of crowd-sourcing :-)).

Regards

Anthony

anthonyharrison • Jan 06 '22 21:01

Some progress on this, resulting in improved product/vendor matching. (I have only tried this with SBOMs for the time being, to try out some ideas; more thought is needed on how it gets incorporated so that the whole tool benefits.)

  1. If multiple vendors are matched, I have modified the code so that all product/vendor mappings are added to the list of parsed products. I found out that the simple test of choosing the first one in the list was missing a valid product/vendor mapping. The downside of this is that there is potentially some increased false or duplicated reporting but I think that is manageable.

  2. I have used the filename patterns used by the checkers as a way of mapping to correct product names (e.g. a search for libc6, which is not in the NVD, maps to glibc, which is). This improves the hit rate of product/vendor mappings; the only downside is that the original product name is not reported.

  3. I have noticed that a number of product names are typically reported as A-B (e.g. commons-io) when the product name in the NVD is A_B (i.e. commons_io). I have changed the search to look for both names; this has improved the hit rate, particularly for Java-based products.

  4. This seems to apply primarily to Java products, but product names are often reported as a parent package followed by a component name, e.g. jetty-. Modifying the search to remove the component name and just search for the parent package increases the hit rate.

These changes result in many more candidate products having an associated vendor, and hence potential vulnerabilities to report.
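For anyone following along, the four steps can be sketched roughly as below. This is not the code from the actual change: `FILENAME_PATTERNS`, `KNOWN_VENDORS`, `resolve` and the `jetty-server` input are hypothetical, and the sample rows are illustrative rather than checked against the NVD.

```python
import re

# Hypothetical patterns in the spirit of the checkers' filename patterns.
FILENAME_PATTERNS = {
    "glibc": [r"^libc6?$"],
}

# Illustrative product -> vendors sample standing in for the database.
KNOWN_VENDORS = {
    "glibc": ["gnu"],
    "commons_io": ["apache"],
    "jetty": ["eclipse", "mortbay"],
}


def resolve(name):
    """Return every candidate (vendor, product) pair for a component name."""
    base = name.lower()
    candidates = [base]
    # Step 2: map the name to a product via filename patterns.
    for product, patterns in FILENAME_PATTERNS.items():
        if any(re.search(pattern, base) for pattern in patterns):
            candidates.append(product)
    # Step 3: also try A_B when the name is reported as A-B.
    candidates.append(base.replace("-", "_"))
    # Step 4: drop the component name and try the parent package alone.
    if "-" in base:
        candidates.append(base.split("-", 1)[0])
    # Step 1: keep all vendors for each product, not just the first one.
    pairs = []
    for product in candidates:
        for vendor in KNOWN_VENDORS.get(product, []):
            if (vendor, product) not in pairs:
                pairs.append((vendor, product))
    return pairs


print(resolve("libc6"))
print(resolve("commons-io"))
print(resolve("jetty-server"))
```

Note that the returned pairs carry the NVD product name, so the original name (libc6 here) would need to be kept alongside if it is to be reported.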

None of these changes involve any changes to the database structure. However, the filename-to-product mapping performed by the checkers only exists where a checker is available. I think we may need a way of specifying additional filename-to-product mappings (as we discover them), possibly via another configuration file, to allow for user enhancement/control and independence from the availability of a checker.

anthonyharrison • Jan 12 '22 22:01

Mapping the equivalence of artifact identifiers across different naming schemes (cpe, swid, various purl namespaces, etc) will definitely be one of the challenges for reliably matching vulnerabilities to software - especially since the data quality at all sources will inevitably vary, so we'll also need to be able to compensate for badly-tagged data.

I agree we need to crowd-source this data, and ideally share this effort among projects. Other sources like https://repology.org and https://www.aboutcode.org etc might also be valuable input for this.

As a first step, though, I think 'normalizing' the product name by converting to lowercase and dropping any characters like _ and - would already be a nice improvement for cve-bin-tool. My use case here was that I noticed commons-text was detected by syft (in cyclonedx-json mode) as:

    {
      "bom-ref": "pkg:maven/org.apache.commons/commons-text@1.8?package-id=50aab321a9f4b2fa",
      "type": "library",
      "group": "org.apache.commons",
      "name": "commons-text",
      "version": "1.8",
      "cpe": "cpe:2.3:a:apache-software-foundation:commons-text:1.8:*:*:*:*:*:*:*",
      "purl": "pkg:maven/org.apache.commons/commons-text@1.8",
      ...
      "properties": [
         ...
      ]
    }

This seems correct, but it wasn't matched to https://nvd.nist.gov/vuln/detail/CVE-2022-42889 because (for one thing) the CPE there, cpe:2.3:a:apache:commons_text:*:*:*:*:*:*:*:*, uses an underscore.
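A minimal sketch of that normalization, applied to the two CPE strings above. The helper names `normalize` and `vendor_product` are made up, and the CPE split is deliberately naive (it ignores escaped colons).

```python
import re


def normalize(name):
    """Lowercase the name and drop '-' and '_' characters."""
    return re.sub(r"[-_]", "", name.lower())


def vendor_product(cpe):
    """Pull vendor and product out of a CPE 2.3 string (naive split on ':')."""
    parts = cpe.split(":")
    return parts[3], parts[4]


sbom_cpe = "cpe:2.3:a:apache-software-foundation:commons-text:1.8:*:*:*:*:*:*:*"
nvd_cpe = "cpe:2.3:a:apache:commons_text:*:*:*:*:*:*:*:*"

sbom_vendor, sbom_product = vendor_product(sbom_cpe)
nvd_vendor, nvd_product = vendor_product(nvd_cpe)

# Product names agree after normalization; vendor names still do not.
print(normalize(sbom_product) == normalize(nvd_product))
print(normalize(sbom_vendor) == normalize(nvd_vendor))
```

So normalizing fixes the product side of this case, but the vendor (apache-software-foundation vs apache) would still need a mapping or a product-only search.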

raboof • Oct 27 '22 17:10

I think we're taking this in the direction of using PURL (i.e. what's planned in #3771) as our next phase of improving matching. So I'm going to close this issue, but we may want to revisit it later.

terriko • Apr 17 '24 21:04