vulnerablecode Improve VCIO bulk API package lookup performance

From https://github.com/aboutcode-org/dejacode/issues/94#issuecomment-2298445423 by @tdruez

Could you tell me the PURL types from the list that are not supported (no data available) by VCIO? Excluding those will reduce the number of "useless" requests to the API. ['gem', 'autotools', 'sourceforge', 'bitbucket', 'rpm', 'gitlab', 'cran', 'windows-program', 'docker', 'bower', 'nuget', 'generic', 'cargo', 'npm', 'deb', 'golang', 'maven', 'composer', 'pypi', 'hackage', 'unknown', 'rubygems', 'about', 'github']

Well, for example, we have ±300,000 sourceforge PURLs in the nexB Dataspace; looking those up is a total waste of time and resources.

More context: For ±133,000 packages in the nexB Dataspace, it currently takes about 1h and 2,674 HTTP requests made to the VCIO API.

The result is only 1,235 vulnerabilities fetched and created. Seems like there's a lot of wasted time and resources with our current approach.

I suggest these progressive steps:

1. Use a hardcoded list of the distinct PURL types that exist in VCIO.
2. Expose this list of existing PURL types as an endpoint.
3. Expose a new special endpoint that provides a highly compressed data structure that can be downloaded quickly from VCIO and queried to know whether a PURL may exist in VCIO.
   - This could be an automaton (Aho-Corasick or FST), leveraging the fact that many PURLs share a common prefix, or a Bloom filter.
   - It would be best cached for a few hours, and should come with client code that uses it to filter a (long) list of PURLs and remove those that surely do not exist in VCIO.
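
To make the Bloom filter option above more concrete, here is a minimal client-side sketch in plain Python. It is not existing VCIO or DejaCode code: the `BloomFilter` class, the `filter_candidate_purls` helper and the sizing parameters are all hypothetical, and a real endpoint would also need a serialization format for shipping the bit array to clients.

```python
import hashlib
import math


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing formulas for the bit count (m) and hash count (k).
        bits = -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
        self.size = max(8, int(bits))
        self.hash_count = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def filter_candidate_purls(purls, known):
    """Keep only PURLs that may exist upstream; drop those that surely do not."""
    return [purl for purl in purls if purl in known]
```

A client would download the filter once, cache it for a few hours, and run its full PURL list through `filter_candidate_purls` before calling the API. PURLs that pass the filter may still be absent upstream (false positives), so the API lookup remains the source of truth.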

Aug 20 '24 12:08 pombredanne

From https://github.com/aboutcode-org/dejacode/issues/94#issuecomment-2298761954

@pombredanne Thanks, this sounds like it will require some work to make this happen.

In the short term, could VCIO expose a new "action" on the package endpoint to get this list of supported types? (It should be a very small and fast query.) On the DejaCode side, the process could start by fetching the available types, then build a QuerySet limited to those, which would drastically reduce the number of queries.

>>> unique_types = Package.objects.values_list("type", flat=True).distinct()
>>> unique_types
<PackageQuerySet ['about', 'cargo', 'cocoapods', 'composer', 'deb', 'github', ...
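
As a rough illustration of the type-based pre-filtering, the sketch below drops PURLs whose type is not in a supported set before any request is made. The `purl_type` and `filter_by_supported_types` helpers are hypothetical, and the supported set passed in is a placeholder, not the actual list of types available in VCIO.

```python
def purl_type(purl: str) -> str:
    """Return the lowercased type of a PURL such as 'pkg:pypi/django@4.2'."""
    scheme, _, remainder = purl.partition(":")
    if scheme.lower() != "pkg" or not remainder:
        raise ValueError(f"Not a package URL: {purl!r}")
    # Tolerate leading slashes after the scheme, as in 'pkg://type/...'.
    return remainder.lstrip("/").split("/", 1)[0].lower()


def filter_by_supported_types(purls, supported_types):
    """Keep only the PURLs whose type appears in supported_types."""
    supported = {t.lower() for t in supported_types}
    return [p for p in purls if purl_type(p) in supported]
```

On the Django side, the same idea would presumably be a `Package.objects.filter(type__in=supported_types)` QuerySet, with `supported_types` fetched from VCIO once per run.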

Aug 20 '24 12:08 tdruez

Another example that takes over a minute to load: https://public.vulnerablecode.io/api/vulnerabilities?vulnerability_id=VCID-j2zf-12g6-aaag

Aug 23 '24 10:08 tdruez

We need to change what we return as API data entirely, with a new endpoint that does not provide all the package details in a vulnerability. We care about packages first and less about vulnerabilities, so when querying by vulnerability we should not serialize so much package data.

Sep 12 '24 10:09 pombredanne

This is a related issue to restructure the API:

https://github.com/aboutcode-org/vulnerablecode/issues/1572

Sep 12 '24 10:09 TG1999

See a first PR to improve the results:

https://github.com/aboutcode-org/vulnerablecode/pull/1558#issuecomment-2370777141

Sep 24 '24 09:09 pombredanne

Fixed by: https://github.com/aboutcode-org/vulnerablecode/pull/1701

Jan 01 '25 13:01 TG1999

This is done now!

PRs for reference: https://github.com/aboutcode-org/vulnerablecode/pull/1701 https://github.com/aboutcode-org/vulnerablecode/pull/1558

To test this

Go to https://public.vulnerablecode.io/api/packages/bulk_search

and make a request like this:

{
    "purls": ["pkg:ruby/[email protected]"],
    "purl_only": false,
    "plain_purl": false
}
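
For scripted testing, the same request could be sent from Python with only the standard library. This is a sketch under assumptions: that the endpoint accepts a JSON `POST` with the payload shape shown above, that no extra authentication header is needed, and that a batch size of 50 is reasonable (it roughly matches the 133,000 packages over 2,674 requests quoted earlier). The function names are hypothetical.

```python
import json
import urllib.request

BULK_SEARCH_URL = "https://public.vulnerablecode.io/api/packages/bulk_search"


def chunked(items, size):
    """Yield consecutive slices of items, each holding at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_bulk_search_request(purls, purl_only=False, plain_purl=False):
    """Build (but do not send) a bulk_search request for one batch of PURLs."""
    payload = {
        "purls": list(purls),
        "purl_only": purl_only,
        "plain_purl": plain_purl,
    }
    return urllib.request.Request(
        BULK_SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def bulk_search(purls, batch_size=50, **options):
    """Send one request per batch and yield each parsed JSON response."""
    for batch in chunked(list(purls), batch_size):
        request = build_bulk_search_request(batch, **options)
        with urllib.request.urlopen(request, timeout=60) as response:
            yield json.load(response)
```

Calling `list(bulk_search(my_purls, purl_only=True))` would then return one parsed response per batch; the shape of each response is whatever the API returns and is not assumed here.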

Additionally, we have significantly reduced the number of queries, to 60%, from https://github.com/aboutcode-org/vulnerablecode/commit/7fa45cb0d9dc802a6057edfb003a9f85cfed95fb#diff-d3ca0948dc3b5eb0b1adecaa9da9d7854628b0b6bbcf5f515bed6cab4d894339R474 to https://github.com/aboutcode-org/vulnerablecode/commit/9702c60bb4bac2b98dd988a47948408a16b2cff3#diff-d3ca0948dc3b5eb0b1adecaa9da9d7854628b0b6bbcf5f515bed6cab4d894339R472

We also added indexes for the models.

Mar 21 '25 12:03 TG1999