Improve VCIO bulk API package lookup performance
From https://github.com/aboutcode-org/dejacode/issues/94#issuecomment-2298445423 by @tdruez
Could you tell me the PURL types from the list that are not supported (no data available) by VCIO? Excluding those will reduce the number of "useless" requests to the API. ['gem', 'autotools', 'sourceforge', 'bitbucket', 'rpm', 'gitlab', 'cran', 'windows-program', 'docker', 'bower', 'nuget', 'generic', 'cargo', 'npm', 'deb', 'golang', 'maven', 'composer', 'pypi', 'hackage', 'unknown', 'rubygems', 'about', 'github']
Well, for example, we have ±300,000 sourceforge PURLs in the nexB Dataspace; doing lookups for those is a total waste of time and resources.
More context: for ±133,000 packages in the nexB Dataspace, a full lookup currently takes about 1 hour and 2,674 HTTP requests to the VCIO API.
The result is only 1,235 vulnerabilities fetched and created. Seems like there's a lot of wasted time and resources with our current approach.
I suggest these progressive steps:
- use a hardcoded list of distinct existing PURL types in VCIO
- expose this list of existing PURL types as an endpoint
- expose a new special endpoint that would provide a highly-compressed data structure to download quickly from VCIO and that you can query to know if a PURL may exist in VCIO
- this could be an automaton (ahocorasick or FST) leveraging the fact that many PURL share a common prefix, or a bloom filter.
- it would be best cached for a few hours and should come with client code that uses it to filter a (long) list of PURLs and remove those that surely do not exist in VCIO
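As a rough illustration of the bloom filter option in the steps above, here is a minimal pure-Python sketch. It is not an existing VCIO feature or API: the `BloomFilter` class, the `filter_candidate_purls` helper, and the sizing defaults are all hypothetical, and a real endpoint would also need a serialization format for the downloaded filter.

```python
import hashlib
import math


class BloomFilter:
    """Tiny Bloom filter: never a false negative, tunable false-positive rate."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing formulas for the bit count (m) and hash count (k).
        self.size = max(
            8,
            int(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2),
        )
        self.hash_count = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, value):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )


def filter_candidate_purls(purls, known_purls_filter):
    """Keep only the PURLs that may exist according to the filter.

    Hypothetical client-side helper: anything dropped here is surely absent,
    anything kept still needs a real API lookup.
    """
    return [purl for purl in purls if purl in known_purls_filter]
```

The appeal of this design is that a PURL present on the server side is never dropped; the only cost of a false positive is one unnecessary lookup.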
From https://github.com/aboutcode-org/dejacode/issues/94#issuecomment-2298761954
@pombredanne Thanks, this sounds like it will require some work to make this happen.
In the short term, could VCIO expose a new "action" on the package endpoint to get this list of supported types? (It should be a very small and fast query.) On the DejaCode side, the process could start by fetching the available types to get a QuerySet limited to those, which would drastically reduce the number of queries.
>>> unique_types = Package.objects.values_list("type", flat=True).distinct()
>>> unique_types
<PackageQuerySet ['about', 'cargo', 'cocoapods', 'composer', 'deb', 'github', ...
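On the client side, the type-based pre-filtering could look roughly like the sketch below. It works on plain PURL strings rather than a Django QuerySet so that it stays self-contained. The `purl_type` and `split_by_supported_type` helpers are hypothetical names, the parsing is deliberately simplistic (it only reads the type segment after `pkg:`), and the set of supported types is assumed to come from VCIO.

```python
def purl_type(purl):
    """Return the lowercased type of a PURL string, or None if it cannot be read.

    Simplified parsing: expects the form "pkg:<type>/<rest>".
    """
    if not purl.startswith("pkg:"):
        return None
    remainder = purl[len("pkg:"):]
    type_, separator, _ = remainder.partition("/")
    if not separator or not type_:
        return None
    return type_.lower()


def split_by_supported_type(purls, supported_types):
    """Split PURLs into (worth looking up, safe to skip) lists.

    `supported_types` is assumed to be the set of types that VCIO has data for.
    """
    supported = {t.lower() for t in supported_types}
    kept, skipped = [], []
    for purl in purls:
        if purl_type(purl) in supported:
            kept.append(purl)
        else:
            skipped.append(purl)
    return kept, skipped
```

With a Django QuerySet, the equivalent would presumably be a `filter(type__in=supported_types)` on the package model, which avoids loading the unsupported packages at all.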
Another example that takes over a minute to load: https://public.vulnerablecode.io/api/vulnerabilities?vulnerability_id=VCID-j2zf-12g6-aaag
We need to entirely change the API data we return, in a new endpoint that does not provide all the package details in a vulnerability. We care about packages first, and less about vulnerabilities, so when querying by vulnerability, we should not serialize so much package data.
This is a related issue to restructure the API:
- https://github.com/aboutcode-org/vulnerablecode/issues/1572
See a first PR to improve the results:
- https://github.com/aboutcode-org/vulnerablecode/pull/1558#issuecomment-2370777141
Fixed by: https://github.com/aboutcode-org/vulnerablecode/pull/1701
This is done now!
PRs for reference: https://github.com/aboutcode-org/vulnerablecode/pull/1701 https://github.com/aboutcode-org/vulnerablecode/pull/1558
To test this, go to https://public.vulnerablecode.io/api/packages/bulk_search and make a request like this:
{
    "purls": ["pkg:ruby/[email protected]"],
    "purl_only": false,
    "plain_purl": false
}
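For scripted testing, the same request can be built with the Python standard library. This is only a sketch: the helper names are hypothetical, the payload fields mirror the example above, and authentication (which the instance may require) is not handled here.

```python
import json
import urllib.request

BULK_SEARCH_URL = "https://public.vulnerablecode.io/api/packages/bulk_search"


def build_bulk_search_request(purls, purl_only=False, plain_purl=False):
    """Build (but do not send) a POST request for the bulk_search endpoint."""
    payload = {
        "purls": list(purls),
        "purl_only": purl_only,
        "plain_purl": plain_purl,
    }
    return urllib.request.Request(
        BULK_SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def bulk_search(purls, timeout=60, **options):
    """Send the request and return the decoded JSON response."""
    request = build_bulk_search_request(purls, **options)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping the request construction separate from the network call makes it possible to check the payload without hitting the API.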
Additionally, we have significantly reduced the number of queries (to 60%), from https://github.com/aboutcode-org/vulnerablecode/commit/7fa45cb0d9dc802a6057edfb003a9f85cfed95fb#diff-d3ca0948dc3b5eb0b1adecaa9da9d7854628b0b6bbcf5f515bed6cab4d894339R474 to https://github.com/aboutcode-org/vulnerablecode/commit/9702c60bb4bac2b98dd988a47948408a16b2cff3#diff-d3ca0948dc3b5eb0b1adecaa9da9d7854628b0b6bbcf5f515bed6cab4d894339R472
We also added indexes for the models.