
Transitive PyPI resolution consumes a large amount of CPU

Open spencerschrock opened this issue 1 month ago • 13 comments

Scorecard upgraded to osv-scanner v2 this past week and we saw a huge increase in resource consumption (at least 5-10x, but it could be another order of magnitude higher, as our replica count went from 14 workers to 1 due to resource-related evictions).

Profiling shows 98% of our time is spent resolving PyPI transitive dependencies. I didn't have time to dig into root causes, but does the profile give you any hints as to whether the inefficiency is in deps.dev, osv-scalibr, or here (e.g. whether there are repeated calls to deps.dev being made)?

Top cumulative pprof functions
Showing top 50 nodes out of 126
      flat  flat%   sum%        cum   cum%
         0     0%     0%    205.33s 98.53%  deps.dev/util/resolve/pypi.(*resolution).resolve
         0     0%     0%    205.33s 98.53%  deps.dev/util/resolve/pypi.(*resolver).Resolve
         0     0%     0%    205.33s 98.53%  github.com/google/osv-scalibr.Scanner.Scan
         0     0%     0%    205.33s 98.53%  github.com/google/osv-scalibr/extractor/filesystem.(*walkContext).handleFile
         0     0%     0%    205.33s 98.53%  github.com/google/osv-scalibr/extractor/filesystem.(*walkContext).runExtractor
         0     0%     0%    205.33s 98.53%  github.com/google/osv-scalibr/extractor/filesystem.Run
         0     0%     0%    205.33s 98.53%  github.com/google/osv-scalibr/extractor/filesystem.RunFS
         0     0%     0%    205.33s 98.53%  github.com/google/osv-scalibr/extractor/filesystem.runOnScanRoot
         0     0%     0%    205.33s 98.53%  github.com/google/osv-scalibr/extractor/filesystem.walkIndividualPaths
         0     0%     0%    205.33s 98.53%  github.com/google/osv-scalibr/extractor/filesystem/internal.WalkDirUnsorted
         0     0%     0%    205.33s 98.53%  github.com/google/osv-scalibr/extractor/filesystem/internal.walkDirUnsorted
         0     0%     0%    205.33s 98.53%  github.com/google/osv-scalibr/extractor/filesystem/language/python/requirementsnet.Extractor.Extract
         0     0%     0%    205.33s 98.53%  github.com/google/osv-scanner/v2/internal/scalibrextract/language/python/requirementsenhancable.(*Extractor).Extract
         0     0%     0%    205.33s 98.53%  github.com/google/osv-scanner/v2/pkg/osvscanner.DoScan
         0     0%     0%    205.33s 98.53%  github.com/google/osv-scanner/v2/pkg/osvscanner.scan
         0     0%     0%    205.33s 98.53%  github.com/ossf/scorecard/v5/checker.(*Runner).Run
         0     0%     0%    205.33s 98.53%  github.com/ossf/scorecard/v5/checks.Vulnerabilities
         0     0%     0%    205.33s 98.53%  github.com/ossf/scorecard/v5/checks/raw.Vulnerabilities
         0     0%     0%    205.33s 98.53%  github.com/ossf/scorecard/v5/clients.osvClient.ListUnfixedVulnerabilities
         0     0%     0%    205.33s 98.53%  github.com/ossf/scorecard/v5/pkg/scorecard.runEnabledChecks.func1
     0.01s 0.0048% 0.0048%    205.22s 98.48%  deps.dev/util/resolve/pypi.(*resolution).attemptToPinCriterion
     0.02s 0.0096% 0.014%    205.17s 98.45%  deps.dev/util/resolve/pypi.(*resolution).getCriteriaToUpdate
     0.06s 0.029% 0.043%    195.28s 93.71%  deps.dev/util/resolve/pypi.(*resolution).mergeIntoCriterion
     0.05s 0.024% 0.067%    195.13s 93.64%  deps.dev/util/resolve/pypi.(*provider).findMatches
     0.07s 0.034%   0.1%    194.54s 93.35%  deps.dev/util/resolve/pypi.(*provider).matchingVersions
     0.02s 0.0096%  0.11%    193.65s 92.93%  github.com/google/osv-scalibr/clients/resolution.(*OverrideClient).MatchingVersions
         0     0%  0.11%    193.58s 92.89%  github.com/google/osv-scalibr/clients/resolution.(*PyPIRegistryClient).MatchingVersions
     0.25s  0.12%  0.23%    186.88s 89.68%  github.com/google/osv-scalibr/clients/resolution.(*PyPIRegistryClient).Versions
     0.04s 0.019%  0.25%    180.27s 86.51%  github.com/google/osv-scalibr/clients/datasource.(*PyPIRegistryAPIClient).GetIndex
     0.12s 0.058%  0.31%    179.28s 86.03%  encoding/json.Unmarshal
     0.16s 0.077%  0.38%    115.78s 55.56%  encoding/json.(*decodeState).unmarshal
     6.04s  2.90%  3.28%    115.76s 55.55%  encoding/json.(*decodeState).object
     1.23s  0.59%  3.87%    115.76s 55.55%  encoding/json.(*decodeState).value
     0.39s  0.19%  4.06%    114.84s 55.11%  encoding/json.(*decodeState).array
    40.52s 19.44% 23.50%     65.81s 31.58%  encoding/json.checkValid
     1.66s   0.8% 24.30%     33.06s 15.86%  encoding/json.(*decodeState).literalStore
    22.31s 10.71% 35.01%     22.68s 10.88%  encoding/json.stateInString
    12.47s  5.98% 40.99%     19.26s  9.24%  encoding/json.(*decodeState).skip
    16.56s  7.95% 48.94%     16.89s  8.10%  encoding/json.unquoteBytes
     1.50s  0.72% 49.66%     16.53s  7.93%  runtime.mallocgc
    13.19s  6.33% 55.99%     15.88s  7.62%  encoding/json.(*decodeState).rescanLiteral
     0.11s 0.053% 56.04%     15.03s  7.21%  deps.dev/util/semver.System.Parse
     0.42s   0.2% 56.24%     14.41s  6.91%  deps.dev/util/semver.System.parse
     2.04s  0.98% 57.22%     11.51s  5.52%  runtime.mapaccess1_faststr
     2.20s  1.06% 58.28%     11.29s  5.42%  encoding/json.indirect
     0.08s 0.038% 58.31%     11.10s  5.33%  deps.dev/util/semver.(*versionParser).version
         0     0% 58.31%     11.09s  5.32%  slices.SortFunc[go.shape.[]string,go.shape.string] (inline)
     0.04s 0.019% 58.33%     11.08s  5.32%  slices.pdqsortCmpFunc[go.shape.string]
     0.27s  0.13% 58.46%     11.02s  5.29%  deps.dev/util/semver.(*versionParser).pep440Version (inline)
     0.05s 0.024% 58.49%     10.85s  5.21%  github.com/google/osv-scalibr/clients/resolution.(*PyPIRegistryClient).Versions.func1

[pprof graph image]

We've disabled transitive scanning through your experimental scan actions, so we at least have a temporary workaround:

ExperimentalScannerActions: osvscanner.ExperimentalScannerActions{
	TransitiveScanningActions: osvscanner.TransitiveScanningActions{
		// Skip transitive dependency resolution entirely.
		Disabled: true,
	},
},

spencerschrock avatar Nov 06 '25 16:11 spencerschrock

@spencerschrock thanks for the profiling - I will take a look.

cuixq avatar Nov 06 '25 22:11 cuixq

It seems json.Unmarshal is quite expensive (which is a bit of a surprise to me).

One potential improvement I can think of is to cache the unmarshalled struct instead of the raw response, to reduce the number of calls to json.Unmarshal.
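
For illustration, a minimal sketch of that idea: decode once on a cache miss and store the struct, so repeated lookups for the same package skip json.Unmarshal entirely. The pypiIndex type and cache shape here are hypothetical stand-ins, not the actual osv-scalibr types:

package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// pypiIndex is a hypothetical stand-in for a decoded PyPI index response.
type pypiIndex struct {
	Versions []string `json:"versions"`
}

// indexCache stores the decoded struct rather than the raw response bytes,
// so each package body goes through json.Unmarshal at most once.
type indexCache struct {
	mu      sync.Mutex
	decoded map[string]*pypiIndex
}

func (c *indexCache) get(pkg string, body []byte) (*pypiIndex, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if idx, ok := c.decoded[pkg]; ok {
		return idx, nil // cache hit: no JSON parsing at all
	}
	idx := &pypiIndex{}
	if err := json.Unmarshal(body, idx); err != nil {
		return nil, err
	}
	c.decoded[pkg] = idx
	return idx, nil
}

func main() {
	c := &indexCache{decoded: map[string]*pypiIndex{}}
	idx, err := c.get("requests", []byte(`{"versions":["2.31.0","2.32.0"]}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(idx.Versions)
}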

cuixq avatar Nov 06 '25 23:11 cuixq

cache the unmarshalled struct instead of the raw response

We also observed a huge increase in memory usage (from 6 GB to 40 GB). Does the cache ever get emptied?

spencerschrock avatar Nov 07 '25 01:11 spencerschrock

It seems json.Unmarshal is quite expensive (which is a bit of a surprise to me).

Perhaps you and I should try the jsonv2 experiment? https://go.dev/blog/jsonv2-exp

The Unmarshal performance of v2 is significantly faster than v1, with benchmarks demonstrating improvements of up to 10x.
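
For reference, a minimal sketch of what opting in looks like, assuming Go 1.25+, where encoding/json/v2 ships behind GOEXPERIMENT=jsonv2 (the release struct is just an illustrative placeholder):

// Build with: GOEXPERIMENT=jsonv2 go build ./...
package main

import (
	jsonv2 "encoding/json/v2" // only importable when the experiment is enabled
	"fmt"
)

type release struct {
	Version string `json:"version"`
}

func main() {
	var r release
	// v2's Unmarshal keeps the familiar shape, so swapping it into hot
	// paths like the registry index decoding is mostly mechanical.
	if err := jsonv2.Unmarshal([]byte(`{"version":"2.32.0"}`), &r); err != nil {
		panic(err)
	}
	fmt.Println(r.Version)
}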

spencerschrock avatar Nov 07 '25 01:11 spencerschrock

We also observed a huge increase in memory usage (from 6 GB to 40 GB). Does the cache ever get emptied?

The cache is probably not emptied until a full run of osv-scanner completes, and we are caching all artifacts from various registries (Maven, PyPI, etc.). We do have a mechanism called a "local registry" that downloads and reads artifacts from a local folder; we can probably use that for Scorecard - let's chat offline about this!

cuixq avatar Nov 07 '25 02:11 cuixq

Perhaps you and I should try the jsonv2 experiment?

Yes, I am also looking at other JSON packages, e.g. json-iterator/go and goccy/go-json.
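
Both aim to be drop-in replacements for encoding/json, so trialling them is mostly an import swap; a minimal sketch with goccy/go-json (json-iterator works similarly via its ConfigCompatibleWithStandardLibrary):

package main

import (
	"fmt"

	json "github.com/goccy/go-json" // drop-in for encoding/json
)

func main() {
	var v map[string]any
	// Same Unmarshal signature as the standard library, so an A/B benchmark
	// is essentially a one-line import change.
	if err := json.Unmarshal([]byte(`{"ok":true}`), &v); err != nil {
		panic(err)
	}
	fmt.Println(v["ok"])
}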

cuixq avatar Nov 07 '25 02:11 cuixq

Depending on what fields you need, I'd also consider https://github.com/tidwall/gjson - it is more focused on retrieving specific values via paths, so if you only need one or even a couple of fields it might be more efficient, because in theory it should bail out of parsing as soon as possible.
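
A minimal sketch of that path-based style (the JSON shape here is a made-up stand-in, not the actual PyPI index response):

package main

import (
	"fmt"

	"github.com/tidwall/gjson"
)

func main() {
	// Made-up slice of a registry response; gjson scans the document and
	// stops once the requested path is satisfied, skipping everything else.
	data := []byte(`{"name":"requests","versions":["2.31.0","2.32.0"]}`)
	for _, v := range gjson.GetBytes(data, "versions").Array() {
		fmt.Println(v.String())
	}
}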

FWIW I have looked into alternative JSON libraries in the past, since that tends to be where a large share of our CPU cycles go, but I never found anything that gave a clear improvement in speed; I was looking at the speed of a single CLI run though, rather than at the scale that Scorecard runs at, and I wasn't being super scientific about it 🤷

G-Rath avatar Nov 10 '25 22:11 G-Rath

Another question: right now TransitiveScanningActions is all or nothing. Is there any plan to offer finer-grained control over which data enrichers / external accessors are enabled?

spencerschrock avatar Nov 11 '25 16:11 spencerschrock

Yes, the plan is to offer finer-grained control over which plugins (including enrichers) are used in osv-scanner.

cuixq avatar Nov 11 '25 21:11 cuixq

You can already do that with the --experimental-plugins flag, which lets you control which enrichers are enabled: https://google.github.io/osv-scanner/experimental/manual-plugin-selection/#enabling-and-disabling-plugins

Though it's a bit fiddly; we need to work out a better way to surface this.

another-rex avatar Nov 11 '25 22:11 another-rex

You can already do that with the --experimental-plugins flag

I had seen those, and did a quick search through the codebase, specifically this file: https://github.com/google/osv-scanner/blob/main/internal/scalibrplugin/presets.go

In this case, is the transitive Python plugin "python/requirementsenhanceable"?

spencerschrock avatar Nov 11 '25 22:11 spencerschrock

Yes, for now, but this will be replaced by the transitive enricher soon: https://github.com/google/osv-scanner/pull/2294, which I aim to get in before the next OSV-Scanner release.
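
For anyone else landing here, a hedged sketch of disabling that plugin from the CLI. The plugin name comes from the presets.go file referenced above; the "-" prefix for disabling is an assumption based on the linked manual-plugin-selection page, so check it for the exact syntax:

# Assumed syntax; see the manual-plugin-selection docs linked above.
# The "-" prefix to disable a plugin is an assumption, not verified here.
osv-scanner scan source --experimental-plugins=-python/requirementsenhanceable ./repo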

cuixq avatar Nov 11 '25 22:11 cuixq

I tried implementations with both tidwall/gjson and encoding/json/v2, and benchmarks against both implementations indicate roughly a 3x speedup.

  • However, considering encoding/json/v2 is still experimental, I am a bit reluctant to switch to it now.
  • For the long term, I would still prefer to depend on encoding/json/v2 if the performance improves as stated in the blog post, so I am not in favour of transitioning to tidwall/gjson either.

There is another option: using the deps.dev API for PyPI requirements (a quick sketch of the API follows after this list):

  • Scorecard is not experiencing any performance issues for Maven, which currently relies on the deps.dev API, so I assume using deps.dev for PyPI would not raise performance concerns either.
  • PyPI requirements resolution is not available there yet, but it is going to happen soon, so I think we can probably disable transitive scanning for now and turn it back on with the deps.dev backend once it's available.
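
For context, a minimal sketch of hitting deps.dev's public REST API for PyPI package data; the endpoint shape here is the v3 packages route as I understand it from the deps.dev API docs, so treat it as an assumption (what's pending per the comment above is requirements resolution support, not the API itself):

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Lists the versions deps.dev knows about for a PyPI package.
	// Endpoint shape assumed from the deps.dev v3 API documentation.
	resp, err := http.Get("https://api.deps.dev/v3/systems/pypi/packages/requests")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes: %.200s\n", len(body), body)
}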

cuixq avatar Nov 12 '25 04:11 cuixq