Transitive PyPI resolution consumes a large amount of CPU
Scorecard upgraded to osv-scanner v2 this past week and we saw a huge increase in resource consumption (at least 5-10x, though it could be another order of magnitude higher, as our replica count went from 14 -> 1 worker due to resource-related evictions).
Profiling shows 98% of our time is spent resolving PyPI transitive dependencies. I didn't have time to dig into root causes, but does the profile give you any hints as to whether the inefficiency is in deps.dev, osv-scalibr, or here (e.g. whether repeated calls to deps.dev are being made)?
Top cumulative pprof functions
Showing top 50 nodes out of 126
flat flat% sum% cum cum%
0 0% 0% 205.33s 98.53% deps.dev/util/resolve/pypi.(*resolution).resolve
0 0% 0% 205.33s 98.53% deps.dev/util/resolve/pypi.(*resolver).Resolve
0 0% 0% 205.33s 98.53% github.com/google/osv-scalibr.Scanner.Scan
0 0% 0% 205.33s 98.53% github.com/google/osv-scalibr/extractor/filesystem.(*walkContext).handleFile
0 0% 0% 205.33s 98.53% github.com/google/osv-scalibr/extractor/filesystem.(*walkContext).runExtractor
0 0% 0% 205.33s 98.53% github.com/google/osv-scalibr/extractor/filesystem.Run
0 0% 0% 205.33s 98.53% github.com/google/osv-scalibr/extractor/filesystem.RunFS
0 0% 0% 205.33s 98.53% github.com/google/osv-scalibr/extractor/filesystem.runOnScanRoot
0 0% 0% 205.33s 98.53% github.com/google/osv-scalibr/extractor/filesystem.walkIndividualPaths
0 0% 0% 205.33s 98.53% github.com/google/osv-scalibr/extractor/filesystem/internal.WalkDirUnsorted
0 0% 0% 205.33s 98.53% github.com/google/osv-scalibr/extractor/filesystem/internal.walkDirUnsorted
0 0% 0% 205.33s 98.53% github.com/google/osv-scalibr/extractor/filesystem/language/python/requirementsnet.Extractor.Extract
0 0% 0% 205.33s 98.53% github.com/google/osv-scanner/v2/internal/scalibrextract/language/python/requirementsenhancable.(*Extractor).Extract
0 0% 0% 205.33s 98.53% github.com/google/osv-scanner/v2/pkg/osvscanner.DoScan
0 0% 0% 205.33s 98.53% github.com/google/osv-scanner/v2/pkg/osvscanner.scan
0 0% 0% 205.33s 98.53% github.com/ossf/scorecard/v5/checker.(*Runner).Run
0 0% 0% 205.33s 98.53% github.com/ossf/scorecard/v5/checks.Vulnerabilities
0 0% 0% 205.33s 98.53% github.com/ossf/scorecard/v5/checks/raw.Vulnerabilities
0 0% 0% 205.33s 98.53% github.com/ossf/scorecard/v5/clients.osvClient.ListUnfixedVulnerabilities
0 0% 0% 205.33s 98.53% github.com/ossf/scorecard/v5/pkg/scorecard.runEnabledChecks.func1
0.01s 0.0048% 0.0048% 205.22s 98.48% deps.dev/util/resolve/pypi.(*resolution).attemptToPinCriterion
0.02s 0.0096% 0.014% 205.17s 98.45% deps.dev/util/resolve/pypi.(*resolution).getCriteriaToUpdate
0.06s 0.029% 0.043% 195.28s 93.71% deps.dev/util/resolve/pypi.(*resolution).mergeIntoCriterion
0.05s 0.024% 0.067% 195.13s 93.64% deps.dev/util/resolve/pypi.(*provider).findMatches
0.07s 0.034% 0.1% 194.54s 93.35% deps.dev/util/resolve/pypi.(*provider).matchingVersions
0.02s 0.0096% 0.11% 193.65s 92.93% github.com/google/osv-scalibr/clients/resolution.(*OverrideClient).MatchingVersions
0 0% 0.11% 193.58s 92.89% github.com/google/osv-scalibr/clients/resolution.(*PyPIRegistryClient).MatchingVersions
0.25s 0.12% 0.23% 186.88s 89.68% github.com/google/osv-scalibr/clients/resolution.(*PyPIRegistryClient).Versions
0.04s 0.019% 0.25% 180.27s 86.51% github.com/google/osv-scalibr/clients/datasource.(*PyPIRegistryAPIClient).GetIndex
0.12s 0.058% 0.31% 179.28s 86.03% encoding/json.Unmarshal
0.16s 0.077% 0.38% 115.78s 55.56% encoding/json.(*decodeState).unmarshal
6.04s 2.90% 3.28% 115.76s 55.55% encoding/json.(*decodeState).object
1.23s 0.59% 3.87% 115.76s 55.55% encoding/json.(*decodeState).value
0.39s 0.19% 4.06% 114.84s 55.11% encoding/json.(*decodeState).array
40.52s 19.44% 23.50% 65.81s 31.58% encoding/json.checkValid
1.66s 0.8% 24.30% 33.06s 15.86% encoding/json.(*decodeState).literalStore
22.31s 10.71% 35.01% 22.68s 10.88% encoding/json.stateInString
12.47s 5.98% 40.99% 19.26s 9.24% encoding/json.(*decodeState).skip
16.56s 7.95% 48.94% 16.89s 8.10% encoding/json.unquoteBytes
1.50s 0.72% 49.66% 16.53s 7.93% runtime.mallocgc
13.19s 6.33% 55.99% 15.88s 7.62% encoding/json.(*decodeState).rescanLiteral
0.11s 0.053% 56.04% 15.03s 7.21% deps.dev/util/semver.System.Parse
0.42s 0.2% 56.24% 14.41s 6.91% deps.dev/util/semver.System.parse
2.04s 0.98% 57.22% 11.51s 5.52% runtime.mapaccess1_faststr
2.20s 1.06% 58.28% 11.29s 5.42% encoding/json.indirect
0.08s 0.038% 58.31% 11.10s 5.33% deps.dev/util/semver.(*versionParser).version
0 0% 58.31% 11.09s 5.32% slices.SortFunc[go.shape.[]string,go.shape.string] (inline)
0.04s 0.019% 58.33% 11.08s 5.32% slices.pdqsortCmpFunc[go.shape.string]
0.27s 0.13% 58.46% 11.02s 5.29% deps.dev/util/semver.(*versionParser).pep440Version (inline)
0.05s 0.024% 58.49% 10.85s 5.21% github.com/google/osv-scalibr/clients/resolution.(*PyPIRegistryClient).Versions.func1
(pprof graph attached)
We've disabled transitive scanning through your experimental scan actions, so we at least have a temporary workaround:
ExperimentalScannerActions: osvscanner.ExperimentalScannerActions{
	TransitiveScanningActions: osvscanner.TransitiveScanningActions{
		Disabled: true,
	},
},
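For context, a minimal sketch of how that fragment could sit inside a full set of scan options. The surrounding ScannerActions literal and the DoScan signature are assumptions about osv-scanner v2's pkg/osvscanner API (the package visible in the profile above), not a copy of Scorecard's actual code:

```go
// Hedged sketch: only ExperimentalScannerActions, TransitiveScanningActions,
// and Disabled are taken from the snippet above; everything else is assumed.
actions := osvscanner.ScannerActions{
	// ... existing scan inputs (lockfile / directory paths) elided ...
	ExperimentalScannerActions: osvscanner.ExperimentalScannerActions{
		TransitiveScanningActions: osvscanner.TransitiveScanningActions{
			Disabled: true, // skip transitive resolution entirely
		},
	},
}
// DoScan is assumed to accept the actions struct directly, matching the
// github.com/google/osv-scanner/v2/pkg/osvscanner frames in the profile.
results, err := osvscanner.DoScan(actions)
_, _ = results, err
```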
@spencerschrock thanks for the profiling - I will take a look.
It seems json.Unmarshal is quite expensive (which is a bit of a surprise to me).
One potential improvement I can think of is to cache the unmarshalled struct instead of the raw response, to reduce the number of calls to json.Unmarshal.
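A minimal sketch of that idea, assuming a hypothetical indexCache keyed by package name; the pypiIndex type and its fields are placeholders for whatever struct the registry client actually decodes into:

```go
package pypicache

import (
	"encoding/json"
	"sync"
)

// pypiIndex is an illustrative placeholder, not the real osv-scalibr type.
type pypiIndex struct {
	Name     string   `json:"name"`
	Versions []string `json:"versions"`
}

// indexCache stores decoded structs rather than raw JSON bodies, so each
// package's index is unmarshalled at most once per run.
type indexCache struct {
	mu      sync.Mutex
	entries map[string]pypiIndex
}

func (c *indexCache) get(pkg string, body []byte) (pypiIndex, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if idx, ok := c.entries[pkg]; ok {
		return idx, nil // already decoded; no json.Unmarshal call
	}
	var idx pypiIndex
	if err := json.Unmarshal(body, &idx); err != nil {
		return pypiIndex{}, err
	}
	c.entries[pkg] = idx
	return idx, nil
}
```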
cache the unmarshalled struct instead of the raw response
we also observed a huge increase in memory usage (from 6GB to 40GB). Does the cache ever get emptied?
It seems json.Unmarshal is quite expensive (which is a bit of a surprise to me).
Perhaps you and I should try the jsonv2 experiment? https://go.dev/blog/jsonv2-exp
The Unmarshal performance of v2 is significantly faster than v1, with benchmarks demonstrating improvements of up to 10x.
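For illustration, a minimal sketch of what the swap might look like under GOEXPERIMENT=jsonv2 (Go 1.25+). The pypiIndex struct is again a placeholder, and the only assumption beyond the blog post is that the call shape of Unmarshal stays the same:

```go
//go:build goexperiment.jsonv2

package pypijsonv2

import (
	jsonv2 "encoding/json/v2"
)

// pypiIndex is an illustrative placeholder for whatever struct the
// registry client actually decodes into.
type pypiIndex struct {
	Name     string   `json:"name"`
	Versions []string `json:"versions"`
}

// decodeIndex keeps the same call shape as encoding/json v1; only the
// import changes. Requires building with GOEXPERIMENT=jsonv2.
func decodeIndex(body []byte) (pypiIndex, error) {
	var idx pypiIndex
	err := jsonv2.Unmarshal(body, &idx)
	return idx, err
}
```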
we also observed a huge increase in memory usage (from 6GB to 40GB). Does the cache ever get emptied?
The cache is probably not emptied until a full run of osv-scanner completes, and we are caching all artifacts from various registries (Maven, PyPI, etc.). We do have a mechanism called "local registry" that downloads and reads artifacts from a local folder; we could probably utilize that for Scorecard - let's chat offline about this!
Perhaps you and I should try the jsonv2 experiment?
Yes, I am also looking at other JSON packages, e.g. json-iterator/go and goccy/go-json.
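For what it's worth, json-iterator/go advertises drop-in compatibility, so a trial could be as small as swapping the package variable; a hedged sketch (pypiIndex is again a placeholder type, not the real one):

```go
package pypijsoniter

import (
	jsoniter "github.com/json-iterator/go"
)

// ConfigCompatibleWithStandardLibrary mimics encoding/json behaviour,
// so existing struct tags and call sites keep working.
var json = jsoniter.ConfigCompatibleWithStandardLibrary

type pypiIndex struct {
	Name     string   `json:"name"`
	Versions []string `json:"versions"`
}

func decodeIndex(body []byte) (pypiIndex, error) {
	var idx pypiIndex
	err := json.Unmarshal(body, &idx)
	return idx, err
}
```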
Depending on what fields you need, I'd also consider https://github.com/tidwall/gjson - it is more focused on retrieving specific values via paths, so if you only need one or even a couple of fields it might be more efficient, because in theory it should bail out of parsing as soon as possible.
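A hedged sketch of that approach; the assumed response shape (a top-level "versions" array, as in PEP 700's JSON simple index) is illustrative and may not match what osv-scalibr actually requests:

```go
package pypigjson

import (
	"github.com/tidwall/gjson"
)

// versionsFromIndex pulls only the version strings out of an index
// response, without decoding the (typically much larger) files array.
func versionsFromIndex(body []byte) []string {
	var versions []string
	// gjson scans the document for the requested path and avoids
	// materializing anything else.
	for _, v := range gjson.GetBytes(body, "versions").Array() {
		versions = append(versions, v.String())
	}
	return versions
}
```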
FWIW I have looked into alternative JSON libraries in the past, since JSON decoding tends to be where a large share of our CPU cycles go, but I never found anything that gave a clear improvement in speed. I was looking at the speed of a single CLI run, though, rather than at the scale that Scorecard runs at, and I wasn't being super scientific about it 🤷
Another question: right now TransitiveScanningActions is all or nothing. Is there any plan to offer finer-grained control over which data enrichers / external accessors are enabled?
I think that is our plan: to have more fine-grained control over which plugins (including enrichers) are used in osv-scanner.
You can already do that with the --experimental-plugins flag, which lets you control which enrichers are enabled: https://google.github.io/osv-scanner/experimental/manual-plugin-selection/#enabling-and-disabling-plugins
It's a bit fiddly, though; we need to work out a better way to indicate this.
You can already do that with the --experimental-plugins flag
I had seen those, and did a quick search through the codebase, specifically this file: https://github.com/google/osv-scanner/blob/main/internal/scalibrplugin/presets.go
In this case, is the transitive Python plugin "python/requirementsenhanceable"?
Yes for now, but this will be replaced by the transitive enricher soon: https://github.com/google/osv-scanner/pull/2294, which I aim to get in before the next OSV-Scanner release.
I tried implementations with both tidwall/gjson and encoding/json/v2, and benchmarks against both indicate roughly a 3x improvement.
- However, considering encoding/json/v2 is still experimental, I am a bit reluctant to switch to it now.
- For the long term, I still prefer to depend on encoding/json/v2 if the performance improves as stated in the blog post, so I am also not in favour of a transition to tidwall/gjson.
There is another option: using the deps.dev API for PyPI requirements.
- Scorecard is not experiencing any performance issues for Maven, which currently relies on the deps.dev API, so I assume using deps.dev for PyPI will also not raise performance concerns.
- PyPI requirements are not available there yet, but this is going to happen soon, so I think we can probably disable transitive scanning for now and turn it back on with the deps.dev backend when it's available.