dependabot-core
dependabot-core copied to clipboard
bundler: optimize gemfile parsing
Summary
After profiling FileParser#parse
for bundler, I found that the ruby
Parser was extremely hot. The same Gemfile was being parsed twice for
each dependency, when it only needed to be parsed a single time per
file.
The performance improvement depends on the size of the Gemfile being parsed, but the following example shows the speedup for the discourse/discourse repo.
Before:
Warming up --------------------------------------
parse 1.000 i/100ms
Calculating -------------------------------------
parse 0.121 (± 0.0%) i/s - 1.000 in 8.262804s
After:
Warming up --------------------------------------
parse 1.000 i/100ms
Calculating -------------------------------------
parse 6.319 (±15.8%) i/s - 32.000 in 5.099269s
Other Information
I wasn't sure what the policy around breaking API changes was, especially for a relatively internal class like GemfileDeclarationFinder
. Any feedback would be appreciated, I believe I can accomplish something similar without breaking the method signatures if necessary.
Benchmark Script
# frozen_string_literal: true
require "bundler/inline"
gemfile(true) do
source "https://rubygems.org"
gem "dependabot-common", path: "./common"
gem "dependabot-bundler", path: "./bundler"
gem "benchmark-ips"
# This was helping some DNS issues I was having but is optional
gem "resolv-replace"
end
package_manager = "bundler"
repo_name = "discourse/discourse"
credentials = [
{
"type" => "git_source",
"host" => "github.com",
"username" => "x-access-token",
"password" => ENV["GITHUB_ACCESS_TOKEN"] # A GitHub access token with read access to public repos
}
]
source = Dependabot::Source.new(
provider: "github",
repo: repo_name,
directory: "/",
)
puts "Fetching #{package_manager} dependency files for #{repo_name}"
fetcher = Dependabot::FileFetchers.for_package_manager(package_manager).new(
source: source,
credentials: credentials,
)
puts "Parsing dependencies information"
parser = Dependabot::FileParsers.for_package_manager(package_manager).new(
dependency_files: fetcher.files,
source: source,
credentials: credentials,
)
Benchmark.ips do |x|
x.report("parse") { parser.parse }
x.compare!
end
Hello @SkipKayhil,
Thank you for this contribution!
Indeed, many of Dependabot Core's FileParsers
read the same dependencies multiple times.
I wasn't sure what the policy around breaking API changes was, especially for a relatively internal class like
GemfileDeclarationFinder
Generally we try to avoid breaking the existing API. I've made a couple of comments in the code pointing out breaking changes (e.g. removing the @dependency variable).
Have you been able to run the test suite to make sure these changes pass? There are instruction in the README on how to run the test suite. Once you have passing tests we can run your PR against the CI 😄
These speedups are very welcome since they have an even bigger effect at scale! Let's try to find a solution that doesn't break the existing API together.
P.S., the Dependabot team is currently short-handed so we may take a while to respond.
Thanks! Following on from @Nishnha comment. I don't this change would break any APIs as we don't expose this module externally and don't think anyone is relying on it directly.
I also think the interface change makes sense given to goal of only parsing a gemfile once, as there are usually multiple dependencies per file 👍
I've kicked off CI, new contributions need the actions run approved.
Thanks for looking into this @SkipKayhil, I think it'd be good to instrument this in a full dry-run rather than in isolation, as the parsing step only happens once during an entire update, so even though it's not very fast in isolation it might not make a huge difference overall. That's not to say this isn't worth doing, would just good to get a sense of the overall impact
Thanks for looking into this @SkipKayhil, I think it'd be good to instrument this in a full dry-run rather than in isolation, as the parsing step only happens once during an entire update, so even though it's not very fast in isolation it might not make a huge difference overall. That's not to say this isn't worth doing, would just good to get a sense of the overall impact
That's a pretty good point, I hadn't thought about the other steps. So far I've been using dependabot as a library to gather dependency information but haven't yet used it to update things. In that scenario this change makes a pretty big impact, but after running the dry run script it looks like it doesn't end up being as much of an improvement in the whole process. I'll probably be looking into the UpdateChecker
process in the future because that seems to be where the bulk of time is spent.
On my (not so powerful) laptop:
# main
$ time bin/dry-run.rb bundler discourse/discourse
...
real 43m9.203s
user 41m10.464s
sys 1m47.502s
And swapping out Benchmark.ips
for Benchmark.measure
in my original script:
# main
$ ruby benchmark_script.rb
14.032599 0.035577 14.623058 ( 14.636967)
# skipkayhil:gemfile-parse-optimization
$ ruby benchmark_script.rb
0.395320 0.000374 0.901548 ( 0.899722)
I'll probably be looking into the UpdateChecker process in the future because that seems to be where the bulk of time is spent.
Nice 👍 I think most of the time is spent doing networking, but if we can find any optimizations there, that'd be fantastic!
@skipkayhil thanks again for putting up this PR... are you still interested in driving it to completion? Given the millions of jobs that we run, we're always interested in perf improvements. 👍
are you still interested in driving it to completion?
Would definitely still love to see this merged! I just gave it a rebase and will update if anything fails CI
I had some initial concerns with merging this PR incase it changed some underlying behavior in how we parse dependencies.
Seeing this PR is passing our e2e smoke tests, which tests for consistent output, is reassuring!
Thanks for the massive speed up and benchmark script @skipkayhil - definitely going to try this out on our other ecosystems!