dependabot-core icon indicating copy to clipboard operation
dependabot-core copied to clipboard

bundler: optimize gemfile parsing

Open skipkayhil opened this issue 3 years ago • 7 comments

Summary

After profiling FileParser#parse for bundler, I found that the ruby Parser was extremely hot. The same Gemfile was being parsed twice for each dependency, when it only needed to be parsed a single time per file.

The performance improvement depends on the size of the Gemfile being parsed, but the following example shows the speedup for the discourse/discourse repo.

Before:

Warming up --------------------------------------
               parse     1.000  i/100ms
Calculating -------------------------------------
               parse      0.121  (± 0.0%) i/s -   1.000  in   8.262804s

After:

Warming up --------------------------------------
               parse     1.000  i/100ms
Calculating -------------------------------------
               parse      6.319  (±15.8%) i/s -  32.000  in   5.099269s

Other Information

I wasn't sure what the policy around breaking API changes was, especially for a relatively internal class like GemfileDeclarationFinder. Any feedback would be appreciated, I believe I can accomplish something similar without breaking the method signatures if necessary.

Benchmark Script

# frozen_string_literal: true

require "bundler/inline"

gemfile(true) do
  source "https://rubygems.org"

  gem "dependabot-common", path: "./common"
  gem "dependabot-bundler", path: "./bundler"
  gem "benchmark-ips"
  # This was helping some DNS issues I was having but is optional
  gem "resolv-replace"
end

package_manager = "bundler"
repo_name = "discourse/discourse"

credentials = [
  {
    "type" => "git_source",
    "host" => "github.com",
    "username" => "x-access-token",
    "password" => ENV["GITHUB_ACCESS_TOKEN"] # A GitHub access token with read access to public repos
  }
]

source = Dependabot::Source.new(
  provider: "github",
  repo: repo_name,
  directory: "/",
)

puts "Fetching #{package_manager} dependency files for #{repo_name}"
fetcher = Dependabot::FileFetchers.for_package_manager(package_manager).new(
  source: source,
  credentials: credentials,
)

puts "Parsing dependencies information"
parser = Dependabot::FileParsers.for_package_manager(package_manager).new(
  dependency_files: fetcher.files,
  source: source,
  credentials: credentials,
)

Benchmark.ips do |x|
  x.report("parse") { parser.parse }
  x.compare!
end

skipkayhil avatar Jul 18 '21 21:07 skipkayhil

Hello @SkipKayhil,

Thank you for this contribution! Indeed, many of Dependabot Core's FileParsers read the same dependencies multiple times.

I wasn't sure what the policy around breaking API changes was, especially for a relatively internal class like GemfileDeclarationFinder

Generally we try to avoid breaking the existing API. I've made a couple of comments in the code pointing out breaking changes (e.g. removing the @dependency variable).

Have you been able to run the test suite to make sure these changes pass? There are instruction in the README on how to run the test suite. Once you have passing tests we can run your PR against the CI 😄

These speedups are very welcome since they have an even bigger effect at scale! Let's try to find a solution that doesn't break the existing API together.

P.S., the Dependabot team is currently short-handed so we may take a while to respond.

Nishnha avatar Jul 20 '21 23:07 Nishnha

Thanks! Following on from @Nishnha comment. I don't this change would break any APIs as we don't expose this module externally and don't think anyone is relying on it directly.

I also think the interface change makes sense given to goal of only parsing a gemfile once, as there are usually multiple dependencies per file 👍

I've kicked off CI, new contributions need the actions run approved.

feelepxyz avatar Jul 21 '21 09:07 feelepxyz

Thanks for looking into this @SkipKayhil, I think it'd be good to instrument this in a full dry-run rather than in isolation, as the parsing step only happens once during an entire update, so even though it's not very fast in isolation it might not make a huge difference overall. That's not to say this isn't worth doing, would just good to get a sense of the overall impact

jurre avatar Jul 21 '21 09:07 jurre

Thanks for looking into this @SkipKayhil, I think it'd be good to instrument this in a full dry-run rather than in isolation, as the parsing step only happens once during an entire update, so even though it's not very fast in isolation it might not make a huge difference overall. That's not to say this isn't worth doing, would just good to get a sense of the overall impact

That's a pretty good point, I hadn't thought about the other steps. So far I've been using dependabot as a library to gather dependency information but haven't yet used it to update things. In that scenario this change makes a pretty big impact, but after running the dry run script it looks like it doesn't end up being as much of an improvement in the whole process. I'll probably be looking into the UpdateChecker process in the future because that seems to be where the bulk of time is spent.

On my (not so powerful) laptop:

# main
$ time bin/dry-run.rb bundler discourse/discourse
...
real    43m9.203s
user    41m10.464s
sys     1m47.502s

And swapping out Benchmark.ips for Benchmark.measure in my original script:

# main
$ ruby benchmark_script.rb
14.032599   0.035577  14.623058 ( 14.636967)

# skipkayhil:gemfile-parse-optimization
$ ruby benchmark_script.rb
0.395320   0.000374   0.901548 (  0.899722)

skipkayhil avatar Jul 22 '21 04:07 skipkayhil

I'll probably be looking into the UpdateChecker process in the future because that seems to be where the bulk of time is spent.

Nice 👍 I think most of the time is spent doing networking, but if we can find any optimizations there, that'd be fantastic!

jurre avatar Jul 22 '21 08:07 jurre

@skipkayhil thanks again for putting up this PR... are you still interested in driving it to completion? Given the millions of jobs that we run, we're always interested in perf improvements. 👍

jeffwidman avatar Sep 15 '22 08:09 jeffwidman

are you still interested in driving it to completion?

Would definitely still love to see this merged! I just gave it a rebase and will update if anything fails CI

skipkayhil avatar Sep 18 '22 01:09 skipkayhil

I had some initial concerns with merging this PR incase it changed some underlying behavior in how we parse dependencies.

Seeing this PR is passing our e2e smoke tests, which tests for consistent output, is reassuring!

Thanks for the massive speed up and benchmark script @skipkayhil - definitely going to try this out on our other ecosystems!

Nishnha avatar Sep 22 '22 23:09 Nishnha