Summary

After profiling FileParser#parse for bundler, I found that the ruby Parser was extremely hot. The same Gemfile was being parsed twice for each dependency, when it only needed to be parsed a single time per file.

The performance improvement depends on the size of the Gemfile being parsed, but the following example shows the speedup for the discourse/discourse repo.

Before:

Warming up --------------------------------------
               parse     1.000  i/100ms
Calculating -------------------------------------
               parse      0.121  (± 0.0%) i/s -   1.000  in   8.262804s

After:

Warming up --------------------------------------
               parse     1.000  i/100ms
Calculating -------------------------------------
               parse      6.319  (±15.8%) i/s -  32.000  in   5.099269s

Other Information

I wasn't sure what the policy around breaking API changes was, especially for a relatively internal class like GemfileDeclarationFinder. Any feedback would be appreciated, I believe I can accomplish something similar without breaking the method signatures if necessary.

Benchmark Script

# frozen_string_literal: true

require "bundler/inline"

gemfile(true) do
  source "https://rubygems.org"

  gem "dependabot-common", path: "./common"
  gem "dependabot-bundler", path: "./bundler"
  gem "benchmark-ips"
  # This was helping some DNS issues I was having but is optional
  gem "resolv-replace"
end

package_manager = "bundler"
repo_name = "discourse/discourse"

credentials = [
  {
    "type" => "git_source",
    "host" => "github.com",
    "username" => "x-access-token",
    "password" => ENV["GITHUB_ACCESS_TOKEN"] # A GitHub access token with read access to public repos
  }
]

source = Dependabot::Source.new(
  provider: "github",
  repo: repo_name,
  directory: "/",
)

puts "Fetching #{package_manager} dependency files for #{repo_name}"
fetcher = Dependabot::FileFetchers.for_package_manager(package_manager).new(
  source: source,
  credentials: credentials,
)

puts "Parsing dependencies information"
parser = Dependabot::FileParsers.for_package_manager(package_manager).new(
  dependency_files: fetcher.files,
  source: source,
  credentials: credentials,
)

Benchmark.ips do |x|
  x.report("parse") { parser.parse }
  x.compare!
end

Jul 18 '21 21:07 skipkayhil

Hello @SkipKayhil,

Thank you for this contribution! Indeed, many of Dependabot Core's FileParsers read the same dependencies multiple times.

I wasn't sure what the policy around breaking API changes was, especially for a relatively internal class like GemfileDeclarationFinder

Generally we try to avoid breaking the existing API. I've made a couple of comments in the code pointing out breaking changes (e.g. removing the @dependency variable).

Have you been able to run the test suite to make sure these changes pass? There are instruction in the README on how to run the test suite. Once you have passing tests we can run your PR against the CI 😄

These speedups are very welcome since they have an even bigger effect at scale! Let's try to find a solution that doesn't break the existing API together.

P.S., the Dependabot team is currently short-handed so we may take a while to respond.

Jul 20 '21 23:07 Nishnha

Thanks! Following on from @Nishnha comment. I don't this change would break any APIs as we don't expose this module externally and don't think anyone is relying on it directly.

I also think the interface change makes sense given to goal of only parsing a gemfile once, as there are usually multiple dependencies per file 👍

I've kicked off CI, new contributions need the actions run approved.

Jul 21 '21 09:07 feelepxyz

Thanks for looking into this @SkipKayhil, I think it'd be good to instrument this in a full dry-run rather than in isolation, as the parsing step only happens once during an entire update, so even though it's not very fast in isolation it might not make a huge difference overall. That's not to say this isn't worth doing, would just good to get a sense of the overall impact

Jul 21 '21 09:07 jurre

Thanks for looking into this @SkipKayhil, I think it'd be good to instrument this in a full dry-run rather than in isolation, as the parsing step only happens once during an entire update, so even though it's not very fast in isolation it might not make a huge difference overall. That's not to say this isn't worth doing, would just good to get a sense of the overall impact

That's a pretty good point, I hadn't thought about the other steps. So far I've been using dependabot as a library to gather dependency information but haven't yet used it to update things. In that scenario this change makes a pretty big impact, but after running the dry run script it looks like it doesn't end up being as much of an improvement in the whole process. I'll probably be looking into the UpdateChecker process in the future because that seems to be where the bulk of time is spent.

On my (not so powerful) laptop:

# main
$ time bin/dry-run.rb bundler discourse/discourse
...
real    43m9.203s
user    41m10.464s
sys     1m47.502s

And swapping out Benchmark.ips for Benchmark.measure in my original script:

# main
$ ruby benchmark_script.rb
14.032599   0.035577  14.623058 ( 14.636967)

# skipkayhil:gemfile-parse-optimization
$ ruby benchmark_script.rb
0.395320   0.000374   0.901548 (  0.899722)

Jul 22 '21 04:07 skipkayhil

I'll probably be looking into the UpdateChecker process in the future because that seems to be where the bulk of time is spent.

Nice 👍 I think most of the time is spent doing networking, but if we can find any optimizations there, that'd be fantastic!

Jul 22 '21 08:07 jurre

@skipkayhil thanks again for putting up this PR... are you still interested in driving it to completion? Given the millions of jobs that we run, we're always interested in perf improvements. 👍

Sep 15 '22 08:09 jeffwidman

are you still interested in driving it to completion?

Would definitely still love to see this merged! I just gave it a rebase and will update if anything fails CI

Sep 18 '22 01:09 skipkayhil

I had some initial concerns with merging this PR incase it changed some underlying behavior in how we parse dependencies.

Seeing this PR is passing our e2e smoke tests, which tests for consistent output, is reassuring!

Thanks for the massive speed up and benchmark script @skipkayhil - definitely going to try this out on our other ecosystems!

Sep 22 '22 23:09 Nishnha

dependabot-core
dependabot-core copied to clipboard

bundler: optimize gemfile parsing

Summary

Other Information

Benchmark Script

dependabot-core dependabot-core copied to clipboard

bundler: optimize gemfile parsing

Summary

Other Information

Benchmark Script

dependabot-core
dependabot-core copied to clipboard