
Prototype benchmark code

Open · alexandrebouchard opened this issue 8 months ago · 15 comments

alexandrebouchard · Mar 28 '25 22:03

I was thinking that, instead of / in addition to creating new artificial tests, we could just grab the timing information from the DefaultTestSet structure. We could either do this at the top level only or recursively for every single testset.
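
Something like this is what I have in mind (just a rough sketch; it assumes Julia >= 1.8, where DefaultTestSet carries time_start/time_end, and those are Test-stdlib internals rather than public API, so it could need tweaks across Julia versions):

```julia
using Test

# Recursively collect (testset description => elapsed seconds) from a finished
# DefaultTestSet. NOTE: time_start/time_end are Test-stdlib internals
# (Julia >= 1.8), not public API, so this may break across Julia versions.
function collect_timings(ts::Test.DefaultTestSet, out = Pair{String,Float64}[])
    if ts.time_end !== nothing
        push!(out, ts.description => ts.time_end - ts.time_start)
    end
    for r in ts.results
        r isa Test.DefaultTestSet && collect_timings(r, out)
    end
    return out
end

# Hypothetical usage: wrap the suite in one top-level testset and harvest it.
# ts = @testset "Pigeons" begin include("runtests.jl") end
# collect_timings(ts)
```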

miguelbiron · Mar 29 '25 18:03

@miguelbiron do you mean that we'd collect/report benchmarking results for the tests in our test suite? I'm not sure there will be a tonne of overlap between the things we want to benchmark and the things we want to use for unit testing.

trevorcampbell · Mar 29 '25 20:03

That's true... Also there's no alloc data in the testset.

miguelbiron · Mar 29 '25 20:03

@miguelbiron @alexandrebouchard I've added the prototype code for the workflows. A brief explanation:

  • the current benchmarking results for main are always kept up to date in test/benchmark.csv
  • any time there is a commit to main, benchmark_update.yml runs, regenerates the benchmarking results, and commits them back to main.
  • any time there is a PR targeting main, benchmark_compare.yml runs and compares the benchmarking results stored in the CSV file against results produced by the new PR code. The comparison is posted to the PR thread as a markdown table in a comment. Whenever the PR is updated, the comparison re-runs and the markdown table is updated.
  • both workflows have a 60 min timeout
  • test/benchmark.jl is responsible for running the benchmark suite (I made a minor modification to the format of the CSV)
  • test/compare_benchmarks.jl is responsible for comparing two benchmarking CSVs and outputting the difference as a markdown table (rough sketch of the idea below, after this list)
  • Right now this runs on a GitHub-hosted runner, but once we are ready to spin up our own runner, a small change (currently commented out) makes it self-hosted
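
For the record, this is roughly the idea behind the comparison step (just a sketch, not the actual compare_benchmarks.jl code; the :test / :time_s column names are placeholders for whatever the CSVs really contain):

```julia
using CSV, DataFrames

# Join main-branch and PR benchmark CSVs and emit a markdown diff table.
# Column names (:test, :time_s) are illustrative placeholders, not the real schema.
function benchmark_diff_markdown(main_csv::AbstractString, pr_csv::AbstractString)
    main = CSV.read(main_csv, DataFrame)
    pr   = CSV.read(pr_csv, DataFrame)
    df   = innerjoin(main, pr; on = :test, renamecols = "_main" => "_pr")
    lines = ["| test | main (s) | PR (s) | change |",
             "|------|----------|--------|--------|"]
    for row in eachrow(df)
        rel = 100 * (row.time_s_pr - row.time_s_main) / row.time_s_main
        push!(lines, "| $(row.test) | $(round(row.time_s_main; digits = 3)) | " *
                     "$(round(row.time_s_pr; digits = 3)) | $(round(rel; digits = 1))% |")
    end
    return join(lines, "\n")
end
```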

One issue we should fix before merging: both benchmark.jl and compare_benchmarks.jl trigger Julia to download/install/build all kinds of packages in the workflow. The benchmarking itself is almost instantaneous, but the package building/installation is super slow (especially for compare_benchmarks.jl, it's something like 15-20 minutes).

I am not entirely sure how we do environment setup in the benchmark.jl script, but I just copied it for compare_benchmarks.jl -- I imagine we can probably be smarter about that somehow? Thoughts?
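
One option (just a sketch of how it could be structured, not what the scripts do today) would be for both scripts to activate a single shared environment next to them and instantiate it once:

```julia
using Pkg

# Both scripts could activate the same shared environment next to this file and
# instantiate it once, instead of resolving/building packages independently.
Pkg.activate(@__DIR__)
Pkg.instantiate()
```

On CI, pairing that with a dependency cache (something like julia-actions/cache) should remove most of the 15-20 minutes, since the depot would survive between runs.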

trevorcampbell · Mar 29 '25 22:03

I also had an idea of something neat we could do: we could include a plot of benchmarking results as a function of time in the Pigeons documentation.

This command:

git log --pretty=format:"%H" --follow -- test/benchmark.csv

will spit out a list of git commit hashes corresponding to commits where test/benchmark.csv was changed. When we build the docs, we can iterate through those, collect all the results CSVs, join them, create PlotlyJS plots for each test, and output an HTML file that is linked from the docs.
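
Rough sketch of that docs-build step (assuming CSV/DataFrames/PlotlyJS are available in the docs environment; :test / :time_s are placeholder column names):

```julia
using CSV, DataFrames, PlotlyJS

# Collect every historical version of test/benchmark.csv.
# Column names (:test, :time_s) are placeholders for whatever the CSV contains.
hashes = readlines(`git log --pretty=format:"%H" --follow -- test/benchmark.csv`)
dfs = DataFrame[]
for h in hashes
    csv = read(`git show $h:test/benchmark.csv`, String)
    df  = CSV.read(IOBuffer(csv), DataFrame)
    df[!, :commit] .= first(h, 8)   # tag each snapshot with its short commit hash
    push!(dfs, df)
end
history = vcat(dfs...; cols = :union)

# One trace per benchmark target, dumped to a standalone HTML file for the docs.
traces = [scatter(x = g.commit, y = g.time_s, name = first(g.test), mode = "lines+markers")
          for g in groupby(history, :test)]
savefig(plot(traces), "benchmark_history.html")
```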

Thoughts?

trevorcampbell · Mar 29 '25 22:03

Ah, one more thing @alexandrebouchard: we should probably modify test/benchmark.jl to run more than one trial for each benchmark and report the median and 25th/75th percentiles, or something like that.

(At least on mvn1000 it seems the results vary by around 10% regularly for a single trial)
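
Something like this is what I have in mind (sketch only; run_target is a stand-in for whatever test/benchmark.jl actually times):

```julia
using Statistics

# Run one benchmark target several times and summarize with median and quartiles.
# `run_target` is a placeholder for whatever test/benchmark.jl currently runs.
function timed_trials(run_target; ntrials::Int = 5)
    times = [@elapsed run_target() for _ in 1:ntrials]
    return (median = median(times),
            q25    = quantile(times, 0.25),
            q75    = quantile(times, 0.75))
end
```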

trevorcampbell · Mar 29 '25 22:03

Ditto the last comment, and we also probably need some basic metadata about the environment. At the very least, we should store the exact Julia version used.
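
Something like this is all I'm thinking of (sketch only; the field names are placeholders, not a proposed schema):

```julia
using InteractiveUtils   # for versioninfo outside the REPL

# Basic facts about the environment that produced the run; field names here are
# placeholders, not the actual benchmark.csv schema.
metadata = (
    julia_version = string(VERSION),
    os            = string(Sys.KERNEL),
    arch          = string(Sys.ARCH),
    nthreads      = Threads.nthreads(),
    versioninfo   = sprint(versioninfo),   # full (non-verbose) versioninfo text
)
```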

miguelbiron · Mar 29 '25 22:03

Codecov Report

All modified and coverable lines are covered by tests. Project coverage is 87.82%. Comparing base (e492fa7) to head (c837962). Note: report is 34 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #333      +/-   ##
==========================================
+ Coverage   87.40%   87.82%   +0.42%     
==========================================
  Files         107      107              
  Lines        2660     2654       -6     
==========================================
+ Hits         2325     2331       +6     
+ Misses        335      323      -12     


codecov[bot] · Mar 30 '25 19:03

@alexandrebouchard @trevorcampbell the failing docs are due to the old non-reproducible errors that seem to have been fixed with the new DPPL version. We should just merge this PR so that we can then merge #328, which doesn't suffer from this.

miguelbiron · Mar 30 '25 21:03

OK I think I'm giving up on reducing precompilation time for now. Let's just push that off to a later PR and get this one sorted.

Seems like the last TODOs here are:

  • include basic metainfo in the results CSV
  • run each benchmark more than once to average and get estimates of standard error
  • store CI console log and Manifest files somewhere

trevorcampbell · Apr 18 '25 07:04

OK @alexandrebouchard this should be all sorted now. Example PR thread message with diff below:

[screenshot: example PR comment with the benchmark diff table]

trevorcampbell · Aug 16 '25 06:08

Ah actually @alexandrebouchard one thing I'd like your eye on before we merge is how packages are activated/installed/used in benchmark.jl and compare_benchmarks.jl

You'll notice the using Pigeons call takes 25 s (previously it was 2 s) according to the benchmark, which seems off to me (maybe that includes some precompilation, not sure -- you have a better grasp of how Julia manages packages and the Pigeons ecosystem than I do).
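
One way to check whether it's precompilation (just a guess at the diagnosis) would be to precompile explicitly before timing the load:

```julia
using Pkg

# Precompile the environment first, so the subsequent `using` only measures code
# loading. If the 25 s drops back to ~2 s, precompilation was the culprit.
Pkg.precompile()
load_time = @elapsed @eval using Pigeons
@info "Pigeons load time (s)" load_time
```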

But once that's sorted we can merge, and I'll get the runner set up properly for this repo.

trevorcampbell · Aug 16 '25 06:08

Improved the look & feel of the diff table and included metainfo in the diff.

[screenshot: updated diff table with metainfo included]

trevorcampbell · Aug 18 '25 00:08

This looks really great and will be super useful. I'll add more targets very shortly; I'm building a collection.

alexandrebouchard · Aug 18 '25 13:08

@alexandrebouchard FYI I replaced versioninfo(verbose=true) with just versioninfo() to keep the noisy/unreliable CPU measurements from polluting the diffs. It still has almost all the information we'd want (except memory, but c'est la vie).

trevorcampbell · Aug 20 '25 19:08