Add an API to compare benchmark results
Clear and concise description of the problem
There is currently no way to compare benchmark results between different runs. For example, it's possible to change implementation details without changing the benchmark, and there is no way to confirm that this didn't introduce any regressions except for manually taking screenshots of the terminal.
Suggested solution
Provide a flag for the `vitest bench` command (like `--compare=./bench.json`) to compare the current benchmark run with a previous one. We already support dumping the benchmark results as JSON via `--reporter=json`, so that output can be reused for diffing.
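To make the idea concrete, here is a rough sketch of what the diffing step could do with two JSON dumps. The file shape (suite names mapped to result entries with an `hz` field) and the file names are assumptions for illustration, not Vitest's actual JSON reporter format.

```ts
// Rough sketch only: the dump shape and file names are assumed here,
// they are not Vitest's actual JSON reporter format.
import { readFileSync } from 'node:fs'

interface BenchEntry {
  name: string
  hz: number // operations per second
}

// assumed shape: suite name -> benchmark entries
type BenchDump = Record<string, BenchEntry[]>

function compareRuns(currentPath: string, baselinePath: string): void {
  const current: BenchDump = JSON.parse(readFileSync(currentPath, 'utf8'))
  const baseline: BenchDump = JSON.parse(readFileSync(baselinePath, 'utf8'))

  for (const [suite, entries] of Object.entries(current)) {
    for (const entry of entries) {
      const previous = baseline[suite]?.find(b => b.name === entry.name)
      if (!previous)
        continue
      // ratio > 1 means the current run is faster than the baseline
      const ratio = entry.hz / previous.hz
      console.log(`${suite} > ${entry.name}: ${ratio.toFixed(2)}x`)
    }
  }
}

compareRuns('./bench-current.json', './bench-main.json')
```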
Alternative
No response
Additional context
I think it should print the results the same way we do now, in a table, just also compare them with the previously stored results.
I think it might also be nice to store some information about the current machine (OS, cores, anything that can influence the result) in that benchmark and show it in the table.
Validations
- [X] Follow our Code of Conduct
- [X] Read the Contributing Guidelines.
- [X] Read the docs.
- [X] Check that there isn't already an issue that requests the same feature to avoid creating a duplicate.
What about https://codspeed.io or https://bencher.dev? I like how Bencher is calling it continuous benchmarking.
> What about https://codspeed.io or https://bencher.dev? I like how Bencher is calling it continuous benchmarking.
These are all external tools and they are meant to be used in CI. Since we don't control what machine is running the benchmark, this issue is only about a manual check on your own machine.
It's about iterating on the same function, not about catching regressions.
> It's about iterating on the same function, not about catching regressions.
Sort of comparing performance between two branches? Would that be possible without JSON being written to disk?
> Would that be possible without JSON being written to disk?
The idea I am describing in the issue is to use the result from the json reporter, which already dumps the values to disk.
Interesting idea!
Speaking of a `--compare` flag, I remembered this webpack plugin, which has a similar idea and compares the bundle stats between the current run and a baseline: https://github.com/relative-ci/bundle-stats/tree/master/packages/webpack-plugin#bundlestatswebpackpluginoptions
It looks like their approach is to require either `baseline: true` or `BUNDLE_STATS_BASELINE=true` to explicitly dump the current result as a baseline; that baseline is then automatically used for comparison on the following runs, since `compare: true` is the default.
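As a rough sketch of that pattern applied to Vitest, a config could gate where the JSON dump goes behind an environment variable. `benchmark.reporters` and `benchmark.outputFile` are existing options; the `BENCH_BASELINE` variable and the file names are made up for this example.

```ts
// vitest.config.ts, sketch of an explicit-baseline workflow loosely mirroring
// bundle-stats' `baseline: true` / BUNDLE_STATS_BASELINE=true switch.
// BENCH_BASELINE and the file names are made up for this example.
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    benchmark: {
      reporters: ['default', 'json'],
      // write to the baseline file only when explicitly requested;
      // later runs would then be compared against it
      outputFile: process.env.BENCH_BASELINE
        ? './bench-baseline.json'
        : './bench-current.json',
    },
  },
})
```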
We already support benchmark json output. We decided to remove the samples from it and introduce a new API to tinybench that allows comparing results with the provided benchmark run without actually running the code.
Hello @sheremet-va
That's actually a great idea. I'm working on skott, and especially on a pull request that is meant to compare skott's performance over time when running analysis on specific JS/TS projects.
The way I'm currently implementing result comparison only uses the n and n-1 versions, relying on git diff to witness the changes, but it would indeed be great to have a complete history and store more than one previous version. For that I tried CodSpeed, but the CI just takes too long; I don't know if it's related to the free tier, but the job takes more than 1h50 to complete, while just using the vitest bench API takes a few minutes, so it's not an option for me at all.
> I think it might also be nice to store some information about the current machine (OS, cores, anything that can influence the result) in that benchmark and show it in the table.
To be honest, my primary concern was to run in CI to limit hardware-related variations, even though adding OS-related information could be useful when running benchmarks from a local machine.
I feel like trying to diff a set of benchmarks and adding OS properties to the output might be confusing; you might want to do that at the filename level so that each OS combination gets its own diff?
Currently I'm rewriting the outputFile generated by vitest into something like `result-node-18.x.json`, and I'm not mentioning the OS because this CI job only runs on Ubuntu for now, even though it could become something like `node-18-ubuntu.json`. I also found it useful to add things like the git hash and the branch the benchmark was generated from.
So at some point I was wondering: should I create a vitest plugin and do something like CodSpeed, but instead of storing the data in a private cloud, just emit the data files to a dedicated location and provide a way to compare them over time? But if you're willing to integrate that into the core, it might not be that relevant. What do you think?
> To be honest, my primary concern was to run in CI to limit hardware-related variations, even though adding OS-related information could be useful when running benchmarks from a local machine.
Adding OS-related information is just to give a clear message that the benchmark might be off because it was run on another machine, or even to throw an error.
> Currently I'm rewriting the outputFile generated by vitest into something like `result-node-18.x.json`, and I'm not mentioning the OS because this CI job only runs on Ubuntu for now, even though it could become something like `node-18-ubuntu.json`. I also found it useful to add things like the git hash and the branch the benchmark was generated from.
How do you store the generated output in CI?
> Adding OS-related information is just to give a clear message that the benchmark might be off because it was run on another machine, or even to throw an error.
Yeah, better to have more information than not enough. Also, I'm not sure how stable the GitHub Actions hardware is when not using custom pools; default pools might have agents with more or fewer cores, so variations can indeed happen there too.
> How do you store the generated output in CI?
For now it's only a JSON file written and committed by the CI in my repo (at a specific location next to the benchmark files themselves), but it could be more sophisticated.
Consequently, for now I'm not storing the history of all benchmarks; I'm just overwriting the file each time a new PR runs with the new results, which lets me diff the n and n-1 versions. But it would be great to keep track of all the benchmarks nonetheless, to track big regressions/improvements over time, since n and n-1 is a very small dataset.
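Since machine metadata keeps coming up in this thread, here is a small hedged sketch of collecting it next to the stored results; the field names are arbitrary, and only the Node.js (`node:os`, `node:child_process`) and git calls are real. Such an object could be written alongside the benchmark JSON and checked before diffing, which is roughly the "warn or throw when the machine differs" idea above.

```ts
// Sketch: capture machine and git metadata next to a stored benchmark dump,
// so a later comparison can warn (or throw) when the environments differ.
// Field names are arbitrary; only the Node.js and git calls are real.
import os from 'node:os'
import { execSync } from 'node:child_process'

function collectBenchMeta() {
  return {
    os: `${os.platform()} ${os.release()}`,
    arch: os.arch(),
    cpus: os.cpus().length,
    node: process.version,
    gitHash: execSync('git rev-parse HEAD').toString().trim(),
    gitBranch: execSync('git rev-parse --abbrev-ref HEAD').toString().trim(),
    date: new Date().toISOString(),
  }
}
```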
Assuming that the use case is mostly a local comparison, I made a prototype as a custom reporter: https://stackblitz.com/edit/vitest-dev-vitest-mm6syc?file=vite.config.ts
demo
```
# I used VITEST_BENCH_xxx env vars for the prototype

# save bench data on the main branch
# (--benchmark.outputFile=main.json)
$ VITEST_BENCH_OUTPUT_FILE=main.json npx vitest bench --run
...

# suppose you switched branches and compare against main
# (--benchmark.compare=main.json)
$ VITEST_BENCH_COMPARE=main.json npx vitest bench --run
...
 RUN  v1.4.0 /home/projects/vitest-dev-vitest-mm6syc

[... current default reporter ...]
 ✓ test/basic.bench.ts (2) 2483ms
   ✓ sort (2) 2471ms
     name         hz      min      max     mean      p75      p99     p995     p999      rme  samples
   · normal  58.7590  13.7350  40.1650  17.0187  16.4350  40.1650  40.1650  40.1650  ±13.39%       30  fastest
   · reverse 10.6565  74.6000   115.93  93.8395   114.16   115.93   115.93   115.93  ±14.26%       10

[... custom compare reporter ...]
[BENCH] Comparison
  current  : bench-default.json
  baseline : main.json

  sort
    normal   58.759hz  [baseline: 73.419hz]  [change: 0.80x ⇓]
    reverse  10.656hz  [baseline: 10.870hz]  [change: 0.98x ⇓]
```
A few things I noticed:
- Isn't this technically a reporter feature? If so, having a flag only for this reporter seems unusual. But being able to quickly choose `--compare` from the CLI would be important, so I think it's fine to make this a special flag.
- Currently the json reporter somehow lacks the bench filename (the data is only `{ suitename: BenchmarkResult[] }`). So, to persist the previous run, we can probably forget about the current json reporter and start with a better internal format. For this, we can also exclude the gigantic `samples` because I don't think it's needed for the comparison report.
FYI, I was also looking around at prior art, and for example this one https://bheisler.github.io/criterion.rs/book/user_guide/command_line_options.html#baselines has three flags for comparison purposes:
- `--save-baseline`: like `--benchmark.outputFile`
- `--baseline`: like the proposed `--compare`
- `--load-baseline`: this allows skipping the benchmark run and loading an existing result as the current run.
> We decided to remove the samples from it and introduce a new API to tinybench that allows comparing results with the provided benchmark run without actually running the code.
@sheremet-va Can you elaborate on this? I thought this was a Vitest benchmark reporter feature, but you want to move this feature to tinybench?
> Assuming that the use case is mostly a local comparison, I made a prototype as a custom reporter
If I remember correctly, the reporter already gets the result sorted - so we can't do this in our own reporter because custom reporters would have to reimplement it.
> Can you elaborate on this? I thought this was a Vitest benchmark reporter feature, but you want to move this feature to tinybench?
I wanted to make sure that tinybench supports providing results based on benchmarks that are not actually running.
> Isn't this technically a reporter feature? If so, having a flag only for this reporter seems unusual.
If this is a first-class feature, I think it is fine. We already expose some flags that are only relevant for reporters.
> Currently the json reporter somehow lacks the bench filename
We can change the format of the JSON output. Benchmark is an experimental feature and doesn't follow semver.
> If I remember correctly, the reporter already gets the result sorted - so we can't do this in our own reporter because custom reporters would have to reimplement it.
What do you mean by "sorted"? I don't know how the current default TTY reporter works, but if the comparison summary is required only at the end, like in my prototype, then it has complete information via `onFinished(files: File[])`, where `Test.result.benchmark` holds the full tinybench output `TaskResult`.
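For illustration, a minimal sketch of that reporter approach, assuming only the `onFinished(files)` hook and the `task.result.benchmark` field mentioned above; the baseline path, its shape, and the `CompareReporter` class are hypothetical.

```ts
// Minimal sketch of a custom reporter along these lines: at the end of the
// run, read each task's tinybench result from `task.result.benchmark` (as
// described above) and compare its `hz` against a stored baseline file.
// The baseline path and its shape ({ [benchName]: { hz } }) are assumptions.
import { readFileSync } from 'node:fs'
import type { File, Task } from 'vitest'

function collectLeafTasks(tasks: Task[], out: Task[] = []): Task[] {
  for (const task of tasks) {
    if ('tasks' in task)
      collectLeafTasks(task.tasks, out) // suite: recurse into children
    else
      out.push(task) // test/benchmark leaf
  }
  return out
}

export default class CompareReporter {
  // Vitest calls this reporter hook once all files have finished
  onFinished(files: File[] = []) {
    const baseline = JSON.parse(readFileSync('./bench-baseline.json', 'utf8'))
    for (const task of collectLeafTasks(files.flatMap(f => f.tasks))) {
      const bench = (task.result as any)?.benchmark
      const prev = baseline[task.name]
      if (!bench || !prev)
        continue
      console.log(
        `${task.name}: ${bench.hz.toFixed(3)}hz`
        + ` [baseline: ${prev.hz.toFixed(3)}hz]`
        + ` [change: ${(bench.hz / prev.hz).toFixed(2)}x]`,
      )
    }
  }
}
```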
> What do you mean by "sorted"?
The result in `suite.result.benchmark` is sorted based on the median result (fastest first). I expect previous benchmarks to appear in the same table, sorted and marked like `${name} (previous)` or something.
This is just my expectation based on what we already do. If other frameworks do it differently and more ergonomically, then we can follow them, but I would expect to see the difference in all table columns, no?
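A rough sketch of that expectation: merge the baseline rows into the current list with a `(previous)` suffix and re-sort fastest first. Sorting by `hz` here is a stand-in for the median-based ordering, and the `Row` shape is simplified for illustration.

```ts
// Sketch: merge baseline rows into the current table with a "(previous)"
// suffix and re-sort fastest first. Sorting by `hz` is a stand-in for the
// median-based ordering of the default reporter.
interface Row {
  name: string
  hz: number
}

function mergeWithBaseline(current: Row[], baseline: Row[]): Row[] {
  const previous = baseline.map(row => ({
    ...row,
    name: `${row.name} (previous)`,
  }))
  return [...current, ...previous].sort((a, b) => b.hz - a.hz)
}
```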
> I expect previous benchmarks to appear in the same table, sorted and marked like `${name} (previous)` or something.
I might still be missing something, but let me confirm with my PR. My proof of concept currently shows the comparison separately at the end, but I think it's possible to show it together in the current table like you expect, without changing anything in tinybench. Let me try that in the PR.
It's a rough mockup, but it's possible to create a table like this:
@sheremet-va Is the direction okay, or did you have something else in mind?
> Is the direction okay, or did you have something else in mind?
Yes, this is how I imagined it. I am not qualified enough to tell if it's better this way, so I would wait for more feedback.
What we can also do is reverse the table and duplicate the column instead:
https://github.com/google/benchmark/blob/main/docs/tools.md#modes-of-operation
Also, maybe we should print two tables: the first is the same as we do now, and the second one has a difference in each field:
Side note: it would be awesome if we could create a graph 👀 https://pytest-benchmark.readthedocs.io/en/latest/comparing.html