Add an API to compare benchmark results
Clear and concise description of the problem
There is currently no way to compare benchmark results between different runs. For example, it's possible to change implementation details without changing the benchmark, and there is no way to confirm that this didn't introduce any regressions except for manually taking screenshots of the terminal.
Suggested solution
Provide a flag for the `vitest bench` command (like `--compare=./bench.json`) to compare the current benchmark run with a previous one. We already support dumping the benchmark results as JSON via `--reporter=json`, so that output can be reused for diffing.
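To make the idea concrete, here is a rough sketch of what the diffing step could do with two JSON dumps. The file shape (suite names mapped to result entries with an `hz` field) and the file names are assumptions for illustration, not Vitest's actual JSON reporter format.

```ts
// Rough sketch only: the dump shape and file names are assumed here,
// they are not Vitest's actual JSON reporter format.
import { readFileSync } from 'node:fs'

interface BenchEntry {
  name: string
  hz: number // operations per second
}

// assumed shape: suite name -> benchmark entries
type BenchDump = Record<string, BenchEntry[]>

function compareRuns(currentPath: string, baselinePath: string): void {
  const current: BenchDump = JSON.parse(readFileSync(currentPath, 'utf8'))
  const baseline: BenchDump = JSON.parse(readFileSync(baselinePath, 'utf8'))

  for (const [suite, entries] of Object.entries(current)) {
    for (const entry of entries) {
      const previous = baseline[suite]?.find(b => b.name === entry.name)
      if (!previous)
        continue
      // ratio > 1 means the current run is faster than the baseline
      const ratio = entry.hz / previous.hz
      console.log(`${suite} > ${entry.name}: ${ratio.toFixed(2)}x`)
    }
  }
}

compareRuns('./bench-current.json', './bench-main.json')
```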
Alternative
No response
Additional context
I think it should print the results the same way we do now, in a table, just also compare them with the previously stored results.
I think it might also be nice to store some information about the current machine (OS, cores, anything that can influence the result) in that benchmark and show it in the table.
Validations
- [X] Follow our Code of Conduct
- [X] Read the Contributing Guidelines.
- [X] Read the docs.
- [X] Check that there isn't already an issue that requests the same feature to avoid creating a duplicate.
What about https://codspeed.io or https://bencher.dev? I like how Bencher is calling it continuous benchmarking.
> What about https://codspeed.io or https://bencher.dev? I like how Bencher is calling it continuous benchmarking.
These are all external tools and they are meant to be used in CI. Since we don't control what machine is running the benchmark, this issue is only about a manual check on your own machine.
It's about iterating on the same function, not about catching regressions.
> It's about iterating on the same function, not about catching regressions.
Sort of comparing performance between two branches? Would that be possible without JSON being written to disk?
> Would that be possible without JSON being written to disk?
The idea I am describing in the issue is to use the result from the json reporter, which already dumps the values to disk.
Interesting idea!
Speaking of a `--compare` flag, I remembered this webpack plugin, which has a similar idea and compares the bundle stats between the current run and a baseline: https://github.com/relative-ci/bundle-stats/tree/master/packages/webpack-plugin#bundlestatswebpackpluginoptions
It looks like their approach is to require either `baseline: true` or `BUNDLE_STATS_BASELINE=true` to explicitly dump the current result as a baseline; that baseline is then automatically used for comparison on the following runs, since `compare: true` is the default.
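As a rough sketch of that pattern applied to Vitest, a config could gate where the JSON dump goes behind an environment variable. `benchmark.reporters` and `benchmark.outputFile` are existing options; the `BENCH_BASELINE` variable and the file names are made up for this example.

```ts
// vitest.config.ts, sketch of an explicit-baseline workflow loosely mirroring
// bundle-stats' `baseline: true` / BUNDLE_STATS_BASELINE=true switch.
// BENCH_BASELINE and the file names are made up for this example.
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    benchmark: {
      reporters: ['default', 'json'],
      // write to the baseline file only when explicitly requested;
      // later runs would then be compared against it
      outputFile: process.env.BENCH_BASELINE
        ? './bench-baseline.json'
        : './bench-current.json',
    },
  },
})
```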
We already support benchmark json output. We decided to remove the samples from it and introduce a new API to tinybench that allows comparing results with the provided benchmark run without actually running the code.
Hello @sheremet-va
That's actually a great idea. I'm working on skott, and especially on a pull request that is meant to compare skott's performance over time when running analysis on specific JS/TS projects.
The way I'm currently implementing result comparison only uses the n and n-1 versions, relying on git diff to witness the changes, but it would indeed be great to have a complete history and store more than one previous version. For that I tried CodSpeed, but the CI just takes too long; I don't know if it's related to the free tier, but the job takes more than 1h50 to complete, while just using the vitest bench API takes a few minutes, so it's not an option for me at all.
> I think it might also be nice to store some information about the current machine (OS, cores, anything that can influence the result) in that benchmark and show it in the table.
To be honest, my primary concern was to run in CI to limit hardware-related variations, even though adding OS-related information could be useful when running benchmarks from a local machine.
I feel like trying to diff a set of benchmarks and adding OS properties to the output might be confusing; you might want to do that at the filename level so that each OS combination gets its own diff?
Currently I'm rewriting the outputFile generated by vitest into something like `result-node-18.x.json`, and I'm not mentioning the OS because this CI job only runs on Ubuntu for now, even though it could become something like `node-18-ubuntu.json`. I also found it useful to add things like the git hash and the branch the benchmark was generated from.
So at some point I was wondering: should I create a vitest plugin and do something like CodSpeed, but instead of storing the data in a private cloud, just emit the data files to a dedicated location and provide a way to compare them over time? But if you're willing to integrate that into the core, it might not be that relevant. What do you think?
> To be honest, my primary concern was to run in CI to limit hardware-related variations, even though adding OS-related information could be useful when running benchmarks from a local machine.
Adding OS-related information is just to give a clear message that the benchmark might be off because it was run on another machine, or even to throw an error.
> Currently I'm rewriting the outputFile generated by vitest into something like `result-node-18.x.json`, and I'm not mentioning the OS because this CI job only runs on Ubuntu for now, even though it could become something like `node-18-ubuntu.json`. I also found it useful to add things like the git hash and the branch the benchmark was generated from.
How do you store the generated output in CI?
> Adding OS-related information is just to give a clear message that the benchmark might be off because it was run on another machine, or even to throw an error.
Yeah, better to have more information than not enough. Also, I'm not sure how stable the GitHub Actions hardware is when not using custom pools; default pools might have agents with more or fewer cores, so variations can indeed happen there too.
> How do you store the generated output in CI?
For now it's only a JSON file written and committed by the CI in my repo (at a specific location next to the benchmark files themselves), but it could be more sophisticated.
Consequently, for now I'm not storing the history of all benchmarks; I'm just overwriting the file each time a new PR runs with the new results, which lets me diff the n and n-1 versions. But it would be great to keep track of all the benchmarks nonetheless, to track big regressions/improvements over time, since n and n-1 is a very small dataset.
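Since machine metadata keeps coming up in this thread, here is a small hedged sketch of collecting it next to the stored results; the field names are arbitrary, and only the Node.js (`node:os`, `node:child_process`) and git calls are real. Such an object could be written alongside the benchmark JSON and checked before diffing, which is roughly the "warn or throw when the machine differs" idea above.

```ts
// Sketch: capture machine and git metadata next to a stored benchmark dump,
// so a later comparison can warn (or throw) when the environments differ.
// Field names are arbitrary; only the Node.js and git calls are real.
import os from 'node:os'
import { execSync } from 'node:child_process'

function collectBenchMeta() {
  return {
    os: `${os.platform()} ${os.release()}`,
    arch: os.arch(),
    cpus: os.cpus().length,
    node: process.version,
    gitHash: execSync('git rev-parse HEAD').toString().trim(),
    gitBranch: execSync('git rev-parse --abbrev-ref HEAD').toString().trim(),
    date: new Date().toISOString(),
  }
}
```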
Assuming that the use case is mostly a local comparison, I made a prototype as a custom reporter: https://stackblitz.com/edit/vitest-dev-vitest-mm6syc?file=vite.config.ts
demo
```
# I used VITEST_BENCH_xxx env vars for the prototype

# save bench data on the main branch
# (--benchmark.outputFile=main.json)
$ VITEST_BENCH_OUTPUT_FILE=main.json npx vitest bench --run
...

# suppose you switched branches and compare against main
# (--benchmark.compare=main.json)
$ VITEST_BENCH_COMPARE=main.json npx vitest bench --run
...
 RUN  v1.4.0 /home/projects/vitest-dev-vitest-mm6syc

[... current default reporter ...]
 ✓ test/basic.bench.ts (2) 2483ms
   ✓ sort (2) 2471ms
     name         hz      min      max     mean      p75      p99     p995     p999      rme  samples
   · normal  58.7590  13.7350  40.1650  17.0187  16.4350  40.1650  40.1650  40.1650  ±13.39%       30  fastest
   · reverse 10.6565  74.6000   115.93  93.8395   114.16   115.93   115.93   115.93  ±14.26%       10

[... custom compare reporter ...]
[BENCH] Comparison
  current  : bench-default.json
  baseline : main.json

  sort
    normal   58.759hz  [baseline: 73.419hz]  [change: 0.80x ⇓]
    reverse  10.656hz  [baseline: 10.870hz]  [change: 0.98x ⇓]
```
A few things I noticed:
- Isn't this technically a reporter feature? If so, having a flag only for this reporter seems unusual. But being able to quickly choose `--compare` from the CLI would be important, so I think it's fine to make this a special flag.
- Currently the json reporter somehow lacks the bench filename (the data is only `{ suitename: BenchmarkResult[] }`). So, to persist the previous run, we can probably forget about the current json reporter and start with a better internal format. For this, we can also exclude the gigantic `samples` because I don't think it's needed for the comparison report.
FYI, I was also looking around at prior art, and for example this one https://bheisler.github.io/criterion.rs/book/user_guide/command_line_options.html#baselines has three flags for comparison purposes:
- `--save-baseline`: like `--benchmark.outputFile`
- `--baseline`: like the proposed `--compare`
- `--load-baseline`: this allows skipping the benchmark run and loading an existing result as the current run.
> We decided to remove the samples from it and introduce a new API to tinybench that allows comparing results with the provided benchmark run without actually running the code.
@sheremet-va Can you elaborate on this? I thought this was a Vitest benchmark reporter feature, but you want to move this feature to tinybench?
> Assuming that the use case is mostly a local comparison, I made a prototype as a custom reporter
If I remember correctly, the reporter already gets the result sorted - so we can't do this in our own reporter because custom reporters would have to reimplement it.
> Can you elaborate on this? I thought this was a Vitest benchmark reporter feature, but you want to move this feature to tinybench?
I wanted to make sure that tinybench supports providing results based on benchmarks that are not actually running.
> Isn't this technically a reporter feature? If so, having a flag only for this reporter seems unusual.
If this is a first-class feature, I think it is fine. We already expose some flags that are only relevant for reporters.
> Currently the json reporter somehow lacks the bench filename
We can change the format of the JSON output. Benchmark is an experimental feature and doesn't follow semver.
> If I remember correctly, the reporter already gets the result sorted - so we can't do this in our own reporter because custom reporters would have to reimplement it.
What do you mean by "sorted"? I don't know how the current default TTY reporter works, but if the comparison summary is required only at the end, like in my prototype, then it has complete information via `onFinished(files: File[])`, where `Test.result.benchmark` holds the full tinybench output `TaskResult`.
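For illustration, a minimal sketch of that reporter approach, assuming only the `onFinished(files)` hook and the `task.result.benchmark` field mentioned above; the baseline path, its shape, and the `CompareReporter` class are hypothetical.

```ts
// Minimal sketch of a custom reporter along these lines: at the end of the
// run, read each task's tinybench result from `task.result.benchmark` (as
// described above) and compare its `hz` against a stored baseline file.
// The baseline path and its shape ({ [benchName]: { hz } }) are assumptions.
import { readFileSync } from 'node:fs'
import type { File, Task } from 'vitest'

function collectLeafTasks(tasks: Task[], out: Task[] = []): Task[] {
  for (const task of tasks) {
    if ('tasks' in task)
      collectLeafTasks(task.tasks, out) // suite: recurse into children
    else
      out.push(task) // test/benchmark leaf
  }
  return out
}

export default class CompareReporter {
  // Vitest calls this reporter hook once all files have finished
  onFinished(files: File[] = []) {
    const baseline = JSON.parse(readFileSync('./bench-baseline.json', 'utf8'))
    for (const task of collectLeafTasks(files.flatMap(f => f.tasks))) {
      const bench = (task.result as any)?.benchmark
      const prev = baseline[task.name]
      if (!bench || !prev)
        continue
      console.log(
        `${task.name}: ${bench.hz.toFixed(3)}hz`
        + ` [baseline: ${prev.hz.toFixed(3)}hz]`
        + ` [change: ${(bench.hz / prev.hz).toFixed(2)}x]`,
      )
    }
  }
}
```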
> What do you mean by "sorted"?
The result in `suite.result.benchmark` is sorted based on the median result (fastest first). I expect previous benchmarks to appear in the same table, sorted and marked like `${name} (previous)` or something.
This is just my expectation based on what we already do. If other frameworks do it differently and more ergonomically, then we can follow them, but I would expect to see the difference in all table columns, no?
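A rough sketch of that expectation: merge the baseline rows into the current list with a `(previous)` suffix and re-sort fastest first. Sorting by `hz` here is a stand-in for the median-based ordering, and the `Row` shape is simplified for illustration.

```ts
// Sketch: merge baseline rows into the current table with a "(previous)"
// suffix and re-sort fastest first. Sorting by `hz` is a stand-in for the
// median-based ordering of the default reporter.
interface Row {
  name: string
  hz: number
}

function mergeWithBaseline(current: Row[], baseline: Row[]): Row[] {
  const previous = baseline.map(row => ({
    ...row,
    name: `${row.name} (previous)`,
  }))
  return [...current, ...previous].sort((a, b) => b.hz - a.hz)
}
```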
> I expect previous benchmarks to appear in the same table, sorted and marked like `${name} (previous)` or something.
I might still be missing something, but let me confirm with my PR. My proof of concept currently shows the comparison separately at the end, but I think it's possible to show it together in the current table like you expect, without changing anything in tinybench. Let me try that in the PR.
It's a rough mockup, but it's possible to create a table like this:
@sheremet-va Is the direction okay, or did you have something else in mind?
> Is the direction okay, or did you have something else in mind?
Yes, this is how I imagined it. I am not qualified enough to tell if it's better this way, so I would wait for more feedback.
What we can also do is reverse the table and duplicate the column instead:
https://github.com/google/benchmark/blob/main/docs/tools.md#modes-of-operation
Also, maybe we should print two tables: the first is the same as we do now, and the second one has a difference in each field:
Side note: it would be awesome if we could create a graph 👀 https://pytest-benchmark.readthedocs.io/en/latest/comparing.html