What are the design goals for a benchmark runner?
Some thoughts on the design of a benchmark runner. If we agree I can make a PR documenting these design decisions in a file. WDYT?
Single benchmark run per process
I think there are good arguments for making each run of a benchmark its own process (a rough sketch of such a driver follows the list below). Advantages:
- More accurately reproduces Web conditions, where the majority of instantiations are ephemeral
- Includes startup time in the result
- Allows for comparability between runs. We can expect the 5th run in a process to behave differently from the first due to warmup, which makes statistical aggregation between multiple runs less consistent.
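As a sketch of what that per-process driving could look like: everything here is an assumption, not part of the proposal itself. It supposes Node.js as the outer driver, a d8-style shell on PATH ("d8" is a placeholder), and a hypothetical harness.js that prints one elapsed-time number.

```js
// Hypothetical outer driver: one fresh shell process per measurement, so no
// run inherits another run's warmup state.
const { execFileSync } = require("child_process");

function runOnce(shell, harness) {
  const out = execFileSync(shell, [harness], { encoding: "utf8" });
  return Number(out.trim()); // elapsed time reported by the in-process harness
}

// Ten independent processes for one benchmark.
const samples = Array.from({ length: 10 }, () => runOnce("d8", "harness.js"));
console.log(samples);
```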
Instantiate from JavaScript
Although there are other WebAssembly environments without JS that are worthy of testing, I think it's reasonable to start benchmarking the JS+Wasm combination, especially given the many embedder-related proposals such as anyref or BigInt/i64.
Headless
Although testing in a browser environment is valuable, there is a lot of utility in a headless benchmark set, especially in the context of microbenchmarks and kernels. It's also easier to start with a portable headless benchmark.
This point and the previous point mean that the tests should be run under V8's d8 shell, SpiderMonkey's js shell, and WebKit's jsc.
Metric: Elapsed time to promise fulfillment
A benchmark should be invoked by calling a JavaScript function, potentially with arguments. The function should return a promise. If the promise is rejected, the benchmark runner reports failure and exits with a non-zero error code. If it is fulfilled, the benchmark runner records the elapsed time.
This approach allows the benchmark run to include the WebAssembly.instantiate time and other related startup times.
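A minimal sketch of that measurement, with illustrative names and shell-style print/quit helpers assumed (availability of performance.now() and dynamic import varies between shells; the real runner may differ):

```js
// Time a single benchmark from invocation of run() to fulfillment of the
// promise it returns; a rejection means failure and a non-zero exit.
async function measure(benchmarkModulePath, env, args) {
  const mod = await import(benchmarkModulePath);
  const start = performance.now();
  try {
    const result = await mod.run(env, ...args);
    const elapsedMs = performance.now() - start; // includes instantiate time
    return { elapsedMs, result };
  } catch (e) {
    print("FAIL: " + e);
    quit(1);
  }
}
```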
One directory == one benchmark
A benchmark named foo should have a directory foo/, containing at least main.js. That main.js should be an ES2015 module which exports async function run(env, arg...). The env argument comes from the benchmark runner, and includes a function env.readBinaryFile(name) which can read a binary file (usually a .wasm file) from the benchmark's directory. To run the benchmark, the runner will do a dynamic import('foo/main.js') and invoke the run function, potentially with arguments.
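For concreteness, here is what a minimal foo/main.js could look like under that convention; the .wasm file name and the kernel export are made up for illustration.

```js
// foo/main.js -- a minimal benchmark module following the convention above.
export async function run(env, iterations = 1) {
  // env.readBinaryFile is provided by the benchmark runner.
  const bytes = env.readBinaryFile("foo.wasm");
  const { instance } = await WebAssembly.instantiate(bytes, {});
  let result = 0;
  for (let i = 0; i < iterations; i++) {
    result = instance.exports.kernel(result); // hypothetical wasm export
  }
  return result; // checked against the expected value in the manifest
}
```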
Checking output
A benchmark only succeeds if the promise's result matches the expected value. The expected value is recorded in the manifest of known benchmarks. Benchmark authors should make sure that the computation performed by the benchmark is necessary to produce the output, i.e. that the benchmark itself cannot be eliminated as dead code.
If the benchmark outputs messages to the console -- necessarily via the embedder -- then the benchmark may choose to return the entire output as a string, a checksum, or just the last line, as appropriate. Checking the entire output is best from a maintenance perspective, but for voluminous output it may be necessary to keep only a running checksum. Or, if the program is functional in nature and prints its result as the last line, it can make sense to compare just the last line.
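As an illustration of the checksum option, a benchmark could fold each line of console output into a running 32-bit hash and return that. The algorithm below (an FNV-1a-style hash over character codes) is just one possible choice, not something the design prescribes.

```js
// Running checksum over console output.
const FNV_OFFSET = 0x811c9dc5;
const FNV_PRIME = 0x01000193;

function updateChecksum(checksum, line) {
  for (let i = 0; i < line.length; i++) {
    checksum ^= line.charCodeAt(i);
    checksum = Math.imul(checksum, FNV_PRIME) >>> 0;
  }
  return checksum;
}

// A benchmark might start with FNV_OFFSET, call updateChecksum per output line,
// and finally resolve its promise with checksum.toString(16).
```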
Versioning
There are some details about how benchmark data is collected, how scores are calculated, and so on, that may be interesting to change over time. For this reason, the benchmark suite should have a version (e.g. 2019.1) that can be updated. An updated version may imply a difference in the benchmark runner, a difference in the benchmarks themselves, or a different set of benchmark score baselines.
Scores
I think we should consider converting elapsed time to abstract scores via baselineTime / elapsedTime. Each benchmark would have a baseline value. This allows for the convenient "larger-is-better" characteristic, which also provides more fine-grained comparisons for significant speedups.
I would propose that the initial baseline should be chosen such that a benchmark gets a score of 100 on a standard desktop/laptop machine. This is easiest to develop for, and with the advances in high-end smartphone CPUs, it's comparable for the mobile use-case as well.
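Sketching the arithmetic (the baseline numbers are purely illustrative): the baseline is calibrated so that the reference desktop machine lands at 100.

```js
// score = baselineTime / elapsedTime, so larger is better. If the reference
// machine finishes a benchmark in 250 ms, its baseline would be 25000.
function score(baselineMs, elapsedMs) {
  return baselineMs / elapsedMs;
}

score(25000, 250); // 100 on the reference machine
score(25000, 125); // 200 -- twice as fast
score(25000, 500); // 50  -- twice as slow
```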
Which benchmarks are there?
The benchmark runner has a list of benchmarks. It should be able to print a list of benchmarks in machine-readable form for consumption by other tools.
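One possible shape for that machine-readable listing, tying together the manifest, expected-value, and baseline ideas above; the schema and the entries are invented for illustration, and nothing above fixes them.

```js
// Print the known-benchmark manifest as JSON for consumption by other tools.
const benchmarks = [
  { name: "fib",  version: "2019.1", expected: "102334155", baselineMs: 25000 },
  { name: "zlib", version: "2019.1", expected: "1a2b3c4d",  baselineMs: 180000 },
];
console.log(JSON.stringify(benchmarks, null, 2));
```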
Statistical aggregation
Here I am no expert. If I were to design a system on my own, I would make 10-20 measurements per individual benchmark, running them as separate invocations. I would report median and standard deviation values in a text output. I would not provide an overall score because I don't know what that means.
I think it's best if another layer of benchmark runner handles the collection and aggregation of multiple data points. Probably it makes sense for that program to be implemented in Python.
To compare to a previous run in text format, I would provide output like that of WebKit's benchmark runners, which indicates which components definitely changed, and by what percentage.
I would likely want to be able to plot a histogram of all data points over a bar chart with medians, as in the graphs from https://wingolog.org/archives/2019/06/26/fibs-lies-and-benchmarks. To compare two runs, the bars from the two runs can be grouped by benchmark. It may be a good idea for the aggregator to output a CSV file with all measurements.
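Whatever language that aggregating layer ends up in (Python is suggested above), the statistics themselves are straightforward. A sketch, in the same JavaScript used for the other examples here and with made-up sample data:

```js
// Median and standard deviation over per-process samples, plus one CSV row per
// measurement (benchmark,run,elapsedMs) as raw material for plotting.
function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const mid = s.length >> 1;
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function stddev(xs) {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  return Math.sqrt(xs.reduce((acc, x) => acc + (x - mean) ** 2, 0) / xs.length);
}

const samples = [251, 248, 260, 255, 249]; // ms, from five separate processes
console.log(`median=${median(samples)} ms, stddev=${stddev(samples).toFixed(1)} ms`);
samples.forEach((ms, i) => console.log(`fib,${i},${ms}`));
```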
Existing benchmarks for comparison
Prior art in wasm: https://github.com/lars-t-hansen/embenchen/blob/master/asm_v_wasm/wasm_bench.py#L144
A concrete proposal
I think https://github.com/WebAssembly/benchmarks/pull/3/files#diff-2509ee46f4f78dddbf3a749966a7c80d is not a bad first draft. If you agree with the points above, I think we should start with that runner, as its functionality mostly matches these design points.
Good summary of design goals! Some thoughts:
Single benchmark run per process
Will there be multiple iterations of a kernel function inside a single benchmark run? If not, the benchmark's workload must be carefully designed to avoid an execution time too small to be measured accurately. Also, it's hard to measure the performance impact of "tiering" in engines when testing only one iteration.
Headless
This point and the previous point mean that the tests should be run under V8's d8 shell, SpiderMonkey's js shell, and WebKit's jsc.
Do we need to consider Node and browsers (maybe without UI), considering that WebAssembly is widely used in both?
Metric: Elapsed time to promise fulfillment
Is it reasonable to distinguish startup time from execution time? They may suggest different optimization directions.
Checking output
the benchmark may choose to return the entire output as a string, or a checksum, or just the last line, as appropriate
I like the idea of returning different things for output checking! Will there be a detailed guide on how to generate the output properly? For example, we may need a robust checking method for floating-point results. Also, should the calculation of the checksum or string be excluded from scoring?
Statistical aggregation
As Ben mentioned in the Overview.md, there may be different use cases depending on who is using it. The proposal here looks more ‘developer-centric’. For an ordinary benchmark user who just wants to know ‘will my machine be fast enough to run WASM’, a single score might be helpful.