
What are the requirements for a benchmark line item?

Open titzer opened this issue 5 years ago • 7 comments

In CG meetings, including the face-to-face in La Coruna, we've discussed what the requirements of a benchmark line item should be. I'm filing this issue to attract and distill discussion around the topic and build consensus for what those criteria should be.

titzer avatar Jul 10 '19 08:07 titzer

Some ideas that have been floated:

  • Require source code
  • Require licensing of source code under a (set of) approved licenses
  • Require build instructions / build scripts
  • Require algorithmic description / code documentation
  • Require multiple workloads if applicable
  • Require, for each line item, both source and binary form, with the binary form being updated (relatively infrequently) in response to changes in the respective toolchains
  • Require each line item to perform self-validation of outputs (i.e. correctness criteria)

Other thoughts / ideas?

titzer avatar Jul 10 '19 08:07 titzer

I propose to also require that the benchmark results can be programmatically fetched/consumed (in the case where the benchmark produces its own results and we're not measuring its performance externally). I am thinking of one particular benchmark we've used where the results were rendered into a canvas, failing this criterion and making it really hard to get insightful information.
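As a rough sketch of what "programmatically consumable" could look like, the runner could print one machine-readable JSON record to stdout; the field names and the workload below are made up for illustration, not taken from any existing benchmark:

import json
import sys
import time

def run_workload():
    # Stand-in for the actual benchmark kernel (hypothetical).
    return sum(range(1000))

ITERATIONS = 1000
start = time.perf_counter()
for _ in range(ITERATIONS):
    run_workload()
elapsed = time.perf_counter() - start

# One JSON record on stdout is trivially consumable by any harness,
# unlike results rendered into a canvas.
json.dump({"name": "factorial-recursive",
           "iterations": ITERATIONS,
           "elapsed_seconds": elapsed}, sys.stdout)
sys.stdout.write("\n")

Any harness can then parse stdout instead of scraping a rendered page.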

Require build instructions / build scripts

I think this implies that builds should be entirely deterministic, that is, the exact versions of the compilers / toolchains that created them should be provided, so it's easy to reproduce the binaries on different machines, if not to compute a hash of the produced binaries and compare it against an expected hash. (Containers to the rescue!)
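A minimal sketch of that hash check, assuming the build has produced a .wasm binary and an expected SHA-256 digest is checked in next to the benchmark sources (both file names are hypothetical):

import hashlib
import sys

# Hypothetical paths: the binary produced by the pinned toolchain, and the
# expected digest recorded alongside the benchmark sources.
BINARY = "fac.wasm"
EXPECTED = open("fac.wasm.sha256").read().strip()

digest = hashlib.sha256(open(BINARY, "rb").read()).hexdigest()
if digest != EXPECTED:
    sys.exit(f"non-reproducible build: got {digest}, expected {EXPECTED}")
print("binary matches the expected hash")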

bnjbvr avatar Jul 10 '19 08:07 bnjbvr

I propose to also require that the benchmark results can be programmatically fetched/consumed

This is a great idea, and there's already precedent for consumable testing through wast scripts. Many engines, such as wasmi and cranelift-wasm, programmatically consume tests through wat2wasm bindings. Wast scripts could be extended with a benchmarking syntax that requires these line items to be defined as well.

A single benchmark definition could look like this:

(benchmark
    (kernel ;; micro / kernel / application? / domain?
        (name "factorial-recursive")
        (description "Recursive factorial implementation. Benchmarks ... and ...")
        (complexity (time "O(n)") (space "O(n)"))
        (source "fac.watb") ;; new benchmark filename?
         ;; only recommendations
        (warmup_iter 200)
        (bench_iter 1000)
    )
    ;; The function to benchmark
    (assert_return
        (invoke
            "fac-rec"
            (i64.const 20))
        (i64.const 2432902008176640000)
    )
)

The binary source could be formatted similarly to a binary module in wast files:

(module "fac-rec" binary "\00asm" "\01\00\00\00\01\04\01\60 ...")

And the written source code would be inside a textual module. The entire file would then look something like this:

(module "fac-rec" binary "\00asm" "\01\00\00\00\01\04\01\60 ...")

(module
  ;; Recursive factorial
  (func (export "fac-rec") (param i64) (result i64)
    (if (result i64) (i64.eq (local.get 0) (i64.const 0))
      (then (i64.const 1))
      (else
        (i64.mul (local.get 0) (call 0 (i64.sub (local.get 0) (i64.const 1))))
      )
    )
  )
)

(benchmark
    (kernel ;; micro / kernel / application? / domain?
        (name "factorial-recursive")
        (description "Recursive factorial implementation. Benchmarks ... and ...")
        (complexity (time "O(n)") (space "O(n)"))
        (source "fac.watb") ;; new benchmark filename? keep .wast?
         ;; only recommendations
        (warmup_iter 200)
        (bench_iter 1000)
    )
    ;; The function to benchmark
    (assert_return
        (invoke
            "fac-rec"
            (i64.const 20))
        (i64.const 2432902008176640000)
    )
)

Multiple benchmarks of the same kind could be described within the same file, e.g. different implementations of factorial.

This would allow current engines to reuse the same code they already test with and add logic for executing a benchmark instead. It would be hard to model this approach for applications, though for micro, kernel, and domain-specific benchmarks it may work.
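To illustrate how a harness might act on the warmup_iter / bench_iter and assert_return parts of such a definition, here is a minimal, hypothetical driver sketch; invoke stands in for whatever mechanism an engine already uses to call an exported function:

import math
import time

def run_benchmark(invoke, name, args, expected, warmup_iter=200, bench_iter=1000):
    # Warmup: let tiering engines reach steady state before measuring.
    for _ in range(warmup_iter):
        invoke(name, *args)

    # Self-validation (the assert_return part): check correctness before
    # trusting any timings.
    result = invoke(name, *args)
    assert result == expected, f"{name}: got {result}, expected {expected}"

    # Measured iterations.
    start = time.perf_counter()
    for _ in range(bench_iter):
        invoke(name, *args)
    return (time.perf_counter() - start) / bench_iter  # seconds per call

# Plain-Python stand-in for invoking the exported "fac-rec" function:
fake_invoke = lambda name, n: math.factorial(n)
print(run_benchmark(fake_invoke, "fac-rec", (20,), 2432902008176640000))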

Would it be appropriate to open an issue?

fisherdarling avatar Jul 10 '19 10:07 fisherdarling

I don't like the idea of a text representation of the binary. There should be a reference to an original binary file created from any toolchain.

I also expect that the original sources will not be in the WAT format. They can be in any language, and there can be multiple source files. I think a subfolder for the sources of every test seems more practical.

Horcrux7 avatar Jul 10 '19 14:07 Horcrux7

Builds should be entirely deterministic

A Docker registry could ensure that the same compilers/toolchains are used every time, allowing anyone to reproduce the build exactly.
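Something like the following, where the toolchain image is pinned by digest rather than a mutable tag; image name, digest, and build script are placeholders:

import os
import subprocess

# Pinning the toolchain image by digest (not a mutable tag) means everyone
# pulls exactly the same compilers, so the build environment itself is
# reproducible.
IMAGE = "registry.example.org/wasm-toolchain@sha256:..."

subprocess.run(
    ["docker", "run", "--rm",
     "-v", f"{os.getcwd()}:/src", "-w", "/src",
     IMAGE, "./build.sh"],
    check=True,
)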

Warfields avatar Jul 10 '19 15:07 Warfields

I'm thinking that variance of startup time may also need our attention. Disabling tiering in Wasm engines can make time measurement more stable, but it hides the real startup time from the user's perspective. In our previous experiments on Spec2k6, PolyBench, and OpenCV.js, we observed large variance in the startup time of Wasm workloads; maybe we have to find some way to handle it properly?

Besides, since there are many candidates for benchmark cases, I'd like to limit the overall run time of the benchmark suite. A time-consuming benchmark is unfriendly to users. Maybe we can group the cases and allow people to run a single case or a subgroup.
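One hypothetical way to get both a bounded default run and selectable subsets is to tag each case with a group and filter from the command line; the case names and groups below are made up:

import argparse

# Hypothetical registry: case name -> (group, runner).
CASES = {
    "fac-rec":   ("micro", lambda: None),
    "polybench": ("kernel", lambda: None),
    "opencv":    ("application", lambda: None),
}

parser = argparse.ArgumentParser()
parser.add_argument("--group", help="run only cases in this group")
parser.add_argument("--case", help="run a single named case")
args = parser.parse_args()

for name, (group, run) in CASES.items():
    if args.case and name != args.case:
        continue
    if args.group and group != args.group:
        continue
    run()
    print(f"ran {name} ({group})")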

jing-bao avatar Jul 12 '19 13:07 jing-bao

I'm thinking that variance of startup time may also need our attention. Disabling tiering in Wasm engines can make time measurement more stable, but it hides the real startup time from the user's perspective. In our previous experiments on Spec2k6, PolyBench, and OpenCV.js, we observed large variance in the startup time of Wasm workloads; maybe we have to find some way to handle it properly?

Besides, since there are many candidates for benchmark cases, I'd like to limit the overall run time of the benchmark suite. A time-consuming benchmark is unfriendly to users. Maybe we can group the cases and allow people to run a single case or a subgroup.

Agreed, the stability of the benchmark will be important; otherwise it may not be usable for fair comparison.

Require build instructions / build scripts
Require, for each line item, both source and binary form, with the binary form being updated (relatively infrequently) in response to changes in the respective toolchains

The binary release should also contain toolchain information, like versions and build options, so it can be part of the benchmark result for performance comparison.
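For example, the result record could simply carry a toolchain block next to the score; the field names and values below are made up for illustration:

import json

# Hypothetical result record: the toolchain block travels with every score so
# comparisons across binary releases stay meaningful.
result = {
    "benchmark": "factorial-recursive",
    "ns_per_iteration": 412.7,          # example value, not a measurement
    "toolchain": {
        "compiler": "clang 18.1.0",     # example version
        "binaryen": "version 117",      # example version
        "flags": ["-O3", "-flto"],
    },
}
print(json.dumps(result, indent=2))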

TianyouLi avatar Jul 16 '19 07:07 TianyouLi