benchmark_harness

Why is the benchmark reporting 10-times higher values?

Open tomaskulich opened this issue 11 years ago • 7 comments

I understand it's beneficial (for the sake of accuracy) to run the measured function 10 times in a loop. But why is this the value that is actually reported? Why not report the value divided by 10? It is really confusing!
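To illustrate what I mean, here is a minimal sketch in pseudo-Python (not the package's actual Dart code; `run`, `exercise`, and the iteration counts are stand-ins): the harness wraps the benchmark body in a 10-iteration loop and reports the time per loop, which is 10x the cost of a single run.

```python
import time

def run():
    # stand-in for the user's benchmark body (hypothetical workload)
    sum(range(1000))

def exercise():
    # the harness wraps the body in a 10-iteration loop
    for _ in range(10):
        run()

def benchmark(iterations=100):
    start = time.perf_counter()
    for _ in range(iterations):
        exercise()
    elapsed_us = (time.perf_counter() - start) * 1e6
    reported = elapsed_us / iterations   # microseconds per exercise() call
    per_run = reported / 10              # what I would expect to be reported
    return reported, per_run
```

The first value is what the harness reports today; dividing by 10 gives the per-run figure I would expect.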

tomaskulich avatar Feb 02 '15 11:02 tomaskulich

+1, I think it's worth going to 2.0.0 to fix this

jakemac53 avatar Apr 13 '15 19:04 jakemac53

cc @johnmccutchan for insight

sethladd avatar Apr 13 '15 19:04 sethladd

This is legacy that we should not change. The purpose of this harness is to replicate the exact same benchmark conditions that we use internally. All historical data has this same scaling in place.

If you have a benchmark that runs under this harness and you speed it up (or slow it down), you can see the relative change to your base line.

tl;dr: this isn't a "code timer" but a benchmark runner that is designed to match our internal benchmarking infrastructure.

johnmccutchan avatar Apr 13 '15 20:04 johnmccutchan

Internally we could stay on 1.0.4 though, right? It seems wrong to force this behavior on all users of this package. Or maybe we could hide it behind an option?

jakemac53 avatar Apr 13 '15 20:04 jakemac53

@jakemac53 If the external version changes it makes it impossible for us to compare results to our internal numbers.

I'll discuss what we want to do long term with this package at the next compiler team meeting.

johnmccutchan avatar Apr 14 '15 14:04 johnmccutchan

Ok, sounds good. My main concern is that this package currently advertises itself as the officially endorsed package for writing Dart benchmarks, but it really seems like it's just for internal use if we can never make changes that would throw off our historical measurements. Instead, it seems like users should lock themselves to a particular version and upgrade only when the benefits of new features outweigh the cost of having to normalize their historical data.

jakemac53 avatar Apr 14 '15 15:04 jakemac53

Seems like a flag would work. If the examples and internal benchmarks set the same flag, the results would be comparable, without affecting other benchmarks.
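To make this concrete, here is a minimal sketch (in Python rather than Dart, and with a made-up flag name `scale_to_single_run`; the real package has no such option) of how an opt-in flag could preserve the legacy scaling by default while letting other users get per-run numbers:

```python
import time

class BenchmarkBase:
    """Sketch only; mimics the harness's 10-iteration exercise() loop."""

    def __init__(self, name, scale_to_single_run=False):
        self.name = name
        # hypothetical opt-in flag: report per-run time instead of
        # per-exercise time; off by default to keep historical comparability
        self.scale_to_single_run = scale_to_single_run

    def run(self):
        raise NotImplementedError  # the benchmark body goes here

    def exercise(self):
        # legacy behavior: run the body 10 times per measurement
        for _ in range(10):
            self.run()

    def measure_exercise_us(self):
        # microseconds per exercise() call, averaged over 100 calls
        n = 100
        start = time.perf_counter()
        for _ in range(n):
            self.exercise()
        return (time.perf_counter() - start) * 1e6 / n

    def score(self):
        s = self.measure_exercise_us()
        # default (flag off) keeps the historical 10x scaling
        return s / 10 if self.scale_to_single_run else s

    def report(self):
        print(f"{self.name}(RunTime): {self.score()} us.")
```

With the flag off, scores stay comparable to existing internal baselines; benchmarks that opt in get the divided-by-10 value.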

skybrian avatar Jul 18 '15 00:07 skybrian