
Reset `_total_stats` before each run

travisdowns opened this issue on Aug 07 '22 · 11 comments

`_total_stats` was not being reset before each run, meaning that values were inflated by a factor of R for a benchmark with R runs (e.g., by 5x for the default of 5 runs).

This affected only the non-time metrics tracked in `perf_stats`: allocs, tasks, and instruction count.
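
To make the failure mode concrete, here is a minimal, self-contained sketch of the accumulation bug and the shape of the fix. The struct and member names are modeled on the issue description, not copied from the Seastar sources:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical, simplified model of the perf-test harness.
struct run_stats {
    uint64_t allocations = 0;
    uint64_t tasks = 0;
    uint64_t instructions = 0;
};

struct performance_test {
    run_stats _total_stats; // accumulated by the measured regions of a run

    void one_run() {
        // Stand-in for a real measurement: each run adds its deltas here.
        _total_stats.instructions += 1000;
        _total_stats.allocations += 3;
    }

    void run_all(unsigned runs) {
        for (unsigned r = 0; r < runs; ++r) {
            _total_stats = {}; // the fix: reset before each run; without it,
                               // run r reports the sums of runs 0..r, so the
                               // non-time metrics are inflated across R runs
            one_run();
            std::printf("run %u: inst=%llu allocs=%llu\n", r,
                        (unsigned long long)_total_stats.instructions,
                        (unsigned long long)_total_stats.allocations);
        }
    }
};

int main() {
    performance_test{}.run_all(5); // with the reset, every run prints 1000/3;
                                   // without it: 1000, 2000, ..., 5000
}
```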

travisdowns · Aug 07 '22 01:08

I was hunting down some nonsensical instruction counts in a perf test, and this turned out to be the cause.

Note that this only affects tests that use `perf_tests::start_measuring_time()`; tests that instead implicitly measure the entire run don't use `_total_stats`.
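
Roughly, the pattern in question looks like this. A hedged sketch based on Seastar's perf-test harness (header path from memory); `prepare_input()` and `process()` are hypothetical stand-ins for real test code:

```cpp
#include <seastar/testing/perf_tests.hh>

#include <vector>

// Hypothetical stand-ins for real setup and work.
static std::vector<int> prepare_input() {
    return std::vector<int>(1024, 1);
}

static long process(const std::vector<int>& v) {
    long sum = 0;
    for (auto x : v) {
        sum += x;
    }
    return sum;
}

// Only the region between start/stop is measured, and its counters are
// accumulated into _total_stats; a test that omits these calls implicitly
// measures its whole body and never touches _total_stats.
PERF_TEST(example_group, explicit_region)
{
    auto input = prepare_input();       // untimed setup
    perf_tests::start_measuring_time();
    auto r = process(input);            // measured region
    perf_tests::stop_measuring_time();
    (void)r;                            // keep the result alive
}
```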

travisdowns · Aug 07 '22 01:08

@tgrabiec please review, it's your code IIRC

avikivity · Aug 08 '22 14:08

I think @bhalevy has also been in this area recently and may have added the `perf_stats` functionality (very nice, BTW).

travisdowns · Aug 08 '22 15:08

> I think @bhalevy has also been in this area recently and may have added the `perf_stats` functionality (very nice, BTW).

I just copied it from scylla :)

bhalevy · Aug 08 '22 16:08

Looks good to me. Can you please run the existing tests and quote before-and-after numbers to see if and what changed, to make sure they still make sense.

(Please remember to set your host's cpufreq governor to `userspace` at the lowest frequency, to eliminate CPU core temperature effects.)

bhalevy · Aug 08 '22 16:08

> Can you please run the existing tests and quote before-and-after numbers to see if and what changed, to make sure they still make sense.

You mean in the seastar repo, right? I can do that if you'd like, but none of the existing tests use `start_measuring_time()`, so I don't expect any difference: `_total_stats` is not used unless that method is called.

travisdowns · Aug 08 '22 16:08

@avikivity It's not mine.

tgrabiec · Aug 08 '22 16:08

> Can you please run the existing tests and quote before-and-after numbers to see if and what changed, to make sure they still make sense.

> You mean in the seastar repo, right?

yes

> I can do that if you'd like, but none of the existing tests use `start_measuring_time()`, so I don't expect any difference: `_total_stats` is not used unless that method is called.

Understood, but let's verify that empirically to make sure we haven't missed anything.

bhalevy · Aug 08 '22 18:08

I had trouble running `fair_queue_perf`, as described in https://github.com/scylladb/seastar/issues/1175; the fix is in the associated PR.

travisdowns · Aug 08 '22 21:08

@bhalevy

I have now run all the existing `_perf` tests before and after this change (both runs included the cherry-picked fix for #1175), like so:

```sh
for t in $PTESTS; do echo "running $t" && tests/perf/$t --smp=2 || { echo "FAILED"; break; }; done | tee -a before.txt
```

(the after run went to `after.txt`)

There was some timing variation, as is to be expected, though generally modest. Timing does not come from `_total_stats`, so I think we can neglect it.

I then focused on the last three columns (allocs, tasks, inst), which come from `_total_stats`:

```sh
diff -s <(cat before.txt | tr -s ' ' | cut -d' ' -f7-9) <(cat after.txt | tr -s ' ' | cut -d' ' -f7-9)
```

Result (recall the three columns are allocs tasks inst):

```
39c39
< 3.500 3.000 993.5
---
> 3.500 3.000 993.4
47,48c47,48
< 3.000 4.000 1128.6
< 2.000 2.500 716.9
---
> 3.000 4.000 1128.5
> 2.000 2.500 716.8
61c61
< 146.000 0.000 2847533.4
---
> 146.000 0.000 2847241.6
65,66c65,66
< 145.000 0.000 1983031.7
< 136.000 0.000 13100595.3
---
> 145.000 0.000 1983031.6
> 136.000 0.000 13100594.6
69,70c69,70
< 188.000 0.000 26000196.5
< 10.000 0.000 17020168.3
---
> 188.000 0.000 26001157.3
> 10.000 0.000 17020168.2
77,78c77,78
< max 665.5 op/s
< max 45619736.3 op/s
---
> max 666.1 op/s
> max 46096694.2 op/s
```

There are some differences, mostly attributable to small instruction-count variations, which is to be expected as this is not completely deterministic. The larger differences appear to occur when more or fewer tasks were executed, i.e., a difference in suspension-point timing, so I think we can also ignore these.

Is that sufficient?

travisdowns · Aug 08 '22 22:08

Thanks, Benny, for the review; it's appreciated.

travisdowns · Aug 10 '22 04:08