Address non-monotonicity of steal time and other issues
This series primarily addresses the problem that on systems with low amounts of steal, steal time appears negative (i.e., the cumulative steal time counter goes down from sample to sample).
This is wrong on the face of it, and it also causes serious problems when the value is used as a metric in Prometheus, since the counter contract (monotonic increase) is violated. A decrease triggers spurious "counter reset" detection in Prometheus (e.g., if the counter drops from 1000 ms to 998 ms, Prometheus assumes it was reset to zero and sees an apparent increase of ~998 ms), and hence bogus, very large or very small steal time results in rate (or similar) queries.
This is addressed in two ways:
- We make the sleep time calculation more accurate. Inaccurate sleep time is the underlying cause of negative steal, and this change reduces the error (and so the "negativeness") of steal by a couple of orders of magnitude. After this change steal time is often 0 when rounded to the nearest ms where it wasn't before.
- Because the above still does not prevent small negative steal completely, we change the implementation of the metric to cap steal from below at 0 in periods where steal was negative.
The individual changes have further details. I am open to splitting commits that may be less popular or require more discussion into a different PR if it makes sense.
The changes look reasonable, but isn't steal time terminally broken?
thread_cputime_clock measures both kernel and user time. (Wall time) - (sleep time) overlaps that, since some of the sleep time is spent in the sleeping system call. For epoll/linux-aio this would usually be tiny, but for io_uring it could be quite large since the same call is used both to push I/O requests and to sleep. Perhaps we should adjust the io_uring backend to only sleep if it has no events to push.
> thread_cputime_clock measures both kernel and user time. (Wall time) - (sleep time) overlaps that, since some of the sleep time is spent in the sleeping system call. For epoll/linux-aio this would usually be tiny
Yes, exactly: that is the underlying flaw, and this series only improves it quantitatively rather than eliminating it. Another "bug" is that we can't actually tell whether we are going to sleep in the system call: may_sleep can be true, yet we find IO right away, in which case the getevents call returns immediately and we did not sleep at all, but all that work got counted as sleep; this can repeat at high frequency for some IO patterns. However, in practice steal has been quite useful for us in a variety of incidents, even in its flawed state.
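To make the arithmetic concrete (a hand-worked example with numbers of the same order as the aio run shown further down; as the printed figures show, steal here is simply awake minus cpu):

```c++
// Illustrative arithmetic only -- the variable names are made up.
double wall_ms  = 100.0;               // length of the sampling window
double sleep_ms = 43.0;                // booked as "sleep", but largely spent busy inside the wait syscall
double cpu_ms   = 99.7;                // thread CPU time (user + kernel) over the window
double awake_ms = wall_ms - sleep_ms;  // ~57 ms -- undercounted; we were really awake ~100 ms
double steal_ms = awake_ms - cpu_ms;   // ~-43 ms -- spuriously negative
```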
However, for aio specifically this adjustment of the timing points gave good results in practice. I don't have good numbers here since this is an old change, but based on my long-ago testing, pushing down the sleep calculation reduced negative steal to a very small amount, often rounding to 0 (in ms) over the sample period. If there is almost any "true" steal, it is now almost always enough to overwhelm the negative steal part. In practice, under load we almost always see significant "true" steal even if nothing else is running, due to the syscall thread which is constantly working to handle EAGAIN IOs :(.
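For reference, the shape of the "pushing down" idea looks roughly like this (a simplified sketch under my own naming, not the actual reactor code): only the call that can actually block is bracketed by timestamps, so the polling and bookkeeping done while deciding whether to sleep is no longer booked as sleep.

```c++
#include <chrono>
#include <libaio.h>   // io_getevents(); seastar wraps the raw syscall, this is just for illustration

// Simplified sketch, not seastar's reactor code: charge as sleep only the time
// spent inside the one call that can actually block.
std::chrono::steady_clock::duration
measured_wait(io_context_t ctx, io_event* events, long max, struct timespec* timeout) {
    auto const before = std::chrono::steady_clock::now();
    io_getevents(ctx, 1, max, events, timeout);           // the only point where we may block
    auto const after = std::chrono::steady_clock::now();
    return after - before;                                // only this interval counts as sleep
}
```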
So this is why we need the second part of this change, which simply discards negative steal periods (note that this is different from simply preventing the steal counter from going down).
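A minimal sketch of that distinction (illustrative names, not the actual metric code in this series): the exposed counter accumulates only non-negative per-period deltas, so a slightly negative period is dropped outright rather than having to be "paid back" later, which a plain clamp of the counter would effectively require.

```c++
#include <algorithm>

// Illustrative only -- not the implementation in this series.
struct steal_accumulator {
    double _last_raw = 0;   // raw (possibly decreasing) cumulative steal
    double _reported = 0;   // what the metric exposes (monotonic by construction)

    // Discard negative periods: each sampling period contributes max(0, delta).
    void sample(double raw_cumulative_steal) {
        double delta = raw_cumulative_steal - _last_raw;
        _last_raw = raw_cumulative_steal;
        _reported += std::max(0.0, delta);
    }
    // For contrast, merely clamping (_reported = max(_reported, raw)) would also
    // stay monotonic, but a negative excursion would then suppress later real
    // steal until the raw value catches up again.
};
```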
> but for io_uring it could be quite large since the same call is used both to push I/O requests and to sleep. Perhaps we should adjust the io_uring backend to only sleep if it has no events to push.
Is that true? Currently it looks like we still use two calls in the uring backend: io_uring_submit to push and io_uring_wait_cqes to wait. There were some changes in this area, I think in order to allow a single syscall, but they were reverted; even with that change, this method still had the separate submit + wait calls.
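For reference, the two-call pattern being described looks roughly like this with liburing (a standalone toy, not the seastar backend; liburing does also provide io_uring_submit_and_wait() for the combined single-call case):

```c++
#include <liburing.h>

int main() {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    io_uring_prep_nop(sqe);               // stand-in for a real I/O request

    io_uring_submit(&ring);               // call #1: push pending requests to the kernel

    struct io_uring_cqe* cqe;
    io_uring_wait_cqe(&ring, &cqe);       // call #2: block until a completion arrives
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```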
Here are some results for a really bad case for the existing steal time calculation. This is running io_tester with this patch which prints out steal time metrics periodically and this config:
```yaml
- name: steal_printer
  shards: all
  type: steal
  shard_info:
    parallelism: 1
    think_time: 10ms
- name: big_writes
  shards: all
  type: seqwrite
  shard_info:
    parallelism: 1
    reqsize: 1kB
    # shares: 10
    think_time: 0us
```
and running it like:
```
ninja io_tester && apps/io_tester/io_tester --duration=10 --conf apps/io_tester/conf.yaml --smp=2 --reactor-backend=linux-aio --idle-poll-time-us=0 --poll-aio=1
```
Prior to this change, the output is:
Starting evaluation...
0 steal: -42.983 ms awake: 56.688 cpu: 99.671
1 steal: -41.848 ms awake: 57.782 cpu: 99.630
0 steal: -42.842 ms awake: 56.878 cpu: 99.721
1 steal: -43.470 ms awake: 56.266 cpu: 99.737
1 steal: -43.937 ms awake: 55.791 cpu: 99.727
0 steal: -44.238 ms awake: 55.531 cpu: 99.769
1 steal: -42.527 ms awake: 57.136 cpu: 99.663
I.e., every 100 ms we see about -43 ms of steal, because we think we are awake only ~57 ms (as the output shows, steal is just awake minus cpu, e.g. 56.7 - 99.7 ≈ -43). Of course, we know that this workload with 0 think time is actually awake 100% of the time, because as soon as any IO finishes we submit another one.
After the fix, the output is:
Starting evaluation...
0 steal: 0.445 ms awake: 100.129 cpu: 99.684
1 steal: 0.415 ms awake: 100.097 cpu: 99.682
0 steal: 0.374 ms awake: 100.135 cpu: 99.761
1 steal: 0.396 ms awake: 100.126 cpu: 99.729
0 steal: 0.419 ms awake: 100.147 cpu: 99.728
1 steal: 0.414 ms awake: 100.121 cpu: 99.707
0 steal: 0.428 ms awake: 100.151 cpu: 99.723
1 steal: 0.413 ms awake: 100.133 cpu: 99.720
0 steal: 0.442 ms awake: 100.152 cpu: 99.710
1 steal: 0.420 ms awake: 100.119 cpu: 99.699
...
I.e., we correctly identify that we are awake all the time, and steal is low and positive (this is probably real steal; it's running on my noisy laptop).
When I run a stress -c thread using 100% of the CPU on shard 0, we get output like this (after the fix):
Starting evaluation...
1 steal: 0.424 ms awake: 100.137 cpu: 99.713
0 steal: 63.933 ms awake: 120.246 cpu: 56.314
1 steal: 0.595 ms awake: 100.165 cpu: 99.570
0 steal: 48.330 ms awake: 106.006 cpu: 57.676
1 steal: 0.619 ms awake: 100.153 cpu: 99.534
0 steal: 63.599 ms awake: 118.001 cpu: 54.402
1 steal: 0.445 ms awake: 100.135 cpu: 99.690
0 steal: 48.325 ms awake: 105.988 cpu: 57.663
1 steal: 0.578 ms awake: 100.156 cpu: 99.577
0 steal: 63.855 ms awake: 117.987 cpu: 54.132
1 steal: 0.528 ms awake: 100.166 cpu: 99.638
0 steal: 48.104 ms awake: 106.009 cpu: 57.906
1 steal: 0.496 ms awake: 100.147 cpu: 99.651
0 steal: 63.986 ms awake: 117.964 cpu: 53.978
I.e., it is correctly showing ~50% steal on shard 0, as we'd expect from a single competing 100%-CPU thread. Pre-fix, it varied from 20% to 40% steal.
For io_uring the benefit is much smaller on that specific test, probably because the overall io_uring behavior is quite different: it does not have the "poll-aio=1" behavior, so it does go to sleep even when disk IO is outstanding, unlike the aio backend by default.
Here's the same output for io_uring (with a 1000 ms output interval now to provide some more averaging, as the numbers bounce around a lot):
pre-fix:
0 steal: -69.508 ms awake: 95.746 cpu: 165.254
1 steal: -82.224 ms awake: 117.205 cpu: 199.429
0 steal: -69.770 ms awake: 97.488 cpu: 167.258
1 steal: -82.169 ms awake: 116.266 cpu: 198.435
0 steal: -70.218 ms awake: 96.821 cpu: 167.039
1 steal: -74.251 ms awake: 103.012 cpu: 177.262
0 steal: -70.684 ms awake: 99.314 cpu: 169.998
1 steal: -66.691 ms awake: 90.682 cpu: 157.372
0 steal: -68.950 ms awake: 95.269 cpu: 164.219
1 steal: -69.599 ms awake: 94.950 cpu: 164.549
0 steal: -73.325 ms awake: 104.437 cpu: 177.763
1 steal: -69.886 ms awake: 95.770 cpu: 165.656
0 steal: -72.801 ms awake: 105.222 cpu: 178.023
1 steal: -71.757 ms awake: 101.039 cpu: 172.796
0 steal: -69.816 ms awake: 100.048 cpu: 169.864
1 steal: -73.340 ms awake: 105.095 cpu: 178.435
0 steal: -69.226 ms awake: 97.243 cpu: 166.469
1 steal: -71.744 ms awake: 98.021 cpu: 169.765
post-fix:
Starting evaluation...
0 steal: -36.019 ms awake: 136.256 cpu: 172.275
1 steal: -38.107 ms awake: 141.260 cpu: 179.367
0 steal: -34.019 ms awake: 128.660 cpu: 162.679
1 steal: -33.000 ms awake: 123.472 cpu: 156.472
1 steal: -33.379 ms awake: 124.124 cpu: 157.503
0 steal: -37.188 ms awake: 134.169 cpu: 171.357
1 steal: -40.808 ms awake: 141.996 cpu: 182.804
0 steal: -33.975 ms awake: 126.294 cpu: 160.270
1 steal: -41.464 ms awake: 143.828 cpu: 185.292
0 steal: -33.172 ms awake: 123.730 cpu: 156.902
1 steal: -34.853 ms awake: 121.107 cpu: 155.960
0 steal: -35.433 ms awake: 130.999 cpu: 166.432
1 steal: -34.941 ms awake: 128.392 cpu: 163.333
0 steal: -35.278 ms awake: 130.750 cpu: 166.028
1 steal: -33.433 ms awake: 126.931 cpu: 160.364
0 steal: -33.842 ms awake: 129.780 cpu: 163.622
1 steal: -31.123 ms awake: 119.651 cpu: 150.773
0 steal: -36.534 ms awake: 135.890 cpu: 172.424
So there is a definite reduction in false steal, from about 6-7% to about 3.5%, but it's still significant for that backend.
The negative steal is about the same for io_uring even without the IO (or without the --idle-poll-time-us=0 arg), and the improvement from this fix is about the same.
I wanted to add that for all backends we will, in some cases (when we get all the way to the underlying backend's "wait for events" function), make 2 timestamp calls instead of 1, since previously the "after" timestamp was shared with the idle_end logic, but now it is not.
However, in other cases we will take 0 timestamps instead of 1, because in try_sleep we abandon the attempt to sleep in many cases and never reach the sleep location. For aio I expect this latter case to be very common, since we don't go to sleep if there is any outstanding IO in the default --poll-aio=1 mode. For other backends it is less clear.
Both of these cases are on the "idle" or "about to sleep" paths, though, so they are probably less performance sensitive.
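Roughly, the two cases above look like this (simplified pseudo-code; the helper names are stand-ins, not actual reactor members):

```c++
#include <chrono>

using clk = std::chrono::steady_clock;

// Stand-in declarations for the sketch, not real seastar functions.
bool work_pending();              // runnable tasks queued?
bool outstanding_io();            // IO in flight (relevant with --poll-aio=1)
void backend_wait_for_events();   // the backend's blocking wait
void account_sleep(clk::duration d);

bool try_sleep_sketch() {
    if (work_pending() || outstanding_io()) {
        return false;                 // bail out early: no timestamps taken at all
    }
    auto before = clk::now();         // timestamp #1
    backend_wait_for_events();        // the only place we may actually block
    auto after = clk::now();          // timestamp #2 (previously shared with the idle_end logic)
    account_sleep(after - before);
    return true;
}
```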
@avikivity or @xemul let me know if there's anything I can do to help move this along or also if it's DOA etc.
> @avikivity or @xemul let me know if there's anything I can do to help move this along or also if it's DOA etc.
@xemul can you handle it?
@xemul any hope that you can take a peek at this one? In our experience this is a big improvement to the accuracy and hence usefulness of the steal time metric.
Sorry for losing track of this one :( I'll review it shortly
https://github.com/scylladb/seastar/commit/b6fac76315e4ddf0dac9ba43823ee2e908bc1ea4 is purely a rebase to the tip of master.
@xemul all feedback addressed or at least responded to.
Thanks for the review @xemul !