automq icon indicating copy to clipboard operation
automq copied to clipboard

feat(s3stream): support burst upload for upload WAL task to reduce failover time.

Open lifepuzzlefun opened this issue 6 months ago • 0 comments

Why this patch

we want to optimize the partition failover time. and we found the main part in partition clean shutdown failover is stream close.

and the stream close is slow because of the Upload RateLimit when upload. and the FutureTicker also will cause the force upload task finish in long time. which at most delay 500ms and the 500ms delay may cause the upload data amount much larger.

What is changed

  1. support p99, p95 p50 in DeltaHistogram. use HdrHistogram which calculate p99 in low overhead.
  2. support DeltaWALUploadTask limiter burst mode which cal make all the acquire fast pass.
  3. support FutureTicker to wait small time when force upload.

image

kube800 the yellow line is enabled burst and wait at most 100ms when force upload. kube801 the orange line is enabled burst and wait at most 500ms when force upload. kube802 the blue line is no burst.

with this patch the s3stream close time p99 can drop from 3s to 500ms in our environment

test case

3 node each 200MB/s in traffic 30 partition each node. 100 partition total. submit reassign task each 2min which will change partition leader 100 -> 101, 101 -> 102, 102 -> 100.

lifepuzzlefun avatar Jul 29 '24 11:07 lifepuzzlefun