Potential performance regression in SNAPSHOT, or infrastructure issues
Describe the bug
Week 34
It seems that we have a performance regression in our current SNAPSHOT version. Based on the recent benchmarks (from the past weeks), it looks like something like a buffer is filling up over time and at some point reducing the overall throughput, from a generally stable 150 PI/s down to ~132 PI/s. This can also be observed in the current events panel, which shows how many events/commands are processed.
To back up my thesis that something is filling up: as soon as we have a role change (not necessarily a pod restart) that causes us to reinstall the StreamProcessor, we are back to "normal" stable performance, see screenshots below.
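For context, a quick back-of-the-envelope calculation of the regression size (only the 150 and ~132 PI/s figures come from the benchmarks above; everything else is illustrative):

```python
# Hypothetical sketch: relative size of the observed throughput drop.
stable_pi_s = 150.0    # roughly stable throughput, process instances per second
degraded_pi_s = 132.0  # throughput once the suspected buffer has filled up

drop_pct = (stable_pi_s - degraded_pi_s) / stable_pi_s * 100
print(f"throughput drop: ~{drop_pct:.0f}%")  # prints: throughput drop: ~12%
```

So the regression is roughly a 12% loss of throughput, which is well outside normal benchmark noise.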
Week 32 - seems to recover after role change
Benchmarks
Benchmark Week 32
Based on the metric, we could assume that if we have a leader change, it might recover to a stable performance.
Benchmark Week 33
Here it even looks like a role change caused this, as it happened around the same time the throughput went down.
Benchmark Week 34
In week 34 it seems to happen even earlier than in previous benchmarks.
Benchmark Week 35
Week 35 is stable so far; the question is for how long.
To Reproduce
Run a benchmark for a while.
Expected behavior
No drop in throughput after running for a while.
Additional notes
It might also be something related to infrastructure, as we can see that the commit latency goes up during the time when the throughput is low.
Week 32 is especially interesting, as it recovered after some time. Here it looks like the p90 of the commit latency recovered from 240 ms to 47 ms (more than a factor of 4).
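For reference, a minimal sketch of how a p90 like the one quoted above is computed from raw latency samples (nearest-rank method; the sample data below is made up, none of it comes from the real cluster metrics):

```python
import math

def p90(samples_ms):
    """Return the 90th-percentile latency (nearest-rank method)."""
    ordered = sorted(samples_ms)
    # nearest rank: ceil(0.9 * n) is the 1-based rank of the p90 sample
    rank = math.ceil(0.9 * len(ordered))
    return ordered[rank - 1]

# Made-up commit-latency samples in milliseconds:
print(p90([40, 45, 47, 50, 52, 55, 60, 120, 240, 300]))  # prints: 240
```

Note that the monitoring stack may use an interpolating or histogram-bucket-based estimate instead, so dashboard values will not match this exact method.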