Potential performance regression in SNAPSHOT, or infrastructure issues
Describe the bug
Week 34
It seems that we have a performance regression in our current SNAPSHOT version. Based on the recent benchmarks (from the past weeks), it looks like something like a buffer is filling up over time and at some point reducing the overall throughput, from a generally stable 150 PI/s down to ~132 PI/s. This can also be observed in the current events panel, which shows how many events/commands are processed.
To back up my thesis that something is filling up: as soon as we have a role change (not necessarily a pod restart) that causes us to reinstall the StreamProcessor, we are back to "normal" stable performance, see screenshots below.
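For context, a quick back-of-the-envelope calculation of the regression size (only the 150 and ~132 PI/s figures come from the benchmarks above; everything else is illustrative):

```python
# Hypothetical sketch: relative size of the observed throughput drop.
stable_pi_s = 150.0    # roughly stable throughput, process instances per second
degraded_pi_s = 132.0  # throughput once the suspected buffer has filled up

drop_pct = (stable_pi_s - degraded_pi_s) / stable_pi_s * 100
print(f"throughput drop: ~{drop_pct:.0f}%")  # prints: throughput drop: ~12%
```

So the regression is roughly a 12% loss of throughput, which is well outside normal benchmark noise.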
Week 32 - seems to recover after role change
Benchmarks
Benchmark Week 32
Based on the metric, we could assume that if we have a leader change, it might recover to a stable performance.
Benchmark Week 33
Here it even looks like a role change caused this, as it happened around the same time the throughput went down.
Benchmark Week 34
In week 34 it seems to happen even earlier than in previous benchmarks.
Benchmark Week 35
Week 35 is stable so far; the question is for how long.
To Reproduce
Run a benchmark for a while.
Expected behavior
No drop in throughput after running for a while.
Additional notes
It might also be something related to infrastructure, as we can see that the commit latency goes up during the time when the throughput is low.
Week 32 is especially interesting, as it recovered after some time. Here it looks like the p90 of the commit latency recovered from 240 ms to 47 ms (more than a factor of 4).
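For reference, a minimal sketch of how a p90 like the one quoted above is computed from raw latency samples (nearest-rank method; the sample data below is made up, none of it comes from the real cluster metrics):

```python
import math

def p90(samples_ms):
    """Return the 90th-percentile latency (nearest-rank method)."""
    ordered = sorted(samples_ms)
    # nearest rank: ceil(0.9 * n) is the 1-based rank of the p90 sample
    rank = math.ceil(0.9 * len(ordered))
    return ordered[rank - 1]

# Made-up commit-latency samples in milliseconds:
print(p90([40, 45, 47, 50, 52, 55, 60, 120, 240, 300]))  # prints: 240
```

Note that the monitoring stack may use an interpolating or histogram-bucket-based estimate instead, so dashboard values will not match this exact method.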