[Bug] ~5x performance degradation from generation webhook in v1.11.4
Kyverno Version
1.11.4
Kubernetes Version
1.26.x
Kubernetes Platform
Bare metal
Kyverno Rule Type
Generate
Description
Kyverno v1.11.4 shows a ~5x performance degradation in the generation webhook compared to Kyverno v1.9.4.
(Screenshots: generation latency measurements for Kyverno v1.11.4 and Kyverno v1.9.4)
Steps to reproduce
- Install the cluster policy
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-zk-kafka-configmap
spec:
  rules:
  - name: generate-zk-kafka-configmap
    match:
      any:
      - resources:
          kinds:
          - Namespace
          names:
          - "sc-gen-test-*"
    generate:
      synchronize: true
      apiVersion: v1
      kind: ConfigMap
      name: zk-kafka-address
      namespace: "{{request.object.metadata.name}}"
      data:
        kind: ConfigMap
        data:
          ZK_ADDRESS: "192.168.10.10:2181,192.168.10.11:2181,192.168.10.12:2181-playtime"
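The policy can then be applied with (the filename here is illustrative):

kubectl apply -f generate-zk-kafka-configmap.yaml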
- Create 500 namespaces
for i in {1..500}
do
  kubectl create ns sc-gen-test-$i
done
- Wait for all namespaces to be created
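A polling sketch for this step (not part of the original report; assumes kubectl >= 1.23 for --for=jsonpath support):

# Block until every test namespace reports phase Active
for i in {1..500}
do
  kubectl wait --for=jsonpath='{.status.phase}'=Active ns/sc-gen-test-$i --timeout=60s
done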
- Collect creation timestamps for each namespace and configmap
echo "TEST_NAMESPACE|||TEST_NAMESPACE_CREATION_TIMESTAMP|||CONFIGMAP_NAME|||CONFIGMAP_CREATION_TIMESTAMP"
for i in {1..500}
do
  TEST_NAMESPACE=sc-gen-test-$i
  TEST_NAMESPACE_CREATION_TIMESTAMP=$(kubectl get ns $TEST_NAMESPACE --no-headers -o=custom-columns=CREATION:metadata.creationTimestamp | awk '$1 {print $1}')
  CONFIGMAP_NAME=zk-kafka-address
  CONFIGMAP_CREATION_TIMESTAMP=$(kubectl get configmap $CONFIGMAP_NAME -n $TEST_NAMESPACE --no-headers -o=custom-columns=CREATION:metadata.creationTimestamp | awk '$1 {print $1}')
  echo "$TEST_NAMESPACE|||$TEST_NAMESPACE_CREATION_TIMESTAMP|||$CONFIGMAP_NAME|||$CONFIGMAP_CREATION_TIMESTAMP"
done
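For reference, the "max latency" figures discussed below can be derived from these timestamps with a sketch like the following (not part of the original steps; assumes GNU date for date -d, use gdate on macOS):

# Compute the worst-case delay between namespace and configmap creation
MAX_LATENCY=0
for i in {1..500}
do
  TEST_NAMESPACE=sc-gen-test-$i
  NS_TS=$(kubectl get ns $TEST_NAMESPACE -o jsonpath='{.metadata.creationTimestamp}')
  CM_TS=$(kubectl get configmap zk-kafka-address -n $TEST_NAMESPACE -o jsonpath='{.metadata.creationTimestamp}')
  DELTA=$(( $(date -d "$CM_TS" +%s) - $(date -d "$NS_TS" +%s) ))
  [ "$DELTA" -gt "$MAX_LATENCY" ] && MAX_LATENCY=$DELTA
done
echo "max latency: ${MAX_LATENCY}s"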
Expected behavior
N.A.
Screenshots
No response
Kyverno logs
No response
Slack discussion
No response
Troubleshooting
- [X] I have read and followed the documentation AND the troubleshooting guide.
- [X] I have searched other issues in this repository and mine is not recorded.
Related issue: https://github.com/kyverno/kyverno/issues/9633
Also note that there were significant changes to generate rules in 1.10 when synchronization is on:
https://kyverno.io/blog/2023/05/30/kyverno-1.10-released/#generate-rule-refactoring
@realshuting Is the performance hit an expected tradeoff from the generate-rule refactoring in v1.10.x onwards?
Since the background controller performs all of its work after the admission request completes, some "delay" in generating resources in the background is expected. The other aspect of measuring performance is overall memory consumption, and we have observed increased memory usage of the background controller in 1.10+. We are looking at continuous optimizations for both.
Currently the background controller has leader election enabled; it should help on both fronts if we can distribute work across all available replicas.
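For anyone following along, one way to confirm which replica currently holds the lock is to inspect the coordination Leases in the Kyverno namespace (a sketch; the namespace and lease names depend on the install method and may differ):

# List leader-election leases; holderIdentity shows the active replica
kubectl -n kyverno get leases
kubectl -n kyverno get lease kyverno-background-controller -o jsonpath='{.spec.holderIdentity}'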
@realshuting So can I say that, until more optimizations land in future releases, the performance overhead I am seeing now is expected? Do you have a rough roadmap for when some of these optimizations will come in?
Or is there any way to turn off the new feature that watches for changes in trigger resources, so as to optimize performance?
I tested with 1.12.5 and the default memory/CPU settings for the background controller; the max latency between namespace and ConfigMap creation is around 75s:
sc-gen-test-1|||2024-07-31T11:32:44Z|||zk-kafka-address|||2024-07-31T11:32:44Z
...
sc-gen-test-500|||2024-07-31T11:33:06Z|||zk-kafka-address|||2024-07-31T11:34:21Z
It doesn't seem to be a huge delay. I wonder what could be the difference between our tests?
I tested with 1.9.4 and the max latency is 29s, which is less than the 75s with 1.12.5:
sc-gen-test-500|||2024-08-02T14:32:05Z|||zk-kafka-address|||2024-08-02T14:32:34Z
Note that Kyverno 1.9 runs a single controller, whereas 1.12 splits generate processing across the admission and background controllers. I wonder if the difference is related to Kyverno's structure.
In April 2024, I was testing Kyverno v1.9.4 against v1.11.4.
The max latency I saw on my end was 2s for v1.9.4 vs 12s for v1.11.4. The latency in your lab tests for v1.9.4 seems to be about 10x worse than in my tests, though.
I haven't tried v1.12.5, but I suppose the latency would persist given that the structural change has been in place since v1.10.x.
(Screenshots: latency measurements for Kyverno v1.9.4 and Kyverno v1.11.4)
Oh ya, I was running the different Kyverno versions with 4 CPU and 4GB of memory. These are the args I run the generation workers with:
- --clientRateLimitQPS=100
- --clientRateLimitBurst=100
- --genWorkers=50
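For completeness, a hedged sketch of where those settings live in the background controller's Deployment (the deployment and container names assume a default Helm install of 1.10+ and may differ; in 1.9 the flags go on the single kyverno Deployment instead):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kyverno-background-controller  # name assumed; varies by install method
  namespace: kyverno
spec:
  template:
    spec:
      containers:
      - name: controller  # container name assumed
        args:
        - --clientRateLimitQPS=100
        - --clientRateLimitBurst=100
        - --genWorkers=50
        resources:
          limits:
            cpu: "4"      # the 4 CPU / 4GB used in the tests above
            memory: 4Gi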