[Bug] ~5x performance degradation from generation webhook in v1.11.4

Open chuasweechin opened this issue 2 years ago • 6 comments

Kyverno Version

1.11.4

Kubernetes Version

1.26.x

Kubernetes Platform

Bare metal

Kyverno Rule Type

Generate

Description

Kyverno v1.11.4 shows a ~5x performance degradation in the generation webhook compared to Kyverno v1.9.4.

Kyverno v1.11.4: [screenshot, 2024-04-18 4:47:33 PM]

Kyverno v1.9.4: [screenshot, 2024-04-18 5:04:33 PM]

Steps to reproduce

  1. Install cluster policy
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-zk-kafka-configmap
spec:
  rules:
  - name: generate-zk-kafka-configmap
    match:
      any:
      - resources:
          kinds:
            - Namespace
          names:
            - sc-gen-test-*
    generate:
      synchronize: true
      apiVersion: v1
      kind: ConfigMap
      name: zk-kafka-address
      namespace: "{{request.object.metadata.name}}"
      data:
        kind: ConfigMap
        data:
          ZK_ADDRESS: "192.168.10.10:2181,192.168.10.11:2181,192.168.10.12:2181-playtime"
  2. Create 500 namespaces
for i in {1..500}
do
   kubectl create ns sc-gen-test-$i
done
  3. Wait for all namespaces to be created

  4. Collect creation timestamp for namespace and configmap

echo "TEST_NAMESPACE|||TEST_NAMESPACE_CREATION_TIMESTAMP|||CONFIGMAP_NAME|||CONFIGMAP_CREATION_TIMESTAMP"
for i in {1..500}
do
  TEST_NAMESPACE=sc-gen-test-$i
  TEST_NAMESPACE_CREATION_TIMESTAMP=$(kubectl get ns $TEST_NAMESPACE --no-headers -o=custom-columns=CREATION:metadata.creationTimestamp | awk '$1 {print $1}')

  CONFIGMAP_NAME=zk-kafka-address
  CONFIGMAP_CREATION_TIMESTAMP=$(kubectl get configmap $CONFIGMAP_NAME -n $TEST_NAMESPACE --no-headers -o=custom-columns=CREATION:metadata.creationTimestamp | awk '$1 {print $1}')

  echo "$TEST_NAMESPACE|||$TEST_NAMESPACE_CREATION_TIMESTAMP|||$CONFIGMAP_NAME|||$CONFIGMAP_CREATION_TIMESTAMP"
done
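
The "max latency" figures discussed in this thread can be derived from the `|||`-delimited output of the collection loop above. The following is a minimal sketch, not part of the original report: it assumes GNU `date` (for the `-d` option) and an `awk` that accepts a regular expression as field separator, and the function name `latencies` is purely illustrative.

```shell
# Read "NAMESPACE|||NS_TIMESTAMP|||CONFIGMAP|||CM_TIMESTAMP" lines on stdin and
# print "<namespace> <seconds between namespace and configmap creation>".
# The header row and rows with a missing timestamp are skipped.
latencies() {
  awk -F'[|][|][|]' '$1 != "TEST_NAMESPACE" && $2 != "" && $4 != "" { print $1, $2, $4 }' |
  while read -r ns ns_ts cm_ts; do
    ns_epoch=$(date -u -d "$ns_ts" +%s)   # assumes GNU date for timestamp parsing
    cm_epoch=$(date -u -d "$cm_ts" +%s)
    echo "$ns $((cm_epoch - ns_epoch))"
  done
}
```

With the collection output saved to a file (hypothetically `timestamps.txt`), `latencies < timestamps.txt | sort -k2,2n | tail -n 1` prints the namespace with the highest latency.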

Expected behavior

N.A.

Screenshots

No response

Kyverno logs

No response

Slack discussion

No response

Troubleshooting

  • [X] I have read and followed the documentation AND the troubleshooting guide.
  • [X] I have searched other issues in this repository and mine is not recorded.

chuasweechin avatar Apr 18 '24 09:04 chuasweechin

Related issue: https://github.com/kyverno/kyverno/issues/9633

chuasweechin avatar Apr 18 '24 09:04 chuasweechin

Also note that there are significant changes for generate in 1.10 with synchronization on:

https://kyverno.io/blog/2023/05/30/kyverno-1.10-released/#generate-rule-refactoring

realshuting avatar Apr 18 '24 09:04 realshuting

@realshuting Is the performance hit an expected tradeoff from the generate-rule-refactoring in v1.10.x onwards?

chuasweechin avatar Apr 18 '24 09:04 chuasweechin

> @realshuting Is the performance hit an expected behaviour from the generate-rule-refactoring in v1.10.x onwards?

As the background controller performs all tasks after the admission request, some "delay" is expected when generating resources in the background. The other aspect of measuring performance is overall memory consumption, and we have observed increased memory usage of the background controller in 1.10+. We are looking for continuous optimizations on both.

Currently the background controller has leader election enabled. It should help in both aspects if we can distribute work across all available replicas.

realshuting avatar Apr 18 '24 09:04 realshuting

@realshuting So can I say that, until more optimizations are put in place in future releases, the performance overhead I see now is expected? Do you have a rough roadmap for when some of these optimizations will come in?

chuasweechin avatar Apr 18 '24 10:04 chuasweechin

Or is there any way to turn off the new feature that watches for changes in the trigger resource, so as to optimize performance?

chuasweechin avatar Apr 18 '24 10:04 chuasweechin

I tested 1.12.5 with the default memory/CPU settings for the background controller; the max latency between the namespace and configmap creation is around 75s:

sc-gen-test-1|||2024-07-31T11:32:44Z|||zk-kafka-address|||2024-07-31T11:32:44Z
...
sc-gen-test-500|||2024-07-31T11:33:06Z|||zk-kafka-address|||2024-07-31T11:34:21Z

It doesn't seem to be a huge delay. I wonder what could be the difference between our tests?

realshuting avatar Jul 31 '24 11:07 realshuting

I tested with 1.9.4 and the max latency is 29s, which is less than the 75s with 1.12.5:

sc-gen-test-500|||2024-08-02T14:32:05Z|||zk-kafka-address|||2024-08-02T14:32:34Z

Note that Kyverno 1.9 runs a single controller, whereas 1.12 uses both the admission controller and the background controller when processing the generate rule. I wonder if it's related to the structure of Kyverno.

realshuting avatar Aug 02 '24 14:08 realshuting

In April 2024, I was testing against Kyverno v1.9.4 vs v1.11.4.

The max latency I experienced on my end is 2 sec for v1.9.4 vs 12 sec for v1.11.4. The latency in your lab tests for v1.9.4 seems to be 10x worse than in my tests though.

I haven't tried this out in v1.12.5, but I suppose the latency would still persist given that the structural change was introduced in v1.10.x.


Kyverno v1.9.4: [screenshot, 2024-04-18 5:04:33 PM]

Kyverno v1.11.4: [screenshot, 2024-04-18 4:47:33 PM]

chuasweechin avatar Aug 03 '24 06:08 chuasweechin

Oh ya, I was running the different Kyverno versions with 4 CPUs and 4 GB of memory. These are the args I am running the generation worker with:

  • --clientRateLimitQPS=100
  • --clientRateLimitBurst=100
  • --genWorkers=50

chuasweechin avatar Aug 03 '24 06:08 chuasweechin