karmada [umbrella] Karmada proxy performance improve

We have benchmark test for proxy feature of karmada-search. This issue reports the test result, and tracks the improvement works.

Test Enviroment

100 clusters
2 millions pods

Test Records

Cache building duration: 1m18s
CPU cost: 6-12C

Mem cost: 140G

request duration:

Name                         | N      | Min     | Median  | Mean    | StdDev  | Max
=========================================================================================
List (10clusters, 500k pods) | 3224   | 601.9ms | 787.4ms | 928ms   | 344ms   | 2.597s
Get pod [duration]           | 301855 | 1.7ms   | 10ms    | 9.9ms   | 3.6ms   | 218.5ms
Update pod [duration]        | 31372  | 8.5ms   | 19.2ms  | 47.9ms  | 92.8ms  | 2.0501ms

Heap frame graph:

Raw files

karmada-search-log.zip

Sep 16 '22 03:09 ikaven1024

Good job! Looking forward to improvements~~

Sep 16 '22 03:09 RainbowMango

Why list spend much time?

In the listing test, ResourceVersion is set to be 0, objects will be returned from local cache in Cacher.

_, err := kubeClient.CoreV1().Pods(testNamespace).List(context.TODO(), metav1.ListOptions{ResourceVersion: "0", Limit: 500})

But in cache limit doesn't work (see as below), and returns all the objects, almost 10 thousands pods in one member cluster cache, rather than 500. So it spends too much time on converting and transferring large data.

https://github.com/karmada-io/karmada/blob/614e28508336d6c03a938ce1bf0678dafef034f0/vendor/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L664-L683

shall we cut the returned objects before response?

Now the behavior is stay the same with single cluster, so i don't think we shall change it.

Test again

I change the test code to below, removing the ResourceVersion option, and will get pod from member cluster apiServer.

_, err := kubeClient.CoreV1().Pods(testNamespace).List(context.TODO(), metav1.ListOptions{Limit: 500})

Test Enviroment

10 clusters
500 thousands pods

List from member cluster directly (baseline)

  List pods
  Name                 | N    | Min     | Median  | Mean    | StdDev  | Max
  =============================================================================
  List pods [duration] | 6496 | 220.4ms | 380.3ms | 461.6ms | 270.8ms | 1.6186s

List from search

  List pods
  Name                 | N    | Min     | Median  | Mean  | StdDev | Max
  =========================================================================
  List pods [duration] | 3224 | 601.9ms | 787.4ms | 928ms | 344ms  | 2.597s

It roughly spends twice time than from member cluster directly, as requests are sent to proxy, and then proxied to member cluster apiserver, it has longer chain.

Sep 20 '22 07:09 ikaven1024

Good job!

In the listing test,

Where is the test code?

Sep 20 '22 12:09 RainbowMango

Good job!

In the listing test,

Where is the test code?

It's still in my fork repo: https://github.com/ikaven1024/karmada/blob/benchmark/test/benchmark/search_test.go#L90

Sep 20 '22 12:09 ikaven1024

Do we need to pull the test code as well?

Sep 20 '22 12:09 XiShanYongYe-Chang

Do we need to pull the test code as well?

pull it in: https://github.com/karmada-io/karmada/pull/2557

Sep 21 '22 06:09 ikaven1024

It roughly spends twice time than from member cluster directly, as requests are sent to proxy, and then proxied to member cluster apiserver, it has longer chain.

Just a question: Does the list need to be obtained from the member cluster by proxy? Not cache in the control plane?

Sep 22 '22 02:09 XiShanYongYe-Chang

Just a question: Does the list need to be obtained from the member cluster by proxy? Not cache in the control plane?

I think we should keep the semantics of resourceVersion. When list with resourceVersion unset, query from backend, just as single cluster do.

resourceVersion unset	resourceVersion="0"	resourceVersion="{value other than 0}"
Most Recent	Any	Not older than

Sep 22 '22 03:09 ikaven1024

I think we should keep the semantics of resourceVersion. When list with resourceVersion unset, query from backend, just as single cluster do.

This is the logic of the current search code, right?

Sep 22 '22 03:09 XiShanYongYe-Chang

This is the logic of the current search code, right?

This logic is actually implemented in Cacher, we cannot change it. https://github.com/karmada-io/karmada/blob/614e28508336d6c03a938ce1bf0678dafef034f0/vendor/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L626-L628

Sep 22 '22 03:09 ikaven1024

Ok. I got it, thanks~ Before that, I've always had a misconception that Cacher refers to caching in the search component.

Sep 22 '22 03:09 XiShanYongYe-Chang

karmada karmada copied to clipboard

[umbrella] Karmada proxy performance improve

Test Enviroment

Test Records

Raw files

Why list spend much time?

shall we cut the returned objects before response?

Test again

List from member cluster directly (baseline)

List from search

karmada
karmada copied to clipboard