karmada icon indicating copy to clipboard operation
karmada copied to clipboard

[umbrella] Karmada proxy performance improve

Open ikaven1024 opened this issue 3 years ago • 11 comments

We have benchmark test for proxy feature of karmada-search. This issue reports the test result, and tracks the improvement works.

Test Enviroment

  • 100 clusters
  • 2 millions pods

Test Records

  • Cache building duration: 1m18s
  • CPU cost: 6-12C

image

  • Mem cost: 140G

image

  • request duration:

    Name                         | N      | Min     | Median  | Mean    | StdDev  | Max
    =========================================================================================
    List (10clusters, 500k pods) | 3224   | 601.9ms | 787.4ms | 928ms   | 344ms   | 2.597s
    Get pod [duration]           | 301855 | 1.7ms   | 10ms    | 9.9ms   | 3.6ms   | 218.5ms
    Update pod [duration]        | 31372  | 8.5ms   | 19.2ms  | 47.9ms  | 92.8ms  | 2.0501ms
    
  • Heap frame graph:

image

Raw files

karmada-search-log.zip

ikaven1024 avatar Sep 16 '22 03:09 ikaven1024

Good job! Looking forward to improvements~~

RainbowMango avatar Sep 16 '22 03:09 RainbowMango

Why list spend much time?

In the listing test, ResourceVersion is set to be 0, objects will be returned from local cache in Cacher.

_, err := kubeClient.CoreV1().Pods(testNamespace).List(context.TODO(), metav1.ListOptions{ResourceVersion: "0", Limit: 500})

But in cache limit doesn't work (see as below), and returns all the objects, almost 10 thousands pods in one member cluster cache, rather than 500. So it spends too much time on converting and transferring large data.

https://github.com/karmada-io/karmada/blob/614e28508336d6c03a938ce1bf0678dafef034f0/vendor/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L664-L683

shall we cut the returned objects before response?

Now the behavior is stay the same with single cluster, so i don't think we shall change it.

Test again

I change the test code to below, removing the ResourceVersion option, and will get pod from member cluster apiServer.

_, err := kubeClient.CoreV1().Pods(testNamespace).List(context.TODO(), metav1.ListOptions{Limit: 500})

Test Enviroment

  • 10 clusters
  • 500 thousands pods

List from member cluster directly (baseline)

  List pods
  Name                 | N    | Min     | Median  | Mean    | StdDev  | Max
  =============================================================================
  List pods [duration] | 6496 | 220.4ms | 380.3ms | 461.6ms | 270.8ms | 1.6186s

List from search

  List pods
  Name                 | N    | Min     | Median  | Mean  | StdDev | Max
  =========================================================================
  List pods [duration] | 3224 | 601.9ms | 787.4ms | 928ms | 344ms  | 2.597s

It roughly spends twice time than from member cluster directly, as requests are sent to proxy, and then proxied to member cluster apiserver, it has longer chain.

ikaven1024 avatar Sep 20 '22 07:09 ikaven1024

Good job!

In the listing test,

Where is the test code?

RainbowMango avatar Sep 20 '22 12:09 RainbowMango

Good job!

In the listing test,

Where is the test code?

It's still in my fork repo: https://github.com/ikaven1024/karmada/blob/benchmark/test/benchmark/search_test.go#L90

ikaven1024 avatar Sep 20 '22 12:09 ikaven1024

Do we need to pull the test code as well?

XiShanYongYe-Chang avatar Sep 20 '22 12:09 XiShanYongYe-Chang

Do we need to pull the test code as well?

pull it in: https://github.com/karmada-io/karmada/pull/2557

ikaven1024 avatar Sep 21 '22 06:09 ikaven1024

It roughly spends twice time than from member cluster directly, as requests are sent to proxy, and then proxied to member cluster apiserver, it has longer chain.

Just a question: Does the list need to be obtained from the member cluster by proxy? Not cache in the control plane?

XiShanYongYe-Chang avatar Sep 22 '22 02:09 XiShanYongYe-Chang

Just a question: Does the list need to be obtained from the member cluster by proxy? Not cache in the control plane?

I think we should keep the semantics of resourceVersion. When list with resourceVersion unset, query from backend, just as single cluster do.

resourceVersion unset resourceVersion="0" resourceVersion="{value other than 0}"
Most Recent Any Not older than

ikaven1024 avatar Sep 22 '22 03:09 ikaven1024

I think we should keep the semantics of resourceVersion. When list with resourceVersion unset, query from backend, just as single cluster do.

This is the logic of the current search code, right?

XiShanYongYe-Chang avatar Sep 22 '22 03:09 XiShanYongYe-Chang

This is the logic of the current search code, right?

This logic is actually implemented in Cacher, we cannot change it. https://github.com/karmada-io/karmada/blob/614e28508336d6c03a938ce1bf0678dafef034f0/vendor/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L626-L628

ikaven1024 avatar Sep 22 '22 03:09 ikaven1024

Ok. I got it, thanks~ Before that, I've always had a misconception that Cacher refers to caching in the search component.

XiShanYongYe-Chang avatar Sep 22 '22 03:09 XiShanYongYe-Chang