metrics-server Automate testing scalability of Metrics Server.

We should mirror the work done in kube-state-metrics and introduce automated scalability tests: https://github.com/kubernetes/kube-state-metrics/issues/1341

Steps:

Integrate with scalability tests (example PR https://github.com/kubernetes/perf-tests/pull/1761/files)
Measure resource usage and request latency (example PR https://github.com/kubernetes/perf-tests/pull/1684#issuecomment-772355405)
Deploy some dummy HPAs to put load on MS (should be discussed with scalability team)
Document how to access and use scalability test results

/kind feature

Apr 22 '21 10:04 serathius

/help

Apr 22 '21 10:04 serathius

@serathius: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Apr 22 '21 10:04 k8s-ci-robot

I will have a look. If anyone else is interested, we can discuss it together.

Apr 22 '21 14:04 yangjunmyfm192085

/assign

Apr 23 '21 01:04 yangjunmyfm192085

I think the current job is:

1. Integrate with scalability tests:
- add manifest files to path: /kubernetes/perf-tests/clusterloader2/pkg/prometheus/manifests/exporters/kube-metrics-server/
- modify file: /kubernetes/perf-tests/clusterloader2/pkg/prometheus/prometheus.go

2. Measure resource usage and request latency:
- add file: /kubernetes/perf-tests/clusterloader2/pkg/measurement/common/kube_metrics_server_measurement.go

Am I right? /cc @serathius @wojtek-t

We can do it together @sanwishe @lunhuijie.

Apr 23 '21 09:04 yangjunmyfm192085

Metrics Server is already deployed in our scalability tests, so there is no need to change anything there.

What is missing is a measurement of metrics that reflect its performance (i.e. add a metrics-server measurement). It's mostly about latency and things like that; for resource usage we should already have these metrics (or it's super simple to add them to the existing resource-usage measurements).

Apr 23 '21 09:04 wojtek-t

cc @mborsz @jkaniuk @marseel

Apr 23 '21 09:04 wojtek-t

Here is example output of resource usage on our 5k scalability test: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1385277590951956480/artifacts/ResourceUsageSummary_load_2021-04-22T21:01:07Z.json

{
  "Name": "metrics-server-v0.3.6-58bc6d979c-xjnq5/metrics-server",
  "CPU": 2.45379102,
  "Mem": 3847835648
},
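
As a rough sketch of how such an entry could be read programmatically: the snippet below parses the excerpt above and converts the memory figure to GiB. It assumes CPU is reported in cores and Mem in bytes (the thread does not state the units), and the list wrapper around the entry is also an assumption about the file layout.

```python
import json

# Entry copied from the summary above. The enclosing JSON list is an
# assumption; the real file may nest entries differently.
SUMMARY = """
[
  {
    "Name": "metrics-server-v0.3.6-58bc6d979c-xjnq5/metrics-server",
    "CPU": 2.45379102,
    "Mem": 3847835648
  }
]
"""


def summarize(entries):
    """Map container name to (cpu, mem_gib).

    Assumes CPU is in cores and Mem is in bytes.
    """
    return {e["Name"]: (e["CPU"], e["Mem"] / 2**30) for e in entries}


usage = summarize(json.loads(SUMMARY))
for name, (cpu, mem_gib) in usage.items():
    print(f"{name}: {cpu:.2f} cores, {mem_gib:.2f} GiB")
```

Under those unit assumptions, the entry corresponds to roughly 2.45 cores and 3.58 GiB of memory.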

Apr 23 '21 10:04 marseel

Looks like the scalability tests deploy manifests from https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/metrics-server, which run an older version of Metrics Server with autoscaling enabled. Would it be possible to test the latest version of Metrics Server instead of the latest release?

Apr 23 '21 10:04 serathius

Let's start with having tests and then we can discuss which version. I wouldn't exclude that, but I'm also not sure if that's actually the most important thing.

Apr 23 '21 10:04 wojtek-t

@yangjunmyfm192085 Can you look into implementing the third point about measuring request latency, like in kubernetes/perf-tests#1684?

Apr 23 '21 11:04 serathius

@yangjunmyfm192085 Can you look into implementing the third point about measuring request latency, like in kubernetes/perf-tests#1684?

OK, let me have a look.

Apr 23 '21 11:04 yangjunmyfm192085

@serathius We are working on this issue, and an error occurs when I try to access it this way:

# curl --cacert /etc/kubernetes/certs/ca.crt --cert /etc/kubernetes/certs/kubecfg.crt --key /etc/kubernetes/certs/kubecfg.key  https://IP:PORT/api/v1/namespaces/kube-system/services/metrics-server:443/proxy/metrics
Client sent an HTTP request to an HTTPS server.

Is there a suggested way to get the latency of metrics-server itself?

Apr 28 '21 09:04 sanwishe

I don't know how to use the service/proxy you are using to connect to an HTTPS endpoint. Alternatives I know:

Connect to MS directly. This requires the connection to be made from the cluster network, so it might not work here.
Use pods/portforward instead. This can be done by using kubectl or by writing some code.
- First set up the proxy: kubectl port-forward -n kube-system metrics-server-pod 4443:4443 & Then curl the local port over HTTPS: curl -k https://localhost:4443/metrics
- Use this code https://github.com/kubernetes-sigs/metrics-server/blob/master/test/e2e_test.go#L255
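
Once the /metrics output is reachable, a latency percentile can be estimated from a histogram in the Prometheus text format. The following is only a sketch: the metric name example_request_duration_seconds and the sample values are hypothetical (not taken from Metrics Server), and the interpolation is a simplified version of what Prometheus' histogram_quantile does.

```python
import re

# Hypothetical /metrics excerpt; the metric name and values are made up.
SAMPLE = """\
# HELP example_request_duration_seconds Hypothetical request latency.
# TYPE example_request_duration_seconds histogram
example_request_duration_seconds_bucket{le="0.1"} 50
example_request_duration_seconds_bucket{le="0.5"} 90
example_request_duration_seconds_bucket{le="1"} 100
example_request_duration_seconds_bucket{le="+Inf"} 100
example_request_duration_seconds_sum 23.5
example_request_duration_seconds_count 100
"""

BUCKET_LINE = re.compile(r'^(\w+)_bucket\{.*?le="([^"]+)".*?\}\s+(\S+)$')


def parse_buckets(text, metric):
    """Return sorted (upper_bound, cumulative_count) pairs for one histogram."""
    found = []
    for line in text.splitlines():
        match = BUCKET_LINE.match(line)
        if match and match.group(1) == metric:
            found.append((float(match.group(2)), float(match.group(3))))
    return sorted(found)


def estimate_quantile(q, buckets):
    """Estimate quantile q by linear interpolation inside the matching bucket."""
    if not buckets or buckets[-1][1] == 0:
        return None  # no observations recorded
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf") or count == prev_count:
                return prev_bound  # cannot interpolate; use the lower bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


buckets = parse_buckets(SAMPLE, "example_request_duration_seconds")
p99 = estimate_quantile(0.99, buckets)
print(f"estimated p99 latency: {p99:.2f}s")
```

In a real measurement the text would come from the port-forwarded endpoint rather than a constant, and label sets other than le would need to be grouped separately.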

Apr 28 '21 10:04 serathius

Thanks, this helps a lot.

Apr 28 '21 10:04 sanwishe

Hi @serathius @wojtek-t, we have submitted PR https://github.com/kubernetes/perf-tests/pull/1797. Could you please review whether it works?

May 07 '21 06:05 yangjunmyfm192085

Done

May 07 '21 09:05 serathius

Hi @wojtek-t, @marseel, @mborsz, @mm4tt, the PR https://github.com/kubernetes-sigs/metrics-server/issues/710 has been merged. Do we need to discuss the step "Deploy some dummy HPAs to put load on MS (should be discussed with scalability team)"?

Jun 01 '21 14:06 yangjunmyfm192085

ping @wojtek-t @mborsz

Jun 11 '21 16:06 serathius

Hi @wojtek-t, @marseel, @mborsz, @mm4tt, the PR #710 has been merged. Do we need to discuss the step "Deploy some dummy HPAs to put load on MS (should be discussed with scalability team)"?

Deploying dummy HPAs is easy in the sense of deploying them. The things we would like to figure out are:

how to avoid additional significant churn in the cluster (i.e. I would like to avoid a non-negligible amount of scale-ups/downs triggered by HPA)
at the same time, how to ensure that this is actually useful and that we check something (although from monitoring-server perspective only, maybe that's not critical)
and how to avoid changing any images/work characteristics of what our pods (mostly dns-related or pause pods) are doing

Also adding @jkaniuk as he was thinking about that in a different context. @tosi3k @jprzychodzen - FYI

Jun 14 '21 13:06 wojtek-t

how to avoid additional significant churn in the cluster (i.e. I would like to avoid a non-negligible amount of scale-ups/downs triggered by HPA)

This can be done by setting replicaCount = maxReplicaCount = minReplicaCount. This way HPA just measures utilization but doesn't take any action.

at the same time, how to ensure that this is actually useful and that we check something (although from monitoring-server perspective only, maybe that's not critical)

We can check whether HPA utilization is calculated; if MS doesn't work, the values will not be set.

and avoid changing any images/work characteristic that our pods (mostly dns-related or pause pods) are doing.

We can use the resource-consumer image maintained as part of test-infra. I propose this image because it will use a non-zero amount of CPU, so we can use the check from the second point. If not, pause pods should also be OK.
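
To make the first two points concrete, here is a minimal sketch that builds an autoscaling/v1 HPA manifest with minReplicas equal to maxReplicas, plus a helper that checks whether utilization has been reported in the HPA status. The function names, the Deployment name, the namespace and the 50% target are all hypothetical; the target Deployment is assumed to run with the same replica count.

```python
import json


def pinned_hpa(name, namespace, replicas, target_cpu_percent=50):
    """Build an HPA that observes utilization but can never scale.

    minReplicas == maxReplicas, so the controller has no room to act.
    Assumes a Deployment called `name` exists with `replicas` replicas.
    """
    return {
        "apiVersion": "autoscaling/v1",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            "minReplicas": replicas,
            "maxReplicas": replicas,
            "targetCPUUtilizationPercentage": target_cpu_percent,
        },
    }


def utilization_reported(hpa_object):
    """True if the HPA status carries a CPU utilization value.

    If Metrics Server is not serving metrics, this field stays unset.
    """
    status = hpa_object.get("status", {})
    return status.get("currentCPUUtilizationPercentage") is not None


# Hypothetical names, for illustration only.
manifest = pinned_hpa("dummy-load", "test-namespace", replicas=3)
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests, so the printed output could be piped to kubectl apply -f - in a test cluster.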

Jun 14 '21 14:06 serathius

/cc @jkaniuk @tosi3k @jprzychodzen

Jul 13 '21 12:07 yangjunmyfm192085

@yangjunmyfm192085 are there still things to be done as part of this issue?

Sep 16 '21 09:09 dgrisonnet

I think this issue is finished.

Sep 16 '21 10:09 yangjunmyfm192085

/close

Sep 16 '21 10:09 yangjunmyfm192085

@yangjunmyfm192085: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sep 16 '21 10:09 k8s-ci-robot

/cc @serathius

Sep 16 '21 10:09 yangjunmyfm192085

I don't agree with the statement that this issue should be closed. In the original scope I proposed that we should document how to run and use results from scalability tests. Without this step this work would be useless.

We need to have a way to include scalability tests in our release process. Without this we just burn CPU for nothing.

Sep 16 '21 10:09 serathius

OK, I think I missed this step. I will continue to look into this.

Sep 16 '21 10:09 yangjunmyfm192085

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Dec 15 '21 11:12 k8s-triage-robot
