
Optimization is not working - Azure AKS - v1.25.6

Open zohebk8s opened this issue 1 year ago • 19 comments

Hi Team,

First of all, this looks like a new tool, and it could play an important role.

I just quickly tested it in Azure AKS v1.25.6. Below are my findings/comments:

  1. First, a small correction in the helm install command: we should specify the release name as well when installing.

Incorrect: helm install kube-reqsizer/kube-reqsizer
Correct:   helm install kube-reqsizer kube-reqsizer/kube-reqsizer

  2. I've deployed a basic application in the default namespace with high CPU/memory requests to test whether kube-reqsizer will optimize it or not. I waited for 22 minutes, but it was still the same.

  3. Logs for your reference:

I0530 15:58:39.252063 1 request.go:601] Waited for 1.996392782s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/argocd
I0530 15:58:49.252749 1 request.go:601] Waited for 1.995931495s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/argocd
I0530 15:58:59.450551 1 request.go:601] Waited for 1.994652278s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/argocd
I0530 15:59:09.450621 1 request.go:601] Waited for 1.994074539s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/kube-system
I0530 15:59:19.450824 1 request.go:601] Waited for 1.99598317s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/kubescape
I0530 15:59:29.650328 1 request.go:601] Waited for 1.993913908s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/tigera-operator
I0530 15:59:39.650831 1 request.go:601] Waited for 1.996110718s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/kubescape
I0530 15:59:49.850897 1 request.go:601] Waited for 1.995571438s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/kube-system
I0530 16:00:00.049996 1 request.go:601] Waited for 1.994819712s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/calico-system
I0530 16:00:10.050864 1 request.go:601] Waited for 1.991681441s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/default

  4. How much time will it take to optimize? Will it restart the pod automatically?

  5. I haven't customized any values; I just used the commands below to install.

helm repo add kube-reqsizer https://jatalocks.github.io/kube-reqsizer/
helm repo update
helm install kube-reqsizer kube-reqsizer/kube-reqsizer

zohebk8s avatar May 30 '23 16:05 zohebk8s

Hey @zohebk8s , thanks for trying out the tool.

I've seen this happen to different people, and it seems like the kube API is too slow for the chart's default configuration. To work around this, you need to set concurrentWorkers to 1.
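For example, a minimal sketch of that override, assuming the chart exposes it as a top-level concurrentWorkers value (check the chart's values.yaml for the exact key):

helm upgrade --install kube-reqsizer kube-reqsizer/kube-reqsizer --set concurrentWorkers=1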

This issue had the same problem as yours. Please see the correspondence here:

https://github.com/jatalocks/kube-reqsizer/issues/30

Thanks! Let me know how it goes.

ElementTech avatar May 30 '23 16:05 ElementTech

https://github.com/jatalocks/kube-reqsizer/issues/30#issuecomment-1566779576

ElementTech avatar May 30 '23 16:05 ElementTech

@jatalocks Thanks for your response.

I've updated concurrentWorkers to "1", and the value of min-seconds in kube-reqsizer is also "1", as shown below. But it's still not updating the values. Am I missing something here?

(screenshots attached)

I've added the below annotations to that deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployment
  annotations:
    reqsizer.jatalocks.github.io/optimize: "true"  # Ignore Pod/Namespace when optimizing entire cluster
    reqsizer.jatalocks.github.io/mode: "average"   # Default Mode. Optimizes based on average. If omitted, mode is average
    reqsizer.jatalocks.github.io/mode: "max"       # Sets the request to the MAXIMUM of all sample points
    reqsizer.jatalocks.github.io/mode: "min"       # Sets the request to the MINIMUM of all sample points

zohebk8s avatar May 30 '23 16:05 zohebk8s

Hey @zohebk8s, can you send a screenshot of the logs now (a few minutes after the controller has started working)? It might take some minutes for it to resize.

ElementTech avatar May 30 '23 17:05 ElementTech

Also, try adding the "optimize" annotation to the namespace this deployment is in
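For example, one way to do that with kubectl, assuming the namespace-level annotation uses the same key as the Deployment annotation above:

kubectl annotate namespace default reqsizer.jatalocks.github.io/optimize=true --overwrite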

ElementTech avatar May 30 '23 17:05 ElementTech

I've added the annotation to the default namespace, where this deployment is running. But the values are still the same; they didn't change.

The utilization of the pods is very normal, and I was expecting a change/optimization from kube-reqsizer. In the requests, I've specified the values below: cpu: "100m", memory: 400Mi.
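For context, that corresponds to a requests block roughly like this (a sketch based on the values above; the configured limits are omitted here):

resources:
  requests:
    cpu: "100m"     # request as stated above
    memory: 400Mi   # request as stated above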

I've attached the full log file (kube-reqsizer-controller-manager-795bbd7677-dl4xx-logs.txt) for your reference; please see the attached txt file.

(screenshots attached)

zohebk8s avatar May 30 '23 18:05 zohebk8s

@zohebk8s it appears to be working. If you gave it time through the night, did it eventually work? It might take some time with concurrentWorkers=1, but eventually it will have enough data in the cache to make the decision.

ElementTech avatar May 31 '23 08:05 ElementTech

From the logs, it looks like it's working, but it's not resizing/optimizing the workload. I still see no changes in the CPU/memory requests for that deployment. Usually, it should not take this much time to take action.

(screenshot attached)

zohebk8s avatar May 31 '23 09:05 zohebk8s

@jatalocks Even so, you can see the cache sample count is 278. Do you think this data is not enough for decision-making? Is there a specific number of samples it collects before taking a decision?

(screenshot attached)

zohebk8s avatar May 31 '23 09:05 zohebk8s

That's odd; it should have worked immediately. I think something is preventing it from resizing. What are your values/configuration? You should make sure minSeconds=1 and sampleSize=1 as well.
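One way to double-check is to inspect the controller's arguments directly, assuming the chart installs it as a Deployment named kube-reqsizer-controller-manager (the pod name in the attached log suggests so):

kubectl get deployment kube-reqsizer-controller-manager -o jsonpath='{.spec.template.spec.containers[0].args}'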

ElementTech avatar May 31 '23 09:05 ElementTech

The configuration should match what's on the top of the Readme (except concurrentWorkers=1)

ElementTech avatar May 31 '23 09:05 ElementTech

Already it's "1" for concurrent-workers, minSeconds & sampleSize.

It's Azure AKS v1.25.6, and the default namespace is Istio-injected. I hope it's not something specific to Istio.

configuration:

spec:
  containers:
    - args:
        - --health-probe-bind-address=:8081
        - --metrics-bind-address=:8080
        - --leader-elect
        - --annotation-filter=true
        - --sample-size=1
        - --min-seconds=1
        - --zap-log-level=info
        - --enable-increase=true
        - --enable-reduce=true
        - --max-cpu=0
        - --max-memory=0
        - --min-cpu=0
        - --min-memory=0
        - --min-cpu-increase-percentage=0
        - --min-memory-increase-percentage=0
        - --min-cpu-decrease-percentage=0
        - --min-memory-decrease-percentage=0
        - --cpu-factor=1
        - --memory-factor=1
        - --concurrent-workers=1
        - --enable-persistence=true
        - --redis-host=kube-reqsizer-redis-master

zohebk8s avatar May 31 '23 09:05 zohebk8s

What are the resource requirements for the deployments in the default namespace? The only thing I can think of is that it doesn't have anything to resize, so it just continues sampling the pods. Also, if there are no requests/limits to begin with, there's nothing to resize from. I'd check that the pods are configured with resources.
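For example, one quick way to list the CPU/memory requests per pod, assuming the workloads are in the default namespace:

kubectl get pods -n default -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'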

ElementTech avatar May 31 '23 10:05 ElementTech

I've defined requests/limits for this deployment, and the utilization is very low; that's the reason I thought of raising this question/issue.

(screenshots attached)

If it didn't have requests/limits, then as you said, it wouldn't work. But in this case, I've defined requests/limits, and the CPU/memory utilization is very low as well.

zohebk8s avatar May 31 '23 10:05 zohebk8s

I see that reqsizer has been alive for 11 minutes. I'd give it some more time for now, and I'll check if there's a specific problem with AKS.

ElementTech avatar May 31 '23 10:05 ElementTech

@jatalocks Thank you for your patience and responses. I feel this product can make a difference if it works properly, as it targets resource optimization, which translates directly into cost optimization.

zohebk8s avatar May 31 '23 10:05 zohebk8s

@jatalocks Is it a bug, or is some kind of enhancement required at the product level?

I hope the information I've shared is of help.

zohebk8s avatar Jun 02 '23 18:06 zohebk8s

@zohebk8s I think that by now, if the controller has been running continuously, the app should have already been resized.

ElementTech avatar Jun 02 '23 19:06 ElementTech

@ElementTech I see that @zohebk8s seems to be using Argo CD in this cluster. Could it be that Argo CD is directly undoing all the changes made to the Deployment's resources?
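If so, one common workaround is to tell Argo CD to ignore diffs on container resources so it doesn't revert the controller's changes. A rough sketch, assuming the workload is managed by an Argo CD Application (the Application name below is hypothetical, and with automated self-heal enabled the RespectIgnoreDifferences=true sync option is also needed):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-deployment-argo   # hypothetical Application managing the workload
spec:
  # ...source, destination and project omitted for brevity
  syncPolicy:
    syncOptions:
      - RespectIgnoreDifferences=true
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jqPathExpressions:
        - .spec.template.spec.containers[].resources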

darkxeno avatar Feb 09 '24 18:02 darkxeno