
Recommendations missing for some sidecars

Open Pionerd opened this issue 1 year ago • 25 comments

Describe the bug We do not get a recommendation for some containers if there are multiple of them in a pod, e.g. the Dapr sidecar. For other pods with multiple containers, e.g. the prometheus pod with the thanos-sidecar and config-reloader sidecars, we do get recommendations. The missing sidecars are not mentioned in any logs.

Expected behavior Recommendations for all containers in a pod, or at least a mention that there is no / not enough data to make a recommendation for that specific container.

Pionerd avatar May 08 '24 10:05 Pionerd

Hi, do you see anything when running krr with --verbose?

Any pattern to the sidecars? E.g. init containers vs regular containers.

aantn avatar May 12 '24 07:05 aantn

Hi @aantn, --verbose also does not give a mention of the missing containers.

They are regular containers in the sense that they are mentioned under containers in the spec of the pod. However, as they are 'injected' by Dapr (https://docs.dapr.io/concepts/dapr-services/sidecar/#kubernetes-with-dapr-sidecar-injector) they are not mentioned in the applicable deployment spec.

Pionerd avatar May 13 '24 15:05 Pionerd

Got it. So they are present in kubectl get pod <name> -o yaml but not in kubectl get deployment <name> -o yaml. Is that correct?

aantn avatar May 13 '24 17:05 aantn
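For readers following along: the distinction aantn is probing can be checked programmatically. Below is a hypothetical sketch (not krr's actual code) that diffs a Deployment's pod template against a live Pod, both as plain dicts such as `kubectl get ... -o json` returns; any container in the Pod but not in the template was injected at admission time (e.g. by the Dapr sidecar injector) and is invisible to discovery that only reads the Deployment spec.

```python
def injected_containers(deployment: dict, pod: dict) -> set:
    """Return names of containers present in the Pod but absent from the
    Deployment's pod template -- i.e. containers injected at admission
    time, which API-server discovery based on the Deployment won't see."""
    template = deployment["spec"]["template"]["spec"]["containers"]
    declared = {c["name"] for c in template}
    running = {c["name"] for c in pod["spec"]["containers"]}
    return running - declared

# Minimal illustration with made-up specs:
deployment = {"spec": {"template": {"spec": {"containers": [{"name": "app"}]}}}}
pod = {"spec": {"containers": [{"name": "app"}, {"name": "daprd"}]}}
print(injected_containers(deployment, pod))  # {'daprd'}
```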

Yes, correct.

Pionerd avatar May 14 '24 16:05 Pionerd

Does it work on https://github.com/robusta-dev/krr/pull/266 if you choose the Prometheus loader? That will load info about workloads from Prometheus where the data on actual containers in the pod should be present.

aantn avatar May 14 '24 17:05 aantn

python krr.py simple -p http://localhost:9090 -m prometheus where localhost exposes the Prometheus service does indeed work!

Of course, we would prefer to have recommendations based on the longer-term Thanos data, but it looks like we can give the Thanos URL as the Prometheus URL, is that correct?

Pionerd avatar May 15 '24 13:05 Pionerd

Yes, exactly! Can you try and confirm that it works?

aantn avatar May 15 '24 13:05 aantn

My first few tests gave identical recommendations using both Thanos and Prometheus, but I found another cluster where Prometheus didn't contain enough data to give recommendations but Thanos did :) So yes, it works. Looking forward to seeing this merged and available via brew.

Pionerd avatar May 15 '24 13:05 Pionerd

Wonderful, thank you. To help get this merged, are you able to run a 3 way test:

  1. Results on master (NOT the new branch) - this will obviously exclude the extra containers but I'd like to check if the other data matches up
  2. Results on the new branch but WITHOUT the prometheus discovery flag. Just the standard API server discovery. Again, this will exclude the extra containers but we want to confirm that it is identical (or similar to) (1)
  3. Results on the new branch with the extra prometheus discovery flag

We're doing testing to confirm that there aren't unexpected regressions in the new branch which includes some very large changes to the codebase. So getting another confirmation on this from real world data would be extremely valuable.

aantn avatar May 15 '24 14:05 aantn

When performing the test, I actually observed that the current data without prometheus discovery is wrong for the pods that contain the multiple containers as discussed in this issue. This goes for both the master and the new branch:

According to the report, both the current memory request and limit for the main container are set at 1024Mi, while actually the request is 256Mi and the limit 512Mi (and the missing sidecar container has 128Mi/256Mi). This error happens for multiple (but strangely enough not all) containers with this Dapr sidecar pattern, which makes this issue somewhat more pressing than just some containers missing from the report.

In addition, the 'discovery' option gives me a lot of recommendations for resources that were not present in either of the non-discovery tests, but that is probably the point of this whole exercise :)

Apart from that, the recommendations given by all three are equal.

Pionerd avatar May 15 '24 16:05 Pionerd
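For reference, the correct behavior here is to read requests/limits strictly per named container, so sibling containers in the same pod can never contaminate each other's numbers. A minimal sketch (a hypothetical helper, not krr's actual code) of that per-container extraction:

```python
def container_allocations(pod: dict, container_name: str) -> dict:
    """Read the memory request/limit for one named container only, rather
    than merging values across all containers in the pod."""
    for c in pod["spec"]["containers"]:
        if c["name"] == container_name:
            res = c.get("resources", {})
            return {
                "request": res.get("requests", {}).get("memory"),
                "limit": res.get("limits", {}).get("memory"),
            }
    raise KeyError(container_name)

# Made-up pod matching the numbers discussed above:
pod = {"spec": {"containers": [
    {"name": "main", "resources": {"requests": {"memory": "256Mi"},
                                   "limits": {"memory": "512Mi"}}},
    {"name": "daprd", "resources": {"requests": {"memory": "128Mi"},
                                    "limits": {"memory": "256Mi"}}},
]}}
print(container_allocations(pod, "main"))  # {'request': '256Mi', 'limit': '512Mi'}
```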

Thanks. Can you share the lines from the reports which are wrong? Sounds like the prom discovery is working but we should also get that bug fixed for the api server method. (We likely will continue to exclude injected containers from the report, but the data on other containers should be correct.)


aantn avatar May 16 '24 05:05 aantn

@aantn I am not sure if it is the same issue, but on my side recommendations are also missing for some sidecars in regular deployments (not using Dapr at all):

{
  "object": {
    "cluster": "my-cluster",
    "name": "argocd-repo-server",
    "container": "argocd-vault-plugin-helm",
    "pods": [...],
    "hpa": {
      "min_replicas": 2,
      "max_replicas": 3,
      "current_replicas": 2,
      "desired_replicas": 2,
      "target_cpu_utilization_percentage": 50.0,
      "target_memory_utilization_percentage": 50.0
    },
    "namespace": "argocd",
    "kind": "Deployment",
    "allocations": {
      "requests": {
        "cpu": 0.01,
        "memory": 188743680.0
      },
      "limits": {
        "cpu": null,
        "memory": 314572800.0
      },
      "info": {}
    },
    "warnings": []
  },
  "recommended": {
    "requests": {
      "cpu": {
        "value": "?",
        "severity": "UNKNOWN"
      },
      "memory": {
        "value": "?",
        "severity": "UNKNOWN"
      }
    }
  }
}

I removed the pods from the JSON for readability, but I have a lot of them in the list. I see nothing specific in the debug logs concerning this particular pod, but all 7 sidecars (which are Argo CD plugins) have the same issue.

Also I checked that the following metrics (adapted from the documentation) are indeed returning data directly from Prometheus:

sum(container_memory_working_set_bytes{namespace="argocd", pod=~"argocd-repo-server-.*",container="argocd-vault-plugin-helm"})
sum(irate(container_cpu_usage_seconds_total{namespace="argocd", pod=~"argocd-repo-server-.*", container="argocd-vault-plugin-helm"}[5m]))

headyj avatar Jun 18 '24 11:06 headyj
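The two PromQL queries above can be issued against any Prometheus HTTP API endpoint. A small hedged sketch (hypothetical helper names; the `/api/v1/query` endpoint is standard Prometheus) that builds the same queries and runs them:

```python
import json
import urllib.parse
import urllib.request


def build_queries(namespace: str, workload: str, container: str) -> dict:
    """Build the memory and CPU PromQL queries quoted above for one
    workload/container pair."""
    sel = (f'namespace="{namespace}", pod=~"{workload}-.*", '
           f'container="{container}"')
    return {
        "memory": f"sum(container_memory_working_set_bytes{{{sel}}})",
        "cpu": f"sum(irate(container_cpu_usage_seconds_total{{{sel}}}[5m]))",
    }


def query(prom_url: str, promql: str) -> dict:
    """Run one instant query against the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(f"{prom_url}/api/v1/query?{params}") as resp:
        return json.load(resp)


qs = build_queries("argocd", "argocd-repo-server", "argocd-vault-plugin-helm")
print(qs["memory"])
# e.g. query("http://localhost:9090", qs["memory"]) against a port-forwarded Prometheus
```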

@headyj what is the output with https://github.com/robusta-dev/krr/pull/266 ?

aantn avatar Jun 18 '24 12:06 aantn

The exact same result, I'm afraid :-(

This is the command I executed:

python3.12 krr.py simple --logtostderr  -c my-cluster  --namespace argocd  --verbose  -m prometheus  -p http://localhost:9090  -f json --history_duration 720

Nothing specific in the logs either:

[...]
DEBUG    Adding historic pods for Deployment argocd/argocd-repo-server/argocd-vault-plugin-helm                                                                                                 prometheus_metrics_service.py:224
[...]
DEBUG    Gathering PercentileCPULoader metric for Deployment argocd/argocd-repo-server/argocd-vault-plugin-helm                                                                                 prometheus_metrics_service.py:198
[...]
DEBUG    Gathering MaxMemoryLoader metric for Deployment argocd/argocd-repo-server/argocd-vault-plugin-helm                                                                                     prometheus_metrics_service.py:198
[...]
DEBUG    Gathering CPUAmountLoader metric for Deployment argocd/argocd-repo-server/argocd-vault-plugin-helm                                                                                     prometheus_metrics_service.py:198
[...]
DEBUG    Gathering MemoryAmountLoader metric for Deployment argocd/argocd-repo-server/argocd-vault-plugin-helm                                                                                  prometheus_metrics_service.py:198
[...]
INFO     Calculated recommendations for Deployment argocd/argocd-repo-server/argocd-vault-plugin-helm (using 4 metrics)                                                                                             runner.py:179

headyj avatar Jun 18 '24 12:06 headyj

The containers appear in output, but there are no recommendations for them, correct? (You get a ? instead.)

Do they have an HPA defined as in the example you shared? If so, does --allow-hpa help?

aantn avatar Jun 18 '24 12:06 aantn

Indeed you're right, there's an HPA for that deployment, which I was not aware of... My bad, I should have checked that in the first place, as it obviously works perfectly with --allow-hpa (no need to set --mode).

Thanks again @aantn !

headyj avatar Jun 18 '24 13:06 headyj

Wonderful, happy to hear it!

aantn avatar Jun 18 '24 14:06 aantn

I'm running into a similar issue with injected sidecars, but for me it's for kuma

voidlily avatar Sep 27 '24 21:09 voidlily

Hi @aantn, I'm not getting recommendations (the containers don't appear in output) for sidecar container istio-proxy (Service Mesh)

louise-zhang avatar Sep 30 '24 02:09 louise-zhang

@louise-zhang does it work on the #266 branch?

aantn avatar Sep 30 '24 06:09 aantn

The #266 branch worked for kuma sidecars as well, but it's a bit annoying to have to port-forward and specify the Prometheus URL manually, where before it could just autodetect the Prometheus in my cluster: krr simple vs port-forwarding Prometheus and then krr simple --mode prometheus -p http://localhost:9090

voidlily avatar Sep 30 '24 16:09 voidlily

Got it, thanks for the feedback. It's unfortunately not something we're going to fix, as Prometheus mode was designed to work without any connection to the cluster.

aantn avatar Sep 30 '24 17:09 aantn

@louise-zhang does it work on the #266 branch?

Hi @aantn, thanks, I can confirm that it works on the #266 branch for the sidecar container istio-proxy (Service Mesh); I can see recommendations for those sidecar containers. What's the ETA for this release?

louise-zhang avatar Oct 01 '24 22:10 louise-zhang

Hi @louise-zhang, glad to hear it works. There is no ETA for release yet. I will update here when we have one. You can keep using from that branch though!

aantn avatar Oct 02 '24 14:10 aantn