Rancher reporting reserved CPU/memory and pod count incorrectly

Open erSitzt opened this issue 3 years ago • 22 comments

Rancher Server Setup

  • Rancher version: 2.6.3
  • Installation option (Docker install/Helm Chart): Helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE2
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: v1.21.5+rke2r1 / v1.21.7+rke2r2
  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): Imported RKE2

Describe the bug
I have three more or less empty clusters deployed with RKE2; only one of them seems to report correct values for reserved CPU/memory. Pods also seem to be missing from the count, but only on the Rancher home screen...

image

As you can see, the rke2-downstream2 cluster reports no reservations at all, and test-rke2 seems to report too many reservations for an empty cluster.

This is the output of the resource-capacity krew plugin for rke2-downstream2:

❯ kubectl resource-capacity
NODE                        CPU REQUESTS   CPU LIMITS    MEMORY REQUESTS   MEMORY LIMITS
*                           8020m (44%)    4120m (22%)   2387Mi (3%)       5323Mi (7%)
rke2-downstream2-agent-1    900m (22%)     1300m (32%)   284Mi (1%)        630Mi (3%)
rke2-downstream2-agent-2    1850m (46%)    1700m (42%)   1341Mi (7%)       4119Mi (24%)
rke2-downstream2-agent-3    820m (20%)     420m (10%)    252Mi (1%)        284Mi (1%)
rke2-downstream2-server-1   1450m (72%)    200m (10%)    126Mi (1%)        53Mi (0%)
rke2-downstream2-server-2   1450m (72%)    200m (10%)    126Mi (1%)        53Mi (0%)
rke2-downstream2-server-3   1550m (77%)    300m (15%)    261Mi (3%)        187Mi (2%)

and test-rke2:

❯ kubectl resource-capacity
NODE                 CPU REQUESTS   CPU LIMITS   MEMORY REQUESTS   MEMORY LIMITS
*                    6070m (33%)    220m (1%)    856Mi (1%)        290Mi (0%)
rke2-agent-node-1    600m (15%)     0Mi (0%)     95Mi (0%)         0Mi (0%)
rke2-agent-node-2    600m (15%)     0Mi (0%)     95Mi (0%)         0Mi (0%)
rke2-agent-node-3    600m (15%)     0Mi (0%)     95Mi (0%)         0Mi (0%)
rke2-server-node-1   1570m (78%)    220m (11%)   384Mi (4%)        290Mi (3%)
rke2-server-node-2   1350m (67%)    0Mi (0%)     95Mi (1%)         0Mi (0%)
rke2-server-node-3   1350m (67%)    0Mi (0%)     95Mi (1%)         0Mi (0%)

All server nodes are 2-CPU VMs and all agent nodes are 4-CPU VMs, by the way.

rke2-downstream2 has monitoring installed; test-rke2 does not.
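
For reference, the home-screen numbers can be roughly cross-checked against the raw pod specs. The snippet below is just a sanity check, not how Rancher computes its values: it sums the container CPU requests of all non-terminated pods in the cluster (nothing beyond kubectl and awk is assumed).

# Rough cross-check: total CPU requests of all non-terminated pods.
# Plain values like "1" (cores) are converted to millicores; "250m" is used as-is.
kubectl get pods -A --field-selector=status.phase!=Succeeded,status.phase!=Failed \
  -o jsonpath='{.items[*].spec.containers[*].resources.requests.cpu}' \
  | tr ' ' '\n' \
  | awk 'NF { if ($1 ~ /m$/) { sub(/m$/, "", $1); s += $1 } else { s += $1 * 1000 } }
         END { printf "%.0fm total CPU requested\n", s }'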

image image

erSitzt avatar Dec 29 '21 09:12 erSitzt

By the way, the issue in the rke2-downstream2 cluster existed even before I tried the upgrade to v1.21.7+rke2r2 via the Rancher UI.

erSitzt avatar Dec 29 '21 10:12 erSitzt

image image

Same issue with an AKS cluster on 1.21.7 and Rancher 2.6.3. Other clusters deployed with RKE, also on 1.21.7, are reporting correctly.

Yannis100 avatar Jan 04 '22 10:01 Yannis100

https://rancher-addreess.domain.com/v1/management.cattle.io.cluster is returning all zeroes for the clusters in question, so it's not a display issue...

image
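
The same aggregated values can also be read with kubectl against the Rancher local cluster. A sketch (the column paths are assumed from the clusters.management.cattle.io status fields, i.e. status.requested):

# List the per-cluster totals Rancher has stored; zero/empty values here point
# at a broken sync rather than a UI problem.
kubectl get clusters.management.cattle.io \
  -o custom-columns='ID:.metadata.name,NAME:.spec.displayName,CPU_REQ:.status.requested.cpu,MEM_REQ:.status.requested.memory,PODS:.status.requested.pods'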

erSitzt avatar Jan 04 '22 11:01 erSitzt

Deployed another RKE2 cluster and imported it... same result

erSitzt avatar Jan 06 '22 09:01 erSitzt

I recreated the RKE2 cluster that was reporting correct values (via Terraform), and now it is not reporting any CPU/memory/pod values at all.

It could be that my two working clusters were imported while Rancher was still on 2.6.2, whereas all the other clusters were imported after the update to 2.6.3, but I'm not 100% sure.

erSitzt avatar Jan 06 '22 11:01 erSitzt

image

Same issue with RKE1 for me. Working stats: cluster built with 2.6.2. Non-working stats: cluster built with 2.6.3.

semaforce-sean avatar Jan 07 '22 07:01 semaforce-sean

Seems to be getting weirder... my recreated cluster now started to report values... that seem a little off :)

image

erSitzt avatar Jan 07 '22 13:01 erSitzt

And I just compared the pod count of all clusters that are reporting values... none of them are correct.

Are those numbers filtered or averaged? Or do they exclude some "system" pods?
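
As a rough cross-check (not necessarily how Rancher counts), the number of pods that still occupy a slot on a node can be taken by excluding completed pods:

# Count pods in all namespaces that are not Succeeded/Failed.
kubectl get pods -A --field-selector=status.phase!=Succeeded,status.phase!=Failed \
  --no-headers | wc -l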

So these are my numbers on the home screen: image

And this is in the cluster itself: image

image image

erSitzt avatar Jan 07 '22 13:01 erSitzt

Hi,

We have the same problem: image

Our version is Rancher 2.6.3; we don't have the problem with 2.6.2. Monitoring is installed on the downstream cluster and not on the local one.

dtrouillet avatar Jan 13 '22 06:01 dtrouillet

Same issue. image

silentdark avatar Jan 13 '22 19:01 silentdark

We have the same problem. Reservations and limits are only shown for the local cluster. I suspect this change is the culprit: https://github.com/rancher/rancher/commit/3453a429bf4107dde095dfcf0256daf93ec6ffb3

It uses annotations on the v1 Node to detect the current limits and reservations instead of calculating them from the pods. These annotations do actually exist; however, they don't seem to get correctly synced to the management.cattle.io Node resource.

apiVersion: v1
kind: Node
metadata:
  name: master-server1
  annotations:
    ...
    management.cattle.io/pod-limits: '{"cpu":"300m","memory":"178Mi"}'
    management.cattle.io/pod-requests: '{"cpu":"1725m","memory":"2211Mi","pods":"19"}'
    ....
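
To check whether those annotations are present on a downstream node, something like the following can be used (a sketch; master-server1 is just the example node name from above, and jq is assumed to be installed):

# Read the request/limit annotations straight from the downstream v1 Node.
kubectl get node master-server1 -o json \
  | jq '.metadata.annotations
        | { requests: .["management.cattle.io/pod-requests"],
            limits:   .["management.cattle.io/pod-limits"] }'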

In my case, the limits and requested values on this resource are only populated for the local cluster:

apiVersion: management.cattle.io/v3
kind: Node
metadata:
  name: machine-laqe1
  namespace: c-m-1r131swx
  ...
status:
  limits:
    cpu: 120m
    memory: 148Mi
  requested:
    cpu: 745m
    memory: 243Mi
    pods: '21'
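
The synced values can be listed per downstream cluster on the Rancher local cluster like this (a sketch; c-m-1r131swx is the example cluster namespace from above, the status.requested/status.limits field names are assumed, and jq is assumed to be installed):

# Compare what actually got synced into the management Nodes of one cluster.
kubectl get nodes.management.cattle.io -n c-m-1r131swx -o json \
  | jq '.items[] | { name: .metadata.name,
                     requested: .status.requested,
                     limits: .status.limits }'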

WolfspiritM avatar Jan 14 '22 11:01 WolfspiritM

I have a similar issue, except it's not showing 0%, and the effect only appears in the cluster that was upgraded to 1.22 (the others are on 1.21). The Lens IDE shows all the values correctly.

Rancher:

image

vs Lens:

image

siegenthalerroger avatar Mar 04 '22 09:03 siegenthalerroger

Might be same as or related to https://github.com/rancher/rancher/issues/36229

dnoland1 avatar Mar 10 '22 23:03 dnoland1

This also happens with the AKS provider. In the image, the clusters that show counts were imported a while ago, while all clusters newly imported with Kubernetes v1.21.7 show wrong counts.

shot_220330_114154

fgielow avatar Mar 30 '22 14:03 fgielow

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

github-actions[bot] avatar May 30 '22 02:05 github-actions[bot]

remove-stale

imriss avatar Jun 03 '22 16:06 imriss

I guess commit 3453a429bf4107dde095dfcf0256daf93ec6ffb3 introduced the sync issue. Removing the management.cattle.io/nodesyncer annotation will trigger a forced sync per the nodesyncer logic:

CLUSTER_ID="<the cluster id>"
for machine in $(kubectl get nodes.management.cattle.io -n "$CLUSTER_ID" -o name --no-headers); do
  kubectl annotate -n "$CLUSTER_ID" "$machine" management.cattle.io/nodesyncer-
done
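
After the annotation is removed, the nodesyncer should repopulate the values. A quick way to confirm, assuming jq is available (the cluster-level totals are what the home screen reads):

# Check that the cluster-wide totals were updated after the forced sync.
kubectl get clusters.management.cattle.io "$CLUSTER_ID" -o json \
  | jq '{ requested: .status.requested, limits: .status.limits }'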

fengxx avatar Jun 08 '22 04:06 fengxx


We see the same issue. Memory requests are not shown.

image

ronnyaa avatar Aug 18 '22 13:08 ronnyaa


In our case, the memory value might currently be right, but the "max memory" is wrong. We have more than 23 GB of RAM.

image
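
The gauge's maximum should correspond to the summed node memory reported by the API (presumably the allocatable value, though that is an assumption). A quick way to see what the nodes themselves report:

# Per-node memory capacity and allocatable, for comparison with the gauge maximum.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.memory}{"\t"}{.status.allocatable.memory}{"\n"}{end}'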

libreo-abrettschneider avatar Oct 21 '22 15:10 libreo-abrettschneider

In our case, the memory value might currently be right, but the "max memory" is wrong. We have more than 23 GB of RAM.

image

Same here Screen Shot 2022-10-25 at 00 05 50

nilber avatar Oct 25 '22 03:10 nilber

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

github-actions[bot] avatar Dec 25 '22 01:12 github-actions[bot]

please reopen

erSitzt avatar Jan 09 '23 15:01 erSitzt

I get the same issue with k3s:

k3s --version
k3s version v1.27.7+k3s2 (575bce76)
go version go1.20.10

Can we reopen the bug? And how can it be fixed?

yodatak avatar Mar 17 '24 20:03 yodatak