
kubernetes_state.node.count does not get the node labels from K8s

[Open] alexbowers opened this issue 2 years ago · 6 comments

Describe the results you received: [screenshot: kubernetes_state.node.count is reported without node labels]

Describe the results you expected: I would expect kubernetes_state.node.count to carry the labels passed from the node, so that I can monitor the number of nodes within each node group.

Additional information you deem important (e.g. issue happens only occasionally): As you can see, kubernetes_state.node.age (and others) have the node-group name and other information that I want to use. [screenshot]

alexbowers · Jul 22 '22 11:07

hi @alexbowers

The fact that kubernetes_state.node.count is not tagged with node labels is the current expected behaviour: this metric is an aggregation (count) of nodes, so it cannot carry any individual node's labels. We currently aggregate the nodes by "kubelet_version", "container_runtime_version", "kernel_version", and "os_image". You can see the current implementation here.
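As a rough illustration (assuming the tags listed above are attached to the metric as described), you could group the node count by one of those aggregation tags, e.g.:

"sum:kubernetes_state.node.count{*} by {kubelet_version}"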

kubernetes_state.node.age is not an aggregation, because the age is reported for each individual node.

Please let us know if we can help in another way.

Thanks and regards, Cedric

clamoriniere · Aug 22 '22 13:08

Could you consider adding some way of defining specific node labels to be applied to the aggregate metric, so that, for example, we can aggregate by environment?

As it stands, the aggregation isn't useful to us at all, because it combines our staging, QA, and production environments and pollutes the data we're actually looking for.

If there were a way for us to say "include the env label in the aggregation", that would solve this problem for us.

alexbowers · Aug 22 '22 14:08

Hey, I had a similar issue and solved it by using the following query to get the node count per node group:

"sum:kubernetes_state.node.by_condition{kube_cluster_name:cluster-name,condition:ready,status:true} by {k8s-nodegroup}"

13013SwagR · Nov 21 '22 18:11

Great workaround, thanks @13013SwagR.

I just popped in to mention that this issue affected us as well: we have a number of monitors that aggregate kubernetes_state.node.count by aws_autoscaling_groupname, so the disappearance of this label was a fairly unwelcome surprise :(

drmaciej · Jan 16 '23 01:01

Hi @drmaciej

We now provide a set of "service checks" to represent the different "standard" Node conditions:

  • kubernetes_state.node.ready
  • kubernetes_state.node.out_of_disk
  • kubernetes_state.node.disk_pressure
  • kubernetes_state.node.network_unavailable
  • kubernetes_state.node.memory_pressure

See: https://docs.datadoghq.com/integrations/kubernetes_state_core/?tab=helm#service-checks

Because these service checks report a status for each node and are attached to the corresponding host, all of the host tags can be used to "group by" in the monitor.
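For example, a check monitor on node readiness could be grouped by a host tag such as aws_autoscaling_groupname (the tag mentioned above; this is an illustrative sketch, assuming that tag is present on your hosts and using Datadog's check-monitor query syntax):

"kubernetes_state.node.ready".over("*").by("aws_autoscaling_groupname").last(5).count_by_status()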

[screenshot]

clamoriniere · Jan 16 '23 18:01

Thanks @clamoriniere, that makes sense.

I actually do not see kubernetes_state.node.out_of_disk or kubernetes_state.node.network_unavailable in my environments (I do see the other three). Are those expected to show up only when a node is out of disk space or its network is unavailable?

drmaciej · Jan 17 '23 03:01