
Add to metrics the pod name which was evicted

Open mczimm opened this issue 2 years ago • 7 comments

Is your feature request related to a problem? Please describe. No

Describe the solution you'd like The Prometheus metrics should include the name of the application pod that was evicted.

Describe alternatives you've considered

What version of descheduler are you using?

descheduler version: v0.24.0

Additional context

mczimm avatar Aug 18 '22 13:08 mczimm

For the pods_evicted metric, I'm not sure what the use case is for this. Won't you just end up with a bunch of new time series (one for each pod name), all with a value of 1? For statefulsets or fixed-name pods that could be more useful, but I think for most users running generated pod names this will blow up their dashboards.

To me it would seem more useful to add labels like nodeName, or something that can actually be aggregated. If you specifically want to track which pods were evicted, it may be better to use a log parser, since we already provide that info in a structured format.

damemi avatar Aug 18 '22 17:08 damemi
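As a sketch of the aggregatable-label approach suggested above (stdlib-only Python; the `record_eviction` helper and the counter are hypothetical stand-ins, not the descheduler's actual Prometheus handler): with only bounded labels such as node and strategy, the series count stays fixed no matter how many pods are evicted.

```python
from collections import Counter

# Hypothetical stand-in for a Prometheus counter: each unique
# label combination corresponds to one time series.
series = Counter()

def record_eviction(node: str, strategy: str) -> None:
    # Only bounded labels (node, strategy) -- no pod name.
    series[(node, strategy)] += 1

# Evicting 1000 uniquely named pods still produces only a few
# series, because the pod name is never used as a label.
for i in range(1000):
    record_eviction(node=f"node-{i % 3}", strategy="PodTopologySpread")

print(len(series))                                   # 3 series total
print(series[("node-0", "PodTopologySpread")])       # 334 evictions on node-0
```

This mirrors the `sum by (node)`-style aggregation that bounded labels make possible in PromQL.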

Hi @damemi, I'll show you what I mean.

In the metrics I see this

descheduler_pods_evicted{cloud="cloud", cluster="cluster", container="descheduler", endpoint="http-metrics", instance="10.10.28.101:10258", job="descheduler", namespace="back", node="cl13khna1utejao014uh-oqar", pod="descheduler-64f4ccfdb7-52fn4", prometheus="vm/vm-victoria-metrics-k8s-stack", result="success", service="descheduler", strategy="PodTopologySpread"}

and with this info I can't tell which service pod was evicted. If I had the name of the evicted service pod, I could, for example, create an alert or a chart grouped by pod.

mczimm avatar Aug 19 '22 10:08 mczimm

Yeah, descheduler_pods_evicted is meant as an aggregated count of the total pods evicted. So right now it would show an overall value like

descheduler_pods_evicted{...} 123

or since it's already grouped by node:

descheduler_pods_evicted{...node="node-a"} 5
descheduler_pods_evicted{...node="node-b"} 12
descheduler_pods_evicted{...node="node-c"} 8

(or namespace)

But if you add pod names into that as a label, then in clusters that use generated pod names, you're going to get all unique time series like

descheduler_pods_evicted{...,pod="my-pod-f43x7t"} 1
descheduler_pods_evicted{...,pod="my-pod-e92nbf"} 1
descheduler_pods_evicted{...,pod="my-pod-aay0dk"} 1
descheduler_pods_evicted{...,pod="my-pod-pw890d"} 1
...

So I'm just confused about how you would handle that. Maybe if you could put together a PR with a sample demo of it in practice (like a screenshot), it would be clearer?

damemi avatar Aug 19 '22 13:08 damemi

See also: https://prometheus.io/docs/practices/naming/#labels

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

The way Prometheus works, each of these individual pod names effectively becomes a new metric, and they are all kept in memory (in the descheduler's Prometheus handler). So this would behave like a memory leak: the handler would constantly grow as it accumulates all of these single-point time series.

damemi avatar Aug 19 '22 14:08 damemi
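The cardinality concern described above can be sketched with a stdlib-only Python simulation (the counters here are hypothetical stand-ins, not the descheduler's actual code): labeling by generated pod name creates a brand-new time series on every eviction, while a bounded label like the node name does not.

```python
from collections import Counter

# Each unique label combination is one Prometheus time series.
pod_labeled = Counter()   # pod name included as a label (unbounded)
node_labeled = Counter()  # node name only (bounded by cluster size)

nodes = ["node-a", "node-b", "node-c"]

for i in range(500):
    pod = f"my-pod-{i:05x}"         # stand-in for a generated pod name
    node = nodes[i % len(nodes)]
    pod_labeled[(node, pod)] += 1   # a new series on every eviction
    node_labeled[(node,)] += 1      # reuses one of three existing series

print(len(pod_labeled))   # 500: one series per evicted pod, all kept in memory
print(len(node_labeled))  # 3: fixed, no matter how many evictions happen
```

The pod-labeled counter grows without bound over the descheduler's lifetime, which is exactly the memory-leak behavior the comment above warns about.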

[Screenshot attached: 2022-08-22 at 11:49:10]

mczimm avatar Aug 22 '22 08:08 mczimm

@mczimm thanks, yes I understand what you are asking for. See my comment above, where following the Prometheus docs this is probably not a good idea given the high cardinality of pod names.

damemi avatar Aug 22 '22 12:08 damemi

Hi @damemi. What about keeping not the unique pod name but the app name in the descheduler's Prometheus handler?

mczimm avatar Sep 13 '22 15:09 mczimm

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 12 '22 15:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 11 '23 16:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Feb 10 '23 17:02 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 10 '23 17:02 k8s-ci-robot