
Add to metrics the pod name which was evicted

Open mczimm opened this issue 2 years ago • 7 comments

Is your feature request related to a problem? Please describe. No

Describe the solution you'd like The Prometheus metrics should include the name of the application pod that was evicted.

Describe alternatives you've considered

What version of descheduler are you using?

descheduler version: v0.24.0

Additional context

mczimm avatar Aug 18 '22 13:08 mczimm

For the pods_evicted metric, I'm not sure what the use case is for this. Won't you just end up with a bunch of new time series (one for each pod name), all with a value of 1? For statefulsets or fixed-name pods that could be more useful, but I think for most users running generated pod names this will blow up their dashboards.

To me it would seem more useful to add labels like nodeName, or something that can actually be aggregated. If you specifically want to track which pods were evicted, it may be better to use a log parser, since we already provide that info in a structured format.

damemi avatar Aug 18 '22 17:08 damemi
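As a sketch of the aggregatable-label approach suggested above (stdlib-only Python; the `record_eviction` helper and the counter are hypothetical stand-ins, not the descheduler's actual Prometheus handler): with only bounded labels such as node and strategy, the series count stays fixed no matter how many pods are evicted.

```python
from collections import Counter

# Hypothetical stand-in for a Prometheus counter: each unique
# label combination corresponds to one time series.
series = Counter()

def record_eviction(node: str, strategy: str) -> None:
    # Only bounded labels (node, strategy) -- no pod name.
    series[(node, strategy)] += 1

# Evicting 1000 uniquely named pods still produces only a few
# series, because the pod name is never used as a label.
for i in range(1000):
    record_eviction(node=f"node-{i % 3}", strategy="PodTopologySpread")

print(len(series))                                   # 3 series total
print(series[("node-0", "PodTopologySpread")])       # 334 evictions on node-0
```

This mirrors the `sum by (node)`-style aggregation that bounded labels make possible in PromQL.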

Hi @damemi, I'll show you what I mean.

In the metrics I see this

descheduler_pods_evicted{cloud="cloud", cluster="cluster", container="descheduler", endpoint="http-metrics", instance="10.10.28.101:10258", job="descheduler", namespace="back", node="cl13khna1utejao014uh-oqar", pod="descheduler-64f4ccfdb7-52fn4", prometheus="vm/vm-victoria-metrics-k8s-stack", result="success", service="descheduler", strategy="PodTopologySpread"}

and with this info I can't tell which service pod was evicted. If I had the name of the evicted service pod, I could, for example, create an alert or a chart grouped by pod.

mczimm avatar Aug 19 '22 10:08 mczimm

Yeah, descheduler_pods_evicted is meant as an aggregated count of the total pods evicted. So right now it would show an overall value like

descheduler_pods_evicted{...} 123

or since it's already grouped by node:

descheduler_pods_evicted{...node="node-a"} 5
descheduler_pods_evicted{...node="node-b"} 12
descheduler_pods_evicted{...node="node-c"} 8

(or namespace)

But if you add pod names into that as a label, then in clusters that use generated pod names, you're going to get all unique time series like

descheduler_pods_evicted{...,pod="my-pod-f43x7t"} 1
descheduler_pods_evicted{...,pod="my-pod-e92nbf"} 1
descheduler_pods_evicted{...,pod="my-pod-aay0dk"} 1
descheduler_pods_evicted{...,pod="my-pod-pw890d"} 1
...

So I'm just confused about how you would handle that. Maybe if you could put together a PR with a sample demo of it in practice (like a screenshot), it would be clearer?

damemi avatar Aug 19 '22 13:08 damemi

See also: https://prometheus.io/docs/practices/naming/#labels

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

The way Prometheus works, each of these individual pod names effectively becomes a new metric, and they are all kept in memory (in the descheduler's Prometheus handler). So this would behave like a memory leak: the handler would constantly grow as it accumulates all of these single-point time series.

damemi avatar Aug 19 '22 14:08 damemi
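The cardinality concern described above can be sketched with a stdlib-only Python simulation (the counters here are hypothetical stand-ins, not the descheduler's actual code): labeling by generated pod name creates a brand-new time series on every eviction, while a bounded label like the node name does not.

```python
from collections import Counter

# Each unique label combination is one Prometheus time series.
pod_labeled = Counter()   # pod name included as a label (unbounded)
node_labeled = Counter()  # node name only (bounded by cluster size)

nodes = ["node-a", "node-b", "node-c"]

for i in range(500):
    pod = f"my-pod-{i:05x}"         # stand-in for a generated pod name
    node = nodes[i % len(nodes)]
    pod_labeled[(node, pod)] += 1   # a new series on every eviction
    node_labeled[(node,)] += 1      # reuses one of three existing series

print(len(pod_labeled))   # 500: one series per evicted pod, all kept in memory
print(len(node_labeled))  # 3: fixed, no matter how many evictions happen
```

The pod-labeled counter grows without bound over the descheduler's lifetime, which is exactly the memory-leak behavior the comment above warns about.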

[Screenshot attached: 2022-08-22 at 11:49:10]

mczimm avatar Aug 22 '22 08:08 mczimm

@mczimm thanks, yes I understand what you are asking for. See my comment above, where following the Prometheus docs this is probably not a good idea given the high cardinality of pod names.

damemi avatar Aug 22 '22 12:08 damemi

Hi @damemi. What about keeping not the unique pod name but the app name in the descheduler's Prometheus handler?

mczimm avatar Sep 13 '22 15:09 mczimm

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 12 '22 15:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 11 '23 16:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Feb 10 '23 17:02 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 10 '23 17:02 k8s-ci-robot