kube-state-metrics

feat: Add pod affinity/anti-affinity metrics for deployments

Open SoumyaRaikwar opened this issue 4 months ago • 21 comments

What this PR does / why we need it

Adds explicit rule-based pod affinity and anti-affinity metrics for deployments to provide granular visibility into Kubernetes scheduling constraints, addressing issue #2701.

Refactored from a count-based to an explicit rule-based approach following maintainer feedback, so that each affinity rule is exposed individually.

Which issue(s) this PR fixes

Fixes #2701

Metrics Added

  • kube_deployment_spec_affinity - Pod affinity and anti-affinity rules with granular labels

Labels provided:

  • affinity - podaffinity | podantiaffinity
  • type - requiredDuringSchedulingIgnoredDuringExecution | preferredDuringSchedulingIgnoredDuringExecution
  • topology_key - The topology key for the rule
  • label_selector - The formatted label selector string
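As a rough illustration of the label scheme above, the following Go sketch expands a deployment's affinity rules into one series per rule. It is not the code from this PR: the `AffinityTerm`, `AffinityRules` and `DeploymentAffinity` types are hypothetical, simplified stand-ins for the Kubernetes API types, the label selector is assumed to be formatted already, and the `namespace` and `deployment` labels are an assumption based on other deployment metrics.

```go
package main

import "fmt"

// AffinityTerm is a hypothetical, simplified stand-in for a pod affinity term.
// LabelSelector is assumed to be already formatted as a string.
type AffinityTerm struct {
	TopologyKey   string
	LabelSelector string
}

// AffinityRules groups the required and preferred terms of one affinity kind.
type AffinityRules struct {
	Required  []AffinityTerm
	Preferred []AffinityTerm
}

// DeploymentAffinity holds only the fields this sketch needs (hypothetical type).
type DeploymentAffinity struct {
	Namespace       string
	Name            string
	PodAffinity     *AffinityRules
	PodAntiAffinity *AffinityRules
}

const (
	typeRequired  = "requiredDuringSchedulingIgnoredDuringExecution"
	typePreferred = "preferredDuringSchedulingIgnoredDuringExecution"
)

// affinitySeries returns one exposition-format line per affinity rule,
// each with the constant value 1.
func affinitySeries(d DeploymentAffinity) []string {
	var out []string
	emit := func(affinity, ruleType string, terms []AffinityTerm) {
		for _, t := range terms {
			// %q quotes and escapes each label value.
			out = append(out, fmt.Sprintf(
				"kube_deployment_spec_affinity{namespace=%q,deployment=%q,affinity=%q,type=%q,topology_key=%q,label_selector=%q} 1",
				d.Namespace, d.Name, affinity, ruleType, t.TopologyKey, t.LabelSelector))
		}
	}
	if a := d.PodAffinity; a != nil {
		emit("podaffinity", typeRequired, a.Required)
		emit("podaffinity", typePreferred, a.Preferred)
	}
	if a := d.PodAntiAffinity; a != nil {
		emit("podantiaffinity", typeRequired, a.Required)
		emit("podantiaffinity", typePreferred, a.Preferred)
	}
	return out
}

func main() {
	d := DeploymentAffinity{
		Namespace: "default",
		Name:      "web",
		PodAntiAffinity: &AffinityRules{
			Required:  []AffinityTerm{{TopologyKey: "kubernetes.io/hostname", LabelSelector: "app=web"}},
			Preferred: []AffinityTerm{{TopologyKey: "topology.kubernetes.io/zone", LabelSelector: "app=web"}},
		},
	}
	for _, line := range affinitySeries(d) {
		fmt.Println(line)
	}
}
```

A deployment without any affinity rules yields no series in this sketch, which matters when querying for missing rules.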

SoumyaRaikwar avatar Aug 12 '25 21:08 SoumyaRaikwar

How would you use this metric for alerting and/or showing information about the deployment?

mrueg avatar Aug 13 '25 08:08 mrueg

How would you use this metric for alerting and/or showing information about the deployment?

These metrics enable alerting on weak scheduling constraints. For example, (kube_deployment_spec_pod_anti_affinity_preferred_rules > 0) and (kube_deployment_spec_pod_anti_affinity_required_rules == 0) fires when a deployment relies only on soft anti-affinity rules. The scheduler may leave such rules unsatisfied when no suitable node is available, so replicas can end up co-located and become a single point of failure.

They also help monitor missing protection: (kube_deployment_spec_pod_anti_affinity_required_rules == 0) and (kube_deployment_spec_pod_anti_affinity_preferred_rules == 0) identifies deployments without any anti-affinity rules.

For dashboards, you can visualize cluster-wide scheduling health with count(kube_deployment_spec_pod_anti_affinity_required_rules > 0) to show how many deployments have proper distribution protection.

During incidents, these metrics help correlate why workloads ended up co-located or why pods failed to schedule due to overly complex constraints.

This addresses #2701's core need: visibility into "preferred vs required" scheduling logic to maintain reliable workload distribution during cluster events. Thanks @mrueg for the question - these use cases demonstrate the operational value of these scheduling constraint metrics!

SoumyaRaikwar avatar Aug 13 '25 09:08 SoumyaRaikwar

/triage accepted
/assign @mrueg

CatherineF-dev avatar Aug 13 '25 17:08 CatherineF-dev

I think the metric should be explicit, something like:

kube_deployment_affinity{affinity="podaffinity",type="requiredDuringSchedulingIgnoredDuringExecution",topologyKey="foo",labelSelector="matchExpression foo in bar,baz"} 1

Then you can count over these series to get the desired result, and also see exactly what each specific affinity setting contains.
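To make the "count over these" idea concrete, here is a minimal Go sketch, under stated assumptions, of how per-rule series can be aggregated back into per-deployment counts, and how deployments with only preferred anti-affinity rules can be found. The `Series` type and both function names are hypothetical; in practice this aggregation would more likely be a query in the monitoring system than Go code.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	typeRequired  = "requiredDuringSchedulingIgnoredDuringExecution"
	typePreferred = "preferredDuringSchedulingIgnoredDuringExecution"
)

// Series is a hypothetical, reduced view of one explicit affinity series:
// only the labels needed for counting are kept.
type Series struct {
	Deployment string
	Affinity   string // "podaffinity" or "podantiaffinity"
	Type       string // typeRequired or typePreferred
}

// groupKey identifies one (deployment, affinity, type) group.
type groupKey struct {
	Deployment string
	Affinity   string
	Type       string
}

// countRules counts the series in each group, recovering the numbers
// that a count-based metric would have reported directly.
func countRules(series []Series) map[groupKey]int {
	counts := make(map[groupKey]int)
	for _, s := range series {
		counts[groupKey{s.Deployment, s.Affinity, s.Type}]++
	}
	return counts
}

// preferredOnlyAntiAffinity lists deployments that have at least one preferred
// anti-affinity rule and no required one, sorted by name.
func preferredOnlyAntiAffinity(series []Series) []string {
	hasRequired := make(map[string]bool)
	hasPreferred := make(map[string]bool)
	for _, s := range series {
		if s.Affinity != "podantiaffinity" {
			continue
		}
		switch s.Type {
		case typeRequired:
			hasRequired[s.Deployment] = true
		case typePreferred:
			hasPreferred[s.Deployment] = true
		}
	}
	var out []string
	for d := range hasPreferred {
		if !hasRequired[d] {
			out = append(out, d)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	series := []Series{
		{"web", "podantiaffinity", typePreferred},
		{"web", "podantiaffinity", typePreferred},
		{"db", "podantiaffinity", typeRequired},
		{"db", "podantiaffinity", typePreferred},
	}
	for k, n := range countRules(series) {
		fmt.Printf("%s %s %s: %d\n", k.Deployment, k.Affinity, k.Type, n)
	}
	fmt.Println("preferred-only:", preferredOnlyAntiAffinity(series))
}
```

One consequence of the explicit form: a deployment with no affinity rules produces no series at all, so detecting "no anti-affinity configured" requires comparing against another per-deployment metric rather than checking for a zero count.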

I'm not sure about the labelSelector at this point: whether it should be split into subtypes as well, or whether just calling https://github.com/kubernetes/apimachinery/blob/master/pkg/apis/meta/v1/helpers.go#L171 is enough.

mrueg avatar Aug 14 '25 18:08 mrueg

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SoumyaRaikwar
Once this PR has been reviewed and has the lgtm label, please ask for approval from mrueg. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Aug 14 '25 19:08 k8s-ci-robot

Thanks @mrueg for the feedback! I understand you're looking for more explicit metrics that expose individual affinity rule details rather than just counts.

You're absolutely right that explicit metrics would provide much more granular visibility. Instead of simple count metrics

SoumyaRaikwar avatar Aug 14 '25 19:08 SoumyaRaikwar

Hi @mrueg,

I've successfully refactored the implementation to use explicit rule-based metrics as you requested.

Key Changes:

  • Replaced 4 count-based metrics with a single kube_deployment_spec_affinity metric
  • Added granular labels for individual rule visibility and flexible querying
  • Used metav1.FormatLabelSelector() for consistent labelSelector formatting
  • Updated comprehensive tests and documentation

The new approach exposes each affinity rule as its own series, so cardinality grows only with the number of rules per deployment, and it follows the individual object-level data principle from the best practices document.

SoumyaRaikwar avatar Aug 14 '25 20:08 SoumyaRaikwar

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: SoumyaRaikwar / name: Soumya Raikwar (e06e7039cefe75ead4279d508621da020d039595)

/check-cla

SoumyaRaikwar avatar Aug 29 '25 07:08 SoumyaRaikwar

Hi @mrueg, When you have a moment, could you please review my recent PR?

SoumyaRaikwar avatar Sep 11 '25 06:09 SoumyaRaikwar

Hi @mrueg, When you have a moment, could you please review my PR?

SoumyaRaikwar avatar Sep 19 '25 11:09 SoumyaRaikwar

Hi @CatherineF-dev, @logicalhan, and @rexagod, could you please review this PR when you have a chance?

SoumyaRaikwar avatar Sep 20 '25 08:09 SoumyaRaikwar

Hi @mrueg, could you please review it?

SoumyaRaikwar avatar Sep 21 '25 21:09 SoumyaRaikwar

Hi @mrueg, I've restored the deleted kustomization.yaml files in examples/autosharding and examples/standard, and reverted the whitespace change in internal/store/deployment_test.go. Could you please take another look?

SoumyaRaikwar avatar Sep 21 '25 23:09 SoumyaRaikwar

@mrueg @CatherineF-dev, could you please review my PR?

SoumyaRaikwar avatar Sep 23 '25 19:09 SoumyaRaikwar

@rexagod, could you please review my PR?

SoumyaRaikwar avatar Sep 26 '25 00:09 SoumyaRaikwar

@mrueg, could you please review my PR when you have a chance?

SoumyaRaikwar avatar Oct 04 '25 00:10 SoumyaRaikwar

@mrueg, could you review my PR when you have a moment?

SoumyaRaikwar avatar Oct 10 '25 07:10 SoumyaRaikwar

@mrueg, please review my PR.

SoumyaRaikwar avatar Oct 12 '25 05:10 SoumyaRaikwar

@mrueg, I have resolved all the merge conflicts. Could you please review my PR?

SoumyaRaikwar avatar Oct 13 '25 12:10 SoumyaRaikwar

@mrueg, could you please review this PR?

SoumyaRaikwar avatar Nov 13 '25 12:11 SoumyaRaikwar