kube-prometheus
Replace the podAntiAffinity addon with a topologySpreadConstraints addon?
What is missing?
I stumbled upon this new recommendation while reading the karpenter.sh documentation:
Note: Don't use podAffinity and podAntiAffinity to schedule pods on the same or different nodes as other pods. Kubernetes SIG scalability recommends against these features and Karpenter doesn't support them. Instead, the Karpenter project recommends topologySpreadConstraints to reduce blast radius and nodeSelectors and taints to implement colocation.
More about the motivations in the official documentation: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#comparison-with-podaffinity-podantiaffinity
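For reference, a minimal sketch of what the linked documentation describes, expressed as the Jsonnet object that would end up in a pod spec. The field names follow the Kubernetes API; the label value `alertmanager` is purely illustrative:

```jsonnet
{
  // Spread matching pods evenly across nodes, tolerating a skew of at most 1.
  topologySpreadConstraints: [{
    maxSkew: 1,
    topologyKey: 'kubernetes.io/hostname',
    // 'DoNotSchedule' makes this a hard constraint, comparable to
    // requiredDuringScheduling anti-affinity; 'ScheduleAnyway' is the soft variant.
    whenUnsatisfiable: 'ScheduleAnyway',
    labelSelector: {
      matchLabels: { 'app.kubernetes.io/name': 'alertmanager' },
    },
  }],
}
```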
It may not be worth pursuing improvements to the podAntiAffinity addon like #1090.
Thoughts?
Why do we need it?
Follow ever evolving Kubernetes best practices
Environment
- kube-prometheus version: Insert Git SHA here
Anything else we need to know?:
anti-affinity addon: https://github.com/prometheus-operator/kube-prometheus/blob/6d013d4e4f980ba99cfdafa9432819d484e2f829/jsonnet/kube-prometheus/addons/anti-affinity.libsonnet
Can you share more about the SIG scalability requirements here? I cannot find much info related to the use of affinity vs topologySpreadConstraints. The only thing I found is the KEP introducing the feature and a note that affinity and topologySpreadConstraints can work together.
From what I see, topologySpreadConstraints matter more when you want multiple constraints across multiple failure domains (something not easily possible with affinity). However, in our case we use only one failure domain (node), which raises the question: what is the benefit of switching to a different solution?
Don't get me wrong, I like the idea of using topologySpreadConstraints, but I just want to know more ;)
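For context on the node-only case being discussed, a soft node-level anti-affinity rule looks roughly like this in a pod spec. This is a sketch using standard Kubernetes API fields, not the addon's literal output, and the selector is illustrative:

```jsonnet
{
  affinity: {
    podAntiAffinity: {
      // Soft rule: prefer not to co-locate pods carrying the same label
      // on the same node, but schedule anyway if no other node fits.
      preferredDuringSchedulingIgnoredDuringExecution: [{
        weight: 100,
        podAffinityTerm: {
          topologyKey: 'kubernetes.io/hostname',
          labelSelector: {
            matchLabels: { 'app.kubernetes.io/name': 'alertmanager' },
          },
        },
      }],
    },
  },
}
```

With `topologyKey` fixed to the node, a topologySpreadConstraint with `maxSkew: 1` and `whenUnsatisfiable: 'ScheduleAnyway'` expresses roughly the same intent, which is why the single-failure-domain benefit is a fair question.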
I am at the same point as you on all that. Reading the KEP further, PodAntiAffinity seems to cause them issues, and they are even thinking about deprecating it in the linked issue:
Currently PodAntiAffinity supports arbitrary topology domain, but sadly this causes a slow down in scheduling (see Rethink pod affinity/anti-affinity). We're evaluating solutions such as limit topology domain to node, or internally implement a fast/slow path handling that. If this KEP gets implemented, we can simply achieve the semantics of "PodAntiAffinity in zones" via a combination of "Even pods spreading in zones" plus "PodAntiAffinity in nodes" which could be an extra benefit of this KEP.
From this, it seems to be more about how it works internally, rather than the potential gain in flexibility for the users. My guess is that when the Karpenter documentation refers to the "Kubernetes SIG scalability" recommendation, it is not about the scalability of your deployment, but the scalability of your Kubernetes scheduler.
they are even thinking about deprecating it in the linked issue
I'm not sure I see anything concrete in that issue which states that there will be any deprecation to affinity or anti-affinity. It is an almost 3 year old issue.
In fact I see various related enhancements landing in 1.22 https://github.com/kubernetes/enhancements/issues/2249
To be clear, I am not saying we shouldn't consider this or it is a bad idea. I just want to be sure we are not being motivated by the wrong source.
I got the Karpenter documentation clarified aws/karpenter#948
So in short, pod anti-affinities are a strain on the Kubernetes scheduler at the moment.
There does not seem to be a clear decision on whether they will be deprecated or improved, but topologySpreadConstraints are available as an alternative.
I have finally found the literal recommendation:
Note: Inter-pod affinity and anti-affinity require substantial amount of processing which can slow down scheduling in large clusters significantly. We do not recommend using them in clusters larger than several hundred nodes.
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity
Now the questions for kube-prometheus:
- Are we concerned about the Kubernetes Scheduler performance?
- Do we want a topologySpreadConstraints addon?
- Do we want to remove the podAntiAffinity addon if doing 2.?
Are we concerned about the Kubernetes Scheduler performance?
I would say yes. If we can improve our users' environments by using a better performing feature, we should use it.
Do we want a topologySpreadConstraints addon?
I vote yes.
Do we want to remove the podAntiAffinity addon if doing 2.?
No, I think this would break too many workflows. We can, however, put a trace notice in the podAntiAffinity addon to advertise the new topologySpreadConstraints addon. Additionally, we would need better documentation about topology spread and the usage of podAntiAffinity as well as topologySpreadConstraints, with an explanation of why not to use podAntiAffinity.
By trace notice I mean something like this: https://github.com/prometheus-operator/kube-prometheus/compare/main...paulfantom:trace-notice?expand=1
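The general idea behind such a notice can be sketched with Jsonnet's std.trace(), which prints a message to stderr at evaluation time and returns its second argument unchanged. This is only an illustration of the mechanism, not the contents of the linked branch; the field names are hypothetical:

```jsonnet
// Sketch: wrap the old addon's output so importing it emits a warning
// without changing the rendered manifests.
local deprecated(obj) =
  std.trace(
    'the anti-affinity addon is deprecated; consider the topologySpreadConstraints addon',
    obj
  );

deprecated({ example: 'addon contents unchanged' })
```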
cc @simonpasquier @PhilipGough you might be interested in this as AFAIR CMO is using anti-affinity heavily.
I have no issues with us adding support for the addon while retaining support for inter pod affinity.
Yes, CMO is using podAntiAffinity for both Alertmanager and the platform Prometheus. We have deployed on very large clusters, but I am not aware of any issues with scheduling performance.
I'd be curious to learn if the impact would be more substantial for high-churn workloads, as opposed to relatively static platform infra. Or if adding podAntiAffinity, podAffinity to a workload affects overall scheduling performance across the cluster in general or is just limited to those workloads.
@paulfantom sounds good, thank you for pointing to std.trace(), I had been wondering how to output things like deprecation warning in Jsonnet for some time.
I am going to experiment with topologySpreadConstraints in my clusters, and I will try to contribute an addon later
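A rough sketch of what such an addon could look like, modeled loosely on the usual kube-prometheus addon pattern of patching the Alertmanager and Prometheus CRD specs (the prometheus-operator CRDs expose a topologySpreadConstraints field). The field paths and values here are assumptions for illustration, not the project's actual API:

```jsonnet
// Hypothetical topology-spread.libsonnet addon sketch: spread Alertmanager
// and Prometheus pods across nodes using a soft constraint.
local constraintFor(name) = [{
  maxSkew: 1,
  topologyKey: 'kubernetes.io/hostname',
  whenUnsatisfiable: 'ScheduleAnyway',
  labelSelector: { matchLabels: { 'app.kubernetes.io/name': name } },
}];

{
  alertmanager+: { alertmanager+: { spec+: {
    topologySpreadConstraints: constraintFor('alertmanager'),
  } } },
  prometheus+: { prometheus+: { spec+: {
    topologySpreadConstraints: constraintFor('prometheus'),
  } } },
}
```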
This seems to be already implemented: https://prometheus-operator.dev/docs/operator/api/
Should this issue be closed?
Karpenter implemented enough pod anti-affinity support to satisfy the Prometheus use case. It was released in v0.9.0.
@migueleliasweb I think that this feature request is still valid because the ask is to implement a jsonnet addon that uses topologySpreadConstraints and this doesn't exist yet (though the prometheus-operator CRDs support it).