Try to spread HA pods across different nodes on single-zone deployments
Description
Add soft anti-affinity with topology key kubernetes.io/hostname and a weight of 50, so that zone anti-affinity still gets higher priority when multiple zones are available.
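For illustration, a minimal sketch of what the rendered anti-affinity block could look like with this change (the label selector and the zone weight are illustrative assumptions, not copied from the chart templates):

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      # New soft rule: prefer spreading replicas of the same control plane
      # service across nodes.
      - weight: 50
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: dapr-sentry   # illustrative; each service matches its own label
      # Existing soft rule: prefer spreading across zones, with a higher weight
      # so it still wins when multiple zones are available.
      - weight: 100
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app: dapr-sentry
```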
Issue reference
Please reference the issue this PR will close: https://github.com/dapr/dapr/issues/7665
Checklist
Please make sure you've completed the relevant tasks for this PR, out of the following list:
- [ ] Code compiles correctly
- [ ] Created/updated tests
- [ ] Unit tests passing
- [ ] End-to-end tests passing
- [ ] Extended the documentation / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#[issue number]
- [ ] Specification has been updated / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#[issue number]
- [ ] Provided sample for the feature / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#[issue number]
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 61.83%. Comparing base (cd2df90) to head (c1370dc). Report is 1 commit behind head on master.
:exclamation: Current head c1370dc differs from pull request most recent head 3a8877d
Please upload reports for the commit 3a8877d to get more accurate results.
Additional details and impacted files
@@ Coverage Diff @@
## master #7666 +/- ##
==========================================
+ Coverage 57.04% 61.83% +4.79%
==========================================
Files 480 245 -235
Lines 25982 22418 -3564
==========================================
- Hits 14822 13863 -959
+ Misses 9982 7393 -2589
+ Partials 1178 1162 -16
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@filintod, the kube-scheduler already scores based on node locality for a replica set. Setting this value seems to me like we are just adding more processing time to the scheduling process. Same with the existing zone affinity rule: do we actually need this?
@JoshVanL I was seeing pods for the same dapr system (i.e. dapr-sentry) running on the same nodes for some customers, though it could also be that there was only a single node; it still seems strange for prod systems. ~Where do you see this default node anti-affinity for replica sets? That was not expected in the past, but maybe nowadays~ I saw it in the code, but the score also takes into account the general load of the nodes, and we might really want to separate these pods onto different nodes, especially because we don't have guaranteed QoS defined for them.
In relation to the zone anti-affinity: that was already there, and it is a fairly common best practice for high availability, but more for stateful systems where you care about a whole zone going down, so it is probably not needed for all dapr systems (i.e. the stateless ones). On the other hand, if your load is spread across different zones you might also get lower latency talking to your in-zone application, though inter-zone latency should not be that big; it is just another consideration.
@filintod I think we should do some experimenting with also removing the zone affinity rules. The scheduler also takes this into account by default. By adding custom rules we are changing the default scoring, which might have unintended consequences and fight sane defaults.
I would have thought better pod priority would be the thing to focus on to ensure uptime of the control plane.
One thing is that you cannot say HA and have things running on the same node, and in some ways the same goes for multi-zone. I need to check more on how the scoring works nowadays to see how much weight is given to each.
I said uptime, not HA 🙂 I see these as two separate things: uptime is a subset property of achieving HA, but HA also incorporates the idea of replication.
The Dapr control plane is not sensitive to churn like an application serving business traffic and needing network failover would be. A single-replica Dapr control plane can handle plenty; it just needs to be up, somewhere, and needs to have a higher priority of being up than the consuming Dapr apps.
Yes, I meant that, in terms of dapr, you cannot say HA, as we have here https://github.com/dapr/dapr/blob/master/charts/dapr/values.yaml#L29, and people might expect fault tolerance to be part of it, i.e. ensuring pods are spread across different nodes/AZs, which the scheduler might not give as high a priority as load.
Uptime is for sure important, and we should find ways to raise the priority of these services so that when pods have to be kicked out they are not the first ones on the list, but that is probably separate from this PR.
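If we go the priority route, a minimal sketch of what that could look like (the class name and value are hypothetical, nothing the chart defines today):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dapr-control-plane   # hypothetical name
value: 1000000               # higher value = preempted/evicted later than lower-priority pods
globalDefault: false
description: "Keeps Dapr control plane pods scheduled ahead of, and evicted after, regular workloads."
```

The control plane deployments would then set `priorityClassName: dapr-control-plane` in their pod spec.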
btw, I actually thought I had seen the replica set node spread locality you mentioned, but it was actually something different: https://kubernetes.io/docs/reference/scheduling/config/. So let me know if you can point me to where you see it. With the default plugins you could get balance if all nodes have somewhat balanced allocatable load, but over time that is usually not feasible without some sort of rebalancing (there are tools for that, like the descheduler). If that is the case, we do need anti-affinity to tilt the balance toward not having all pods on the same node if possible.
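For reference, my reading of the defaults: when a pod has no explicit constraints, the PodTopologySpread plugin applies built-in soft spreading that is roughly equivalent to the following pod-level constraints (the label selector is illustrative; the defaults key off the pod's own service/controller labels):

```yaml
topologySpreadConstraints:
  # Soft (best-effort) spreading across nodes.
  - maxSkew: 3
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: dapr-sentry   # illustrative
  # Soft (best-effort) spreading across zones.
  - maxSkew: 5
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: dapr-sentry   # illustrative
```

With ScheduleAnyway these only affect scoring, so node load can still dominate, which is exactly the case where an explicit anti-affinity rule would tilt the balance.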
> I think we should do some experimenting with also removing the zone affinity rules. The scheduler also takes this into account by default. By adding custom rules we are changing the default scoring, which might have unintended consequences and fight sane defaults.
>
> I would have thought better pod priority would be the thing to focus on to ensure uptime of the control plane.
Yeah, it should not be done by pod affinity. Maybe we can add a check in the ready endpoint in the sidecar to ensure daprd is running.
This pull request has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!