Try to spread HA pods across different nodes on single-zone deployments
Description
Add soft anti-affinity with topology key kubernetes.io/hostname and a weight of 50, so that zone anti-affinity still gets higher priority when multiple zones are available.
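For illustration, a minimal sketch of what the rendered anti-affinity block could look like with this change (the label selector and the zone weight are illustrative assumptions, not copied from the chart templates):

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      # New soft rule: prefer spreading replicas of the same control plane
      # service across nodes.
      - weight: 50
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: dapr-sentry   # illustrative; each service matches its own label
      # Existing soft rule: prefer spreading across zones, with a higher weight
      # so it still wins when multiple zones are available.
      - weight: 100
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app: dapr-sentry
```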
Issue reference
Please reference the issue this PR will close: https://github.com/dapr/dapr/issues/7665
Checklist
Please make sure you've completed the relevant tasks for this PR, out of the following list:
- [ ] Code compiles correctly
- [ ] Created/updated tests
- [ ] Unit tests passing
- [ ] End-to-end tests passing
- [ ] Extended the documentation / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#[issue number]
- [ ] Specification has been updated / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#[issue number]
- [ ] Provided sample for the feature / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#[issue number]
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 61.83%. Comparing base (cd2df90) to head (c1370dc). Report is 1 commit behind head on master.
:exclamation: Current head c1370dc differs from pull request most recent head 3a8877d
Please upload reports for the commit 3a8877d to get more accurate results.
Additional details and impacted files
@@ Coverage Diff @@
## master #7666 +/- ##
==========================================
+ Coverage 57.04% 61.83% +4.79%
==========================================
Files 480 245 -235
Lines 25982 22418 -3564
==========================================
- Hits 14822 13863 -959
+ Misses 9982 7393 -2589
+ Partials 1178 1162 -16
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@filintod, the kube-scheduler already scores based on node locality for a replica set. Setting this value seems to me like we are just adding more processing time to the scheduling process. Same with the existing zone affinity rule: do we actually need this?
@JoshVanL I was seeing pods for the same dapr system (i.e. dapr-sentry) running on the same nodes for some customers, though it could also be that there was only a single node; it still seems strange for prod systems. ~Where do you see this default node anti-affinity for replica sets? That was not expected in the past, but maybe nowadays~ I saw it in the code, but the score also takes into account the general load of the nodes, and we might really want to separate these pods onto different nodes, especially because we don't have guaranteed QoS defined for them.
In relation to the zone anti-affinity: that was already there, and it is a fairly common best practice for high availability, but more for stateful systems where you care about a whole zone going down, so it is probably not needed for all dapr systems (i.e. the stateless ones). On the other hand, if your load is spread across different zones you might also get lower latency talking to your in-zone application, though inter-zone latency should not be that big; it is just another consideration.
@filintod I think we should do some experimenting with also removing the zone affinity rules. The scheduler also takes this into account by default. By adding custom rules we are changing the default scoring, which might have unintended consequences and fight sane defaults.
I would have thought better pod priority would be the thing to focus on to ensure uptime of the control plane.
One thing is that you cannot say HA and have things running on the same node, and in some ways the same goes for multi-zone. I need to check more on how the scoring works nowadays to see how much weight is given to each.
I said uptime, not HA 🙂 I see these as two separate things: uptime is a subset property of achieving HA, but HA also incorporates the idea of replication.
The Dapr control plane is not sensitive to churn like an application serving business traffic and needing network failover would be. A single-replica Dapr control plane can handle plenty; it just needs to be up, somewhere, and needs to have a higher priority of being up than the consuming Dapr apps.
Yes, I meant that, in terms of dapr, you cannot say HA, as we have here https://github.com/dapr/dapr/blob/master/charts/dapr/values.yaml#L29, and people might expect fault tolerance to be part of it, i.e. ensuring pods are spread across different nodes/AZs, which the scheduler might not give as high a priority as load.
Uptime is for sure important, and we should find ways to raise the priority of these services so that when pods have to be kicked out they are not the first ones on the list, but that is probably separate from this PR.
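If we go the priority route, a minimal sketch of what that could look like (the class name and value are hypothetical, nothing the chart defines today):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dapr-control-plane   # hypothetical name
value: 1000000               # higher value = preempted/evicted later than lower-priority pods
globalDefault: false
description: "Keeps Dapr control plane pods scheduled ahead of, and evicted after, regular workloads."
```

The control plane deployments would then set `priorityClassName: dapr-control-plane` in their pod spec.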
btw, I actually thought I had seen the replica set node spread locality you mentioned, but it was actually something different: https://kubernetes.io/docs/reference/scheduling/config/. So let me know if you can point me to where you see it. With the default plugins you could get balance if all nodes have somewhat balanced allocatable load, but over time that is usually not feasible without some sort of rebalancing (there are tools for that, like the descheduler). If that is the case, we do need anti-affinity to tilt the balance toward not having all pods on the same node if possible.
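For reference, my reading of the defaults: when a pod has no explicit constraints, the PodTopologySpread plugin applies built-in soft spreading that is roughly equivalent to the following pod-level constraints (the label selector is illustrative; the defaults key off the pod's own service/controller labels):

```yaml
topologySpreadConstraints:
  # Soft (best-effort) spreading across nodes.
  - maxSkew: 3
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: dapr-sentry   # illustrative
  # Soft (best-effort) spreading across zones.
  - maxSkew: 5
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: dapr-sentry   # illustrative
```

With ScheduleAnyway these only affect scoring, so node load can still dominate, which is exactly the case where an explicit anti-affinity rule would tilt the balance.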
> I think we should do some experimenting with also removing the zone affinity rules. The scheduler also takes this into account by default. By adding custom rules we are changing the default scoring, which might have unintended consequences and fight sane defaults.
>
> I would have thought better pod priority would be the thing to focus on to ensure uptime of the control plane.
Yeah, it should not be done by pod affinity. Maybe we can add a check in the ready endpoint in the sidecar to ensure daprd is running.
This pull request has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!