KEP-3015: Node-level topology
- One-line PR description: new version of KEP-3015, replacing the old "`PreferLocal` traffic policy" idea with "node-level topology"
- Issue link: #3015
- Other comments: See previous discussion of the original `PreferLocal` idea in #3016. We agreed there that this would make more sense as topology than as traffic policy, hence this PR.

/sig network
/cc @robscott @andrewsykim @thockin
I have not fully read this revised proposal yet, but considering this and https://github.com/kubernetes/kubernetes/issues/110714 at the same time, maybe I am just wrong about this not being an ITP value. It certainly is the easiest API.
E.g. `internalTrafficPolicy: PreferLocal` -> prefer same node if possible, else prefer same zone, else cluster.
That handles the Service-side API (the service producer expressing how they want the service to be consumed), which still feels icky but maybe OK for special cases. I guess kube-proxy would consider that before even looking at hints?
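For concreteness, a minimal sketch of what that could look like, borrowing the DNS use case (hypothetical: `PreferLocal` is not an accepted value today; the field only accepts `Cluster` or `Local`):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  selector:
    k8s-app: kube-dns
  ports:
  - name: dns
    port: 53
    protocol: UDP
  # Hypothetical value: prefer same node, else same zone, else any endpoint.
  internalTrafficPolicy: PreferLocal
```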
Having written the KEP both ways now, it feels more topology-like than traffic-policy-like to me. Especially, you can't use the feature properly just by enabling it on the `v1.Service`; you have to also take steps to ensure that your endpoints are distributed in a useful way across the cluster, such that routing connections to local endpoints will actually be the right thing. (ie, you have to either deploy the endpoints as a DaemonSet so they're available everywhere, or you need to use selectors / affinity / taints to ensure the clients and endpoints end up together.)
OK, trying to summarize the discussion:
- We're still not totally in agreement about what the difference between "traffic policy" and "topology" is, and whether this would be better as the former or the latter. (But I can easily close this PR and reopen the `iTP: PreferLocal` one if we like that better.)
- We can possibly autodetect that node-local topology would be useful in the DNS case ("there's an endpoint deployed by DaemonSet to every single node"), but (a) there are probably no other cases that would be as easy to detect as that; (b) DNS is deployed by the admin or installer, so it's not a big deal to require manual configuration for that case anyway; and (c) there are other cases where the clients and servers are both restricted to a subset of nodes, where the feature would be useful if it was enabled but Kubernetes can't plausibly figure out that it would be useful to enable. So manual configuration is better than automatic, at least for now.
- This feature is part of a continuum that extends from "I want a Service available across multiple clusters, but traffic should stay within clusters if possible" (always true with multi-cluster services?), to "I want a Service available throughout a single cluster, but traffic should stay within zones if possible" (existing Topology-Aware Hints), to maybe "I want a Service on some/all nodes, but traffic should stay within subsets that I define if possible" (eg, "rack-local topology") (doesn't currently exist?), to "I want a Service on some/all nodes, but traffic should stay on the node it starts on if possible" (PreferLocal / Node-Level Topology).
- This feature is kind of like `internalTrafficPolicy: Local` and kind of not like `internalTrafficPolicy: Local`...
- The same continuum of cluster-level / zone-level / user-defined-level / node-level distinctions could also theoretically exist for `iTP: Local`-style services (ie changing "if possible" to "or fail" in all examples above).
So... (spitballing...) what if we added `service.Spec.TopologyLabels`, which is an array of labels, and if set, then kube-proxy will prefer that clients get routed to an endpoint on a node with the same value for all of those labels as the client's node. So that `topologyLabels: ["kubernetes.io/metadata.name"]` would imply node-level topology, and `topologyLabels: ["topology.kubernetes.io/zone"]` would be similar to `service.kubernetes.io/topology-aware-hints: Auto`. Or maybe rather than an array where all the labels have to match, it could be an array where first it tries to match the first label, and if it can't do that, then it falls back to trying to match the second label, etc. So then you can have "prefer local but fall back to zone".
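A minimal sketch of that hypothetical fallback-ordered variant (no such field exists; the label keys here are just illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
  - port: 80
  # Hypothetical field: prefer an endpoint on a node matching the client's
  # node on the first label; if none, fall back to the second; else any endpoint.
  topologyLabels:
  - "kubernetes.io/hostname"
  - "topology.kubernetes.io/zone"
```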
(Alternate: have a `Topology.k8s.io` type that defines a kind of topology, and then you can just say `service.Spec.Topology: ["Node", "Rack", "Zone"]` (or `topology: NodeWithZoneFallback`?) referring to the `Topology` objects that provide full definitions.)
And maybe `internalTrafficPolicy: Local` could become `internalTrafficPolicy: RequireTopology`, meaning: whatever the `TopologyLabels` say, that's a requirement rather than just a preference.
Relative to current Topology-Aware Hints, this loses some of the trying-to-balance-things stuff, so we'd need to incorporate that too...
> So... (spitballing...) what if we added `service.Spec.TopologyLabels` which is an array of labels, and if set, then kube-proxy will prefer that clients get routed to an endpoint on a node with the same value for all of those labels as the client's node. So that `topologyLabels: ["kubernetes.io/metadata.name"]` would imply node-level topology, and `topologyLabels: ["topology.kubernetes.io/zone"]` would be similar to `service.kubernetes.io/topology-aware-hints: Auto`. Or maybe rather than an array where all the labels have to match, it could be an array where first it tries to match the first label, and if it can't do that, then it falls back to trying to match the second label, etc. So then you can have "prefer local but fall back to zone".

For what it's worth, the very initial design of topology aware routing was like this, except the field was called `service.spec.topologyKeys`. We deleted this field in Alpha since in many cases the traffic is not safely distributed (hence the current version of topology aware routing).
The "prefer node local" case was one of the primary drivers of this since you can do something like:
```yaml
topologyKeys:
- "kubernetes.io/hostname"
- "*"
```
But we figured the "prefer node local" case was more of a special case that could be codified separately, which is why we created the `internalTrafficPolicy` field, with the assumption that most other cases for `topologyKeys` would be covered by topology aware routing automatically. At the time we thought topology aware routing would handle node-level topology too, but for good reason we opted out of that (at the time, topology aware routing used endpointslice subsetting, which meant we would create an endpointslice per node). So given that context, I think I'm still in favor of `internalTrafficPolicy: PreferLocal` if node-level topology requires significant architectural changes to topology-aware routing (can't comment on this though, as I'm not too familiar with topology-aware routing).
So for the node-level case it's easy to ensure that your endpoints are distributed correctly because you just use a DaemonSet. Maybe we need some way to easily configure other means of DaemonSet / Deployment distribution. (I think @thockin was talking about this somewhere.) Eg, a way to say "this Deployment must always have at least one endpoint in every zone". At that point, the Deployment/DaemonSet configuration could also be the opt-in for the Service-level topology; if you have a Deployment with the "deploy at least one endpoint to every zone" hint, then the endpoints controller can mark the EndpointSlices as having zone-level topology...
> So given that context, I think I'm still in favor of `internalTrafficPolicy: PreferLocal` if node-level topology requires significant architectural changes to topology-aware routing (can't comment on this though, as I'm not too familiar with topology-aware routing).
I'd agree with this approach. From a purely practical perspective, setting hints only makes sense when the value of the hint could be different than where the endpoint is. For example, with Topology Aware Hints, if there are 3 endpoints all in one zone, but nodes are equally distributed across 3 zones, each of those endpoints will be assigned to a different zone with a hint.
We could theoretically take a similar approach for preferNode. In an example where there are 3 endpoints on 1 node and 2 nodes without any endpoints, we could assign 1 endpoint to each node with hints. I don't think that approach really has any value though, and it certainly is not solving the DNS use case @danwinship identified here.
That means the only reasonable approach appears to be "if endpoint(s) exist on the same node, forward there, otherwise fall back to default routing across cluster." To me, that approach does not seem to gain any value from being tied to the hints name or architecture. Similarly, we could have the same approach for zone, but populating hints again feels unnecessary since they're not required for a proxy implementation to make an endpoint filtering decision.
> Maybe we need some way to easily configure other means of DaemonSet / Deployment distribution. (I think @thockin was talking about this somewhere.) Eg, a way to say "this Deployment must always have at least one endpoint in every zone". At that point, the Deployment/DaemonSet configuration could also be the opt-in for the Service-level topology; if you have a Deployment with the "deploy at least one endpoint to every zone" hint, then the endpoints controller can mark the EndpointSlices as having zone-level topology...
I'm not sure of the specifics here, but agree that we need to look into what's possible on the scheduling side. I know for topology aware hints, I need to spend some time talking with sig-scheduling and sig-autoscaling to see if we can try to provision new Pods with zone distribution taken into account.
So I feel like we don't have enough consensus here to move forward in 1.26? (We don't even have consensus on if it should be topology or traffic policy...)
> So I feel like we don't have enough consensus here to move forward in 1.26? (We don't even have consensus on if it should be topology or traffic policy...)
@danwinship I think prototyping both approaches could shed some light on which approach we should take. Maybe we can try to build a barebones implementation for both in the next couple of weeks and file an exception once we agree on the best approach?
At least for iTP, it should be pretty easy to implement `PreferLocal` since the field is already there and we can re-use the fallback logic from `ProxyTerminatingEndpoints`. Not sure how much work is involved for node topology. Happy to help with either one though.
Not sure prototyping would really help? We only have one use case, and for that use case, both approaches would yield identical results...
> At least for iTP, it should be pretty easy to implement `PreferLocal`
Yes, it's trivial. I think I already have it in a branch somewhere...
> We only have one use case, and for that use case, both approaches would yield identical results...
but the use case is the DNS one, as in "I have one endpoint per node, and fall back if it is not available on that node", no?

I personally don't think that you can generalise that use case in an API - well, not unless you solve the whole "policy", "topology", "traffic engineering", ... discussion 😄 For that specific scenario it works, but if you start adding combinations of endpoints and nodes 🤯

I will also challenge whether this use case is a deployment/orchestration problem rather than a networking problem: "I have a singleton on one node and I want to roll it back without disruption" 😈
I'd love to approve this for 1.26 and we can argue whether it is policy or topology via prototypes, but it seems PRR is not addressed yet, so my approval alone won't make a difference?
I started trying to address John's PRR comments, but without knowing what approach we are going to take, every answer ends up being "we will do this in the way that makes sense given the approach we end up choosing". (Eg, if we have automatically-detected topology, then we will need a lot more testing (and a lot more different kinds of tests) than if we have iTP:PreferLocal.)
I don't think we need to try to force this into 1.26, as long as we keep talking about it, rather than forgetting about it again until the week before 1.27 feature freeze. 🙂

> I don't think we need to try to force this into 1.26, as long as we keep talking about it, rather than forgetting about it again until the week before 1.27 feature freeze. 🙂
One thing to consider here though -- if we are promoting `ServiceInternalTrafficPolicy` to GA in v1.26, it might make sense to fold `iTP=PreferLocal` into that feature (if we choose that approach). We could introduce a new feature gate just for `iTP=PreferLocal`, but having a dedicated feature gate just to toggle the value of an existing GA field doesn't seem ideal.
I'm still not clear on why we would add another feature when we have an existing feature that solves the same use case. Is there something preventing its use in some cases? Are there other actual use cases for the more generalized feature?
While I would leave it to SIG Network to make that call (ie, this is not a blocker from a PRR perspective), I think it warrants more consideration.
@andrewsykim lol, I had originally proposed PreferLocal as a modification to the existing InternalTrafficPolicy KEP (#3010) but then closed that and replaced it with the separate KEP because, at that time, we thought we'd want `externalTrafficPolicy: PreferLocal` too, so it made more sense as a new KEP adding the value to both traffic policy fields. But then we decided it wasn't necessary and didn't make sense for eTP...

But anyway, I don't think we're ready to say right now that we want to add `internalTrafficPolicy: PreferLocal`, so it doesn't seem like something we can add to the iTP KEP as it's going to GA.
> so it doesn't seem like something we can add to the iTP KEP as it's going to GA.
I think we should reconsider GA-ing ITP in v1.26 if we are adding a new policy type (PreferLocal). I would want that to soak in Beta for at least one release. If we decide this behavior is better with node topology then I think we should GA in v1.26.
Our use-case as requested.
We were using

```yaml
topologyKeys:
- kubernetes.io/hostname
- '*'
```

as also mentioned here.
Our use case is video delivery.
Video chunks get delivered via URL, which hits Traefik, to two different pod layers. Traefik is the ingress manager, and to get everything working right, we create an ExternalName Service which directs to the correct service, with the topologyKeys. All services downstream used topologyKeys.

We could see in the network monitoring that all video delivery traffic was being confined to the same node. This setup was great in production, and we never had a case of outage for video delivery. On the rare occasion a pod would die on the same host where the video delivery was occurring, traffic would then be temporarily redistributed to another pod on the cluster until it came back up.
> So for the node-level case it's easy to ensure that your endpoints are distributed correctly because you just use a DaemonSet. Maybe we need some way to easily configure other means of DaemonSet / Deployment distribution. (I think @thockin was talking about this somewhere.) Eg, a way to say "this Deployment must always have at least one endpoint in every zone". At that point, the Deployment/DaemonSet configuration could also be the opt-in for the Service-level topology; if you have a Deployment with the "deploy at least one endpoint to every zone" hint, then the endpoints controller can mark the EndpointSlices as having zone-level topology...

So it was pointed out in the SIG meeting that this is basically describing Pod Topology Spread Constraints. So the answer here may be that we need to find a way to make Service endpoint selection interact with that feature.
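For reference, a minimal sketch of what the "at least one endpoint per zone" intent looks like with the existing `topologySpreadConstraints` API (the Deployment name, labels, and image here are just illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-backend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-backend
  template:
    metadata:
      labels:
        app: my-backend
    spec:
      # Keep replicas evenly spread across zones; with replicas >= number of
      # zones this roughly gives "at least one endpoint in every zone".
      topologySpreadConstraints:
      - topologyKey: topology.kubernetes.io/zone
        maxSkew: 1
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-backend
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9
```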
Also, after looking at the NodeLocal DNS Cache stuff, I realized that the DNS use case actually has two odd properties:
- you want DNS traffic to stay on the same node
- you may want a rule vaguely like `iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK` (except with the correct port, which may not be 53, and only applying to pod-to-CoreDNS traffic) to avoid conntrack entries for DNS, because they don't help, and they can hurt if you have lots of DNS traffic.
- (ok, I can't count, but this is sort of a future use case) if you're using GRO offload to improve pod-to-pod UDP throughput, you would want to disable it for (client-side) DNS, since for DNS you always want low latency more than high throughput. (AFAIK that feature is not currently per-connection-disable-able, but that might change in the future.)
So anyway, something that is specific to DNS may be a better way to optimize DNS than Node-level Topology is...
As for what's wrong with NodeLocal DNS Cache:
- it implements "require local" rather than "prefer local", making it more complicated to do a live upgrade of the DNS pods
- it assumes that traffic to `169.254.0.0/16` from a pod will be routed to the host network namespace, which is not guaranteed (and which would be unambiguously disallowed for some pod network topologies).
- it reinvents Kubernetes Services ("let's have a virtual IP that redirects to a pod IP") rather than trying to properly leverage Kubernetes Services.
So I would say that we probably should keep the idea of "NodeLocal DNS Cache" as its own feature, but it should be rebased on top of node-level topology in the future, rather than using the current link-local IP hack.
Hi folks,
I was forwarded multiple times to this KEP to discuss zone-level topology.
I raised the topic in the last SIG Network meeting (on 10.11). You can find the recording of the discussion at https://www.youtube.com/watch?v=MSaRwvYuAQc#t=40m25s (starts at 40:25).
@robscott, in the last SIG Network meeting you shared that `ServiceTopology` (`topologyKeys`) could be quite complicated to implement, having in mind that `ServiceTopology` allowed things to be configurable in too many ways. I can agree with you. We are definitely not asking to bring back `ServiceTopology`, and we are not looking for support for arbitrary topology keys. We are looking for `PreferZone` routing - if there are endpoints in the same zone, route to those, otherwise fall back to the default routing behaviour (or pick any endpoint). In the SIG Network meeting you mentioned that you can get behind this feature request. Do you have any implementation proposal (maybe outside of this KEP), or did you mean that you support this feature request without a concrete implementation proposal?
@thockin you shared that you want to hear from people why they want `PreferZone` as opposed to letting the automatic stuff work (`TopologyAwareHints`). I don't want to repeat myself - we listed many reasons in https://github.com/kubernetes/kubernetes/issues/113731 and https://github.com/kubernetes/kubernetes/issues/110714 why we cannot benefit from `TopologyAwareHints`. But if I need to list them again (and repeat myself):

- In our case the CPU allocation of zones does not model the set of clients which will access the Service. Our clients are well-known and the setup is much simpler. We use a Pod Topology Spread constraint, and usually the server runs with 3 replicas (spread evenly across 3 zones) and the client runs with 3 replicas as well (again spread evenly across 3 zones). We use cluster-autoscaler and we have no means to ensure that the CPU allocation of each zone will be equivalent. `TopologyAwareHints` simply fails and cannot handle this case.
- `TopologyAwareHints` falls apart for a Service with fewer than 10 Endpoints - something that was stated by @robscott in the last SIG Network meeting. You also shared "Hints do not work at low-endpoint counts."
- Hints are not deterministic. In one moment you can have hints assigned; with cluster-autoscaler running and load increasing, in the next moment you can have the hints removed. People need to have the option to enforce zone-level topology. People can leverage VerticalPodAutoscaler as a safeguard mechanism, and VerticalPodAutoscaler should take care to scale up an overloaded replica. Hence, we are not concerned about overloading a replica, as we use VerticalPodAutoscaler.
- ... (for more, see https://github.com/kubernetes/kubernetes/issues/113731 and https://github.com/kubernetes/kubernetes/issues/110714)
@danwinship I definitely don't want to hijack your KEP. I was just forwarded multiple times to this KEP as the right place to discuss the topic. Do you want to add the `PreferZone` option in this KEP? What are the still-open points/questions to tackle in this KEP?

As we talked about in the SIG Network meeting, there is Pod Topology Spread Constraints - a feature that allows you to spread your workload across a topology (node, zone, ...). It is an interesting idea to make the Service endpoint selection interact with that feature. @danwinship can you elaborate more on what you have in mind?

TL;DR: Cross-zone traffic is charged by cloud providers and comes with higher latency. I believe a lot of people are asking and will be asking for `PreferZone` - just search for TopologyAwareHints in the #sig-network channel; people are mainly complaining about it. Let me know if I can help somehow to move things forward.
Just to add on to that, I was directed here from #3642 where I was also asking for "prefer Zone/region" style topology service routing; I very much agree with everything @ialidzhikov said and would appreciate any movement we can make here / clarity around where we should be following up.
> @danwinship can you elaborate more on what you have in mind?
I haven't dug into all the details of that feature, but the vague idea was that you'd set some flag on the service to enable the "Service Topology Based On Pod Topology" feature, and then kube-proxy on Node A would try to only route traffic to an endpoint on Node B if Node A and Node B were in the same "topology domain" according to the endpoint's `pod.spec.topologySpreadConstraints`.

So, pretty similar to the old service `topologyKeys`, but taking advantage of the fact that the scheduler should be ensuring even load across topology domains for us, so we don't have to worry that we'll accidentally distribute service traffic in a pathologically unbalanced way.
@danwinship I like that idea a lot, couple of thoughts:

- Is it important that, if you set this at a cluster level using a scheduler configuration, the `topologyKeys` are missing from the pod specs? I'd still want the topology aware routing for pods scheduled with these defaults. Perhaps kube-proxy would use the `topologyKeys` specified in the `defaultConstraints` values of the scheduler config in the event that the pod doesn't have its own `topologySpreadConstraints` set explicitly? This would match the scheduling behaviour as well.
> then kube-proxy on Node A would try to only route traffic to an endpoint on Node B if Node A and Node B were in the same "topology domain" according to the endpoint's `pod.spec.topologySpreadConstraints`.
- To further clarify/expand - it feels like you'd want to ensure that the topologyKey it matches on is configurable per service? So maybe for Service A I want to ensure the endpoint is on the same node, Service B just cares about zone (maybe due to storage), and Service C just cares about region (due to latency). You then would likely want different failure options (fail open/shut). Drawing inspiration from the scheduling config, that would look something like `service.spec.topologyRoutingConstraints` (with all(?) the fields from the scheduler version). For example, this would "try local node, else zone, else fail":

```yaml
spec:
  topologyRoutingConstraints:
  - topologyKey: hostname
    whenUnsatisfiable: RouteAnyway
  - topologyKey: zone
    whenUnsatisfiable: DoNotRoute
```

Maybe we don't need a `DoNotSchedule` equivalent on the network side, but it's an interesting thought - e.g. for compliance reasons you don't want network traffic crossing zone/region borders, or maybe you just never want traffic to leave a node.
Whereas this would try "region", then just anything:

```yaml
spec:
  topologyRoutingConstraints:
  - topologyKey: region
    whenUnsatisfiable: RouteAnyway
```
- Feel like I should close #3642 against this - please let me know your thoughts @danwinship / @robscott / @thockin
> ```yaml
> spec:
>   topologyRoutingConstraints:
>   - topologyKey: hostname
>     whenUnsatisfiable: RouteAnyway
>   - topologyKey: zone
>     whenUnsatisfiable: DoNotRoute
> ```
No, I wasn't thinking that you'd specify the topology key in the Service. On the Service, you would just say `usePodTopology: true`. The actual topology rules would only be specified on the (endpoint) pods; the scheduler would use those rules (as it does now) to spread out the pods, and kube-proxy would use the same rules to decide whether a given client pod could get routed to a given endpoint pod or not.
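Very roughly, the Service side would be just a flag (everything here is hypothetical; `usePodTopology` does not exist and the name is a placeholder):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
  - port: 80
  # Hypothetical flag: derive routing topology from the endpoints'
  # pod.spec.topologySpreadConstraints rather than naming labels here.
  usePodTopology: true
```

The endpoint pods (via their Deployment's pod template) would carry the real `topologySpreadConstraints`, and kube-proxy would derive its routing preference from those.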
If we want to add the optional/required distinction (which I'm not sure we do; I feel like `internalTrafficPolicy: Local` is a very different thing from topology), then we'd have a distinction between `usePodTopology: IfPossible` and `usePodTopology: Always`, or something like that. (I spent like 5 seconds coming up with these names. We'd want to think about it a little bit longer than that.)
Oh I see, I guess your point is that there should never be a divergence between the scheduling topology constraints and the desired routing topology constraints, so we only have to specify them on the pods, not both? That makes sense...
> If we want to add the optional/required distinction (which I'm not sure we do; I feel like `internalTrafficPolicy: Local` is a very different thing from topology)

Maybe we can just use the constraint set on the pod as we are with the topology keys? E.g. as above, just match the behaviour set by the `whenUnsatisfiable` field.
> The actual topology rules would only be specified on the (endpoint) pods; the scheduler would use those rules (as it does now) to spread out the pods, and kube-proxy would use the same rules to decide whether a given client pod could get routed to a given endpoint pod or not.
This is an interesting idea, but I want to be sure that we're not requiring kube-proxy (or other dataplane implementations) to watch all pods. If we want to derive routing preferences from scheduling preferences, it seems like we'd likely need to add some subset of that to EndpointSlices. A concern I'd have is that scheduling preferences can be different for each Pod backing a Service, which could make this rather complicated to implement.
> I feel like internalTrafficPolicy: Local is a very different thing from topology

It seems like `InternalTrafficPolicy: Local` is roughly the same as a same-node topology spread?

```yaml
topologySpreadConstraints:
- topologyKey: hostname
  whenUnsatisfiable: DoNotSchedule
```
And the `PreferLocal` variation of that would just switch `whenUnsatisfiable` to `ScheduleAnyway`.
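In other words (the same informal sketch, just with the field flipped):

```yaml
topologySpreadConstraints:
- topologyKey: hostname
  whenUnsatisfiable: ScheduleAnyway
```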
> it seems like we'd likely need to add some subset of that to EndpointSlices
ah, right
> A concern I'd have is that scheduling preferences can be different for each Pod backing a Service

Yeah, "don't do that then". I think there's a strong expectation that `topologySpreadConstraints` would only be used in the pod template in a `Deployment` or the like, not on bespoke `Pod`s. And then presumably you'd use the same selector for the service as you used on the deployment. But anyway, we can just say that if every endpoint pod doesn't have the same `topologySpreadConstraints`, then you can't use `usePodTopology`.
> It seems like `InternalTrafficPolicy: Local` is roughly the same as a same-node topology spread?
I meant that "traffic will ideally stay on the same node" feels like topology to me, but "traffic is semantically required to stay on the same node" does not.
> I meant that "traffic will ideally stay on the same node" feels like topology to me, but "traffic is semantically required to stay on the same node" does not.
I think a lot of this discussion is focused on a bit of a gray area. I think what @ialidzhikov and @LAMRobinson are really asking for is similar to what was previously described as a "PreferLocal" option, but for zones. Essentially "Traffic is required to stay on the same node/zone unless there are no local endpoints". To me that still feels a bit closer to policy than topology, but I can see the argument either way. In terms of implementation, I'd imagine it would be near identical to our existing TrafficPolicy implementations though.