
failover: Moving away from TrafficSplit as the Core?

Open steve-gray opened this issue 3 years ago • 10 comments

I posted a thread in the Linkerd slack around this - https://linkerd.slack.com/archives/C89RTCWJF/p1646952488338639 - but I'll summarise it here. The thought I had was that it might be possible to align the gateway and proxy code and eliminate the need for a specific operator here, while at the same time blending in all of the benefits like EWRR.

The main gripe I've got is that TrafficSplit is a very basic construct, and potentially one that's ill-fitted to build this around. I had some ideas around slightly tweaking how multi-cluster services are published.

Proposed Approach

  • Allow marking a service with a linkerd annotation (similar to the "exported" flag). This would be something like linkerd.io/multicluster-service-id: foo.
  • Gateways would track "are there any endpoints for foo locally" and publish a boolean state of up/down to other gateways.
  • This multicluster-service-id would allow for alignment of services across namespaces (i.e. the source service and target service on each cluster don't need to share a name, namespace or anything; if their global ID is the same, the endpoints are mated, end of story).
  • The gateway in the "consumer" cluster would, upon seeing an incoming service from another cluster, publish a port on itself as a destination entry on the services with a matching global identifier (i.e. the destination controller consumes a stream of these from the gateways) and relay traffic to the remote instances.

In short, when a service is distributed in this fashion, the gateway registers itself as an endpoint member of the service. If the gateway goes down, or the remote publishes "no endpoints available", it'd unregister with destination. Then all of the magic like EWRR load balancing happens transparently. From a balancing perspective, Linkerd's built-in load balancing would blend traffic across this endpoint just as if it were another pod. It would inherently prefer-local and prefer-nearest due to the latencies implied at a per-request level by going off cluster, providing a graceful balance of traffic.
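To make the proposal concrete, here's a minimal sketch of what the annotation could look like. The annotation key is the one proposed above; everything else (service names, namespaces, ports) is hypothetical and just for illustration:

```yaml
# Cluster 1: the service happens to be deployed as "orders" in namespace "shop".
apiVersion: v1
kind: Service
metadata:
  name: orders
  namespace: shop
  annotations:
    # Proposed global identity: any service carrying the same value in any
    # cluster would be treated as the same logical service.
    linkerd.io/multicluster-service-id: orders-global
spec:
  selector:
    app: orders
  ports:
    - port: 8080
      protocol: TCP
---
# Cluster 2: the "same" service, deliberately under a different name and
# namespace; only the global ID has to match.
apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: commerce
  annotations:
    linkerd.io/multicluster-service-id: orders-global
spec:
  selector:
    app: order-service
  ports:
    - port: 8080
      protocol: TCP
```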

Balancing vs Failover

Now, with that core in place, there is room for a second new annotation on a service: linkerd.io/multicluster-mode. The options here would be:

  • balanced (default if unset) - The remotes are just treated like another pod target, and blended in.
  • failover - The remotes are only added to the destinations list in the consumer cluster if all local targets are unavailable.

Then if someone wants to do specific weighting of a service, they can use TrafficSplit in the way it was originally intended, and it's not semantically loaded up with being a failover solution. Multicluster exports would continue to operate in support of this, etc.

Scenarios

Now, let's look at some scenarios of how this might stack up configuration-wise.

Even Splits (50% local, 50% across remotes)

Balance evenly between local traffic and external cluster traffic.

  • Create service A locally for some pods.
  • Create service B locally, but referencing the multicluster-service-id of A remotely.
  • Add a TrafficSplit resource that balances across services A and B with weights (a sketch follows this list).
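A rough sketch of what that TrafficSplit could look like. The service names are hypothetical, and the exact apiVersion and weight format depend on which version of the SMI TrafficSplit CRD is installed:

```yaml
# Hypothetical 50/50 split: "svc-a" backs the local pods, "svc-b" is the
# local service that references A's multicluster-service-id remotely.
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: svc-a-split
  namespace: default
spec:
  service: svc-a        # apex service that clients address
  backends:
    - service: svc-a    # local endpoints
      weight: 50
    - service: svc-b    # remote endpoints via the gateway
      weight: 50
```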

Local First, Failover Second

All traffic local, unless local pods are dead - then fail over.

  • Create service A locally for pods.
  • Use service A's multicluster-service-id on the services on both ends of the gateway to publish the endpoints across clusters.
  • Set the local A definition to use multicluster-mode: failover (see the sketch after this list).
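Combining the two proposed annotations, the local definition of A might look something like this (names are hypothetical; the annotation keys are the ones proposed above, not anything Linkerd supports today):

```yaml
# Local definition of service A: shares a global ID with its remote
# counterpart, but only pulls in remote endpoints when no local
# endpoints are available.
apiVersion: v1
kind: Service
metadata:
  name: svc-a
  namespace: default
  annotations:
    linkerd.io/multicluster-service-id: svc-a-global
    linkerd.io/multicluster-mode: failover
spec:
  selector:
    app: svc-a
  ports:
    - port: 8080
      protocol: TCP
```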

Gaps

The one thing this doesn't solve for immediately is tiering/priority of failovers (i.e. local first, then clusters B/C second, then clusters D/E if the other three are all down). However, this could be approached with another annotation on the service/namespace that implies the failover priority of each cluster, with endpoint resolution only including a target if all lower levels are dead/empty.

This can evolve up to a CRD that guides/controls the process, similar to ServiceProfiles (or even, potentially, as part of a service profile).

steve-gray avatar Mar 12 '22 10:03 steve-gray

Hi @steve-gray! There are some pretty interesting ideas in here but I'd like to take a step back and get some more specifics on why you think TrafficSplits are ill fitted for building failover. More specifically, what things would you like to be able to do that are difficult or not possible with the current failover implementation?

The failover extension was designed in such a way that it is completely separate from multicluster, but the two can compose naturally. This was done very intentionally and has a lot of benefits that we'd like to hold on to, if possible.

One thing that I'm hearing here is the ability to load balance over multiple remote clusters. I think this is squarely a multicluster feature, rather than a failover one, but it's very interesting. Currently, the multicluster extension treats each remote cluster (each represented by a Link resource) independently but I could definitely see the possibility of adding a new type of Link resource which mirrors services from multiple remote clusters and creates a local mirror service which load balances between them.

If this sounds like it's what you need, I'd recommend moving this over to an issue in the Linkerd2 repo for how to add multiple remote cluster load balancing to the multicluster extension. Otherwise, we can continue the discussion here with more specifics on what functionality you need from failover that's currently missing.

adleong avatar Mar 17 '22 22:03 adleong

I raised it here because this seemed to be where people are talking about/working on this concept. I believe from the maintainer side on GitHub you can just move the issue to the other repo and preserve the context of it all without me copy-pasting stuff.

I'd like to take a step back and get some more specifics on why you think TrafficSplits are ill fitted for building failover

  • It's not a standard that is intended to do failover.
  • It has no semantic support for representing failover.
  • Everything this operator does is basically abusing the standard or forcing a non-standard interpretation over it (i.e. "we'll change the 0 weights to something else/ignore that zero in some circumstances"). Why use a standard CRD symbolically if it's not being used in a standard way? What if you actually want to down-rate a service to 0 for real because of a problem on that side, whilst leaving the split in place?
  • Using this approach, it is architecturally impossible within Linkerd today to have a true balance across local and even a single remote, never mind multiple clusters. If you did a split like this, A local / B remote at 50% each, the split would send 50% across to B regardless of whether or not that was a good idea, whether you had 50 pods on the remote or one that was on fire, and it skips past all of the smarts that make Linkerd's proxies great with regards to traffic management.

Some other corrections/clarifications:

  • This isn't about multiple remote clusters; everything I said is equally applicable with just two clusters.
  • No need for additional resource types/Link resources. This can be achieved entirely with a new annotation that specifies a service's "global" identity and informs Linkerd of the equivalence of two services across disjoint clusters (i.e. what if I deploy it as A in namespace B on cluster 1, but it's called C in namespace D on cluster 2? This idea solves for that).

The sum total of the work required:

  • Introduce a new "global service ID" annotation with the l5d prefix, placed on a Kubernetes Service.
  • Service mirror - When exporting a service across the mirroring/existing gateway, propagate that attribute. At that point the objects created by the mirror leave the consuming cluster with two services carrying the same attribute, which is the signal that "these are the same entity across clusters" (sketched below).
  • Destination - It already consumes service objects; now it just needs a small-ish change to identify these annotations and commingle the associated service endpoints into the destinations list for services with the same global ID. The commingling is conditional on that flag I mentioned, making the behaviour "always combine it into the list" or "only combine it into the list if the list is empty".
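For illustration, the mirrored object in the consuming cluster might look roughly like this. The mirror name and the usual mirror metadata are glossed over; the point is only that the proposed global-ID annotation has been propagated so destination can pair it with the local service:

```yaml
# Consumer cluster: the object created by the service mirror for the remote
# service (mirror naming and the usual mirror.linkerd.io metadata elided).
# The proposed global-ID annotation has been carried across, so destination
# can commingle its endpoints with those of the local "svc-a".
apiVersion: v1
kind: Service
metadata:
  name: svc-a-east          # hypothetical mirror name
  namespace: default
  annotations:
    linkerd.io/multicluster-service-id: svc-a-global
spec:
  ports:
    - port: 8080
      protocol: TCP
```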

What you get at this point is:

  • TrafficSplit can be just about A/B splitting traffic, as it was meant to be.
  • The ability to do failover from one cluster to another (mode: failover) or....
  • The ability to create one virtually distributed service across clusters, where the gateway provides mTLS inter-cluster. The bias will implicitly be against the remotes due to latency, but for long-lived/heavy operations that latency penalty will be nominal, leading to very effective balancing.
  • You can use TrafficSplit as it was originally intended, and just balance across a local service and export-mirrors of a service with fixed ratios, if that's what you really wanted to do.
  • You can actually use this feature to balance traffic inside one cluster. For example, if you had two versions of the same workload on different node groups and wanted to fail over between node groups in both directions, etc. (as it's not actually reliant, architecturally, on the Services being joined being gateway mirrors).

In terms of futures/down the road:

  • To support concepts like sequenced/prioritized failover, we could go deeper on annotations on services (simple, easy to do), i.e. we could do a service-failover-group=n, and the destinations would only merge into the list group by group as each prior/higher-priority group was depleted (a sketch follows this list).
  • We could also extend it out with a CRD to describe this, but all this becomes is contextual information for destination when it manages its list of things.
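As a purely illustrative sketch of that tiering idea (the annotation keys here are hypothetical, including the group key):

```yaml
# Hypothetical tiered failover: a group is only merged into the endpoint
# list once every lower-numbered group has no endpoints left.
apiVersion: v1
kind: Service
metadata:
  name: svc-a
  namespace: default
  annotations:
    linkerd.io/multicluster-service-id: svc-a-global
    linkerd.io/multicluster-mode: failover
    linkerd.io/multicluster-failover-group: "1"   # hypothetical key and value
spec:
  selector:
    app: svc-a
  ports:
    - port: 8080
      protocol: TCP
```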

Happy to cut code and help on this. This change would be a substantial improvement and step forward for Linkerd's suitability for large environments, as it eliminates a whole series of inter-cluster communication problems and makes people's environments better.

I think two annotation changes on a service resource and minor changes to destination represent a more positive outcome than the complexity of another Kubernetes operator, especially where the operator's main function is to facilitate the misuse of a standard.

steve-gray avatar Mar 18 '22 22:03 steve-gray

I don't think I agree with the premise here that using TrafficSplits to implement failover is unintended, non-standard, or an abuse of the standard. A TrafficSplit is a declarative representation of how traffic should be divided and it can be used to implement many different logical behaviors including A/B tests, canary, and failover.

It sounds like what you're really after is the ability to have Linkerd load balance across a union of services, some of which may be remote. This is ALMOST possible today: you can construct a Service and Endpoints object which contains any combination of local addresses and/or remote gateway addresses. The unfortunate wrinkle is that the metadata that describes how to communicate with the remote cluster, such as remote-svc-fq-name, is specified as annotations on the Endpoints object, which is a problem if a single Endpoints object contains a heterogeneous mix of local and distinct remote gateways.
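To make that wrinkle concrete, here's a rough, hand-built sketch (not something the extension generates) of an Endpoints object that mixes a local pod address with a remote gateway address; the annotation key prefix and the addresses are illustrative:

```yaml
# Hand-built "union" Endpoints object: one local pod address plus one
# remote gateway address behind the same Service.
apiVersion: v1
kind: Endpoints
metadata:
  name: svc-union
  namespace: default
  annotations:
    # The remote-cluster metadata applies to the whole object, so it cannot
    # say which of the addresses below it describes (key prefix illustrative).
    mirror.linkerd.io/remote-svc-fq-name: svc-a.default.svc.cluster.local
subsets:
  - addresses:
      - ip: 10.42.0.15      # local pod
      - ip: 192.0.2.10      # remote cluster's gateway (example address)
    ports:
      - port: 8080
        protocol: TCP
```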

What you're proposing here is a valid notion of failover, but it's quite different from the notion of failover that this extension implements. Of course, the beauty of the extension/controller pattern is that this controller was built entirely on top of existing primitives in Linkerd and SMI (traffic splits and service profiles) and did not require any changes to Linkerd itself. This means that it's always possible to build your own failover controller with your own notion of what failover means and how it behaves... so long as Linkerd has all the necessary primitives.

Therefore, I recommend opening an issue on the Linkerd2 repo describing what primitives you would need to build a failover controller that operates the way that you have in mind. I think something along the lines of service unions or having some way to mix local and/or remote endpoints (potentially from multiple different remote clusters) in the same service is a good place to start.

I'd hesitate to move this existing issue since there's a lot of discussion about TrafficSplits and the current failover implementation which I think would confuse the issue.

adleong avatar Mar 19 '22 00:03 adleong

I don't think I agree with the premise here that using TrafficSplits to implement failover is unintended, non-standard, or an abuse of the standard

This is an interesting position, but I feel it's unsupported by the wording and clarity of the SMI specification or the nature of the data model it permits. I've taken the time to sit down and go through all four versions in case you were referring to something I'd missed, but I'm unable to locate it. My initial read was that the SMI specifications are basic in their scope and focus. This has led to them being adopted in some cases within Linkerd, but at other times clearly discarded and bypassed, such as authorisation, which does have SMI standards available, but not ones that are workable.

I'm happy to be wrong - would you be able to point me to the relevant areas of the specification where it's implied that failover is one of the domains it's attempting to address? Barring that, it would seem prudent to consider failover an orthogonal problem; if anything, it needs to be something that happens in conjunction with splits of traffic, not in lieu of them. Sometimes a horse is just a horse, and even alpha4 of the spec is a poor framework for describing a failover domain and its constraints.

you would need to build a failover controller

This is an incorrect read of what I've said. In what I've proposed there is zero need for a specific controller - in fact, this would represent a substantial net delta downwards of code in Linkerd as a whole if you count this repository being EOL'd as an approach. By using annotations to guide the way destination stacks endpoints for "things that are the same service", you eliminate all of the complexity of an operator. This reads the annotations propagated by multicluster, but does not actually tie any of the core to multicluster in terms of coupling either - in fact, as I said above, this would enable some new capabilities even inside a single cluster for workloads segregated by node group/namespace.

The work I'm putting forward would represent potentially a minuscule percentage of the effort already vested into this feature/repo, and I'd like to invest the time so we can both arrive at a middle ground and pool the efforts, versus two competing approaches where the capabilities mirror each other. I'm naturally loath to open another issue to split this conversation up, because all of the parties will be the same - and if you disagree with this here, it's not going to get over the line on the other repo, so why expend the keystrokes and fragment the discussion.

One of the core problems with Linkerd multicluster adoption is that it does not have the same time-to-value as other features, and locking this capability in another extension of an extension complicates it. If there were a massive carrot, like proper failover support and global service distribution in the core and multicluster, then that's a big enough carrot to wear the pain of adopting multicluster itself.

steve-gray avatar Mar 19 '22 01:03 steve-gray

Not sure if this is the same discussion, but I think failover is a limiting view on this. The capability of doing a gradual service-by-service migration from cluster to cluster would be very valuable.

eelcoh avatar Apr 01 '22 16:04 eelcoh

That's not really a linkerd-failover issue; that's actually something where you can essentially use TrafficSplit as intended/designed. You just set up the split as per:

Initial Split
 - Local Instance - 100%
 - New Instance - 0%

Then update all your apps to talk to the service name of the split, instead of the raw services. Then you update the split to flip the percentages around, and boom, your traffic goes the other way. You can incrementally step it as you wish to test X% at a time etc.
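As a concrete sketch (names hypothetical; the apiVersion and weight format depend on the installed SMI TrafficSplit version), the initial state would be:

```yaml
# Initial state: all traffic stays local.
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: my-svc-migration
  namespace: default
spec:
  service: my-svc             # apex name the apps now point at
  backends:
    - service: my-svc         # local instance
      weight: 100
    - service: my-svc-remote  # exported/mirrored instance on the other cluster
      weight: 0
# To migrate, edit the weights step by step (e.g. 90/10, 50/50, 0/100)
# until all traffic flows to the other cluster.
```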

The question then becomes: if it's a one-off migration, why leave the TrafficSplit in place? They're useful long-term for A/B testing or blue-green deploys; however, if it's just a one-and-done migration and you've got no long-term prospect of failing back, you could just update service by service to reference the remote exported names.

steve-gray avatar Apr 02 '22 11:04 steve-gray

So one can use traffic split to send traffic to pods on another cluster without changing consuming services? Wasn't aware of that. Thanks.

eelcoh avatar Apr 02 '22 17:04 eelcoh

There's a lot written here--too much to reply to all of the points individually.

Generally: we chose to use TrafficSplit because it was cheap and low risk to implement. This failover extension didn't require changes to the core control plane or the proxy: we built on top of the composable building blocks that already existed in the system. Is this approach perfect? No, there are always tradeoffs.

But, I think it's important to point out: this approach satisfies the requirements we were given. We have users who want this functionality as it was built. We targeted the simplest approach that would solve their problems.

I agree that in the fullness of time, we should have richer primitives to control this type of policy. These primitives will almost certainly be decoupled from the SMI extension. But, having the failover extension as it exists today solves real problems today. It's intensely pragmatic.

this would represent a substantial net delta downwards of code in Linkerd as a whole

The work I'm putting forward would represent potentially a minuscule percentage of the effort already vested into this feature/repo,

I care a lot less about the total LOC and more about the complexity of each component. An extension allows us to add functionality purely by composition with minimal changes to existing infrastructure. Your proposal sounds like it would require, effectively, rewriting the destination controller. This is high-risk and, in my view, not the right time to do it--there will be many other considerations we want to include as we think through client-side configurations more holistically.

This is a trade-off. It's intentional. It's not permanent; but we think it's the right decision for the time being.


I think the most helpful path forward would be to table everything on the solution/implementation side of the discussion and focus on the deficiencies of abstract functionality.

If I've read the conversation properly, the main problem is that stochastic balancing is insufficient? Let's dig into that. What are the problems with a fixed traffic distribution?

olix0r avatar May 03 '22 15:05 olix0r

My main gripe with this is that it uses a resource only really intended for A/B testing and blue-green scenarios as a generic cudgel to support a much more nuanced and enterprise-oriented concept like failover and HA/DR, and uses arbitrary semantics over that resource (in some scenarios we make the 0 a 100 or vice versa) to try and effect an outcome that is at best a two-path failover model. It's not stochastic either, it's binary (here or there, nothing between) - unless we want to play semantics on that term.

I genuinely struggle to imagine any scenario for HA/DR/failover I've encountered in my professional career where the current failover implementation, or its reasonably foreseeable next steps, would have been a fit. If there are cases/documented users using it in earnest, I'd love to hear about them, because I was very much pushing to try and get value out of multicluster and kept coming up short. A small additional concern is that the move to fragment Linkerd into these operators for core functionality, or at least in this case a slight aversion to extending the core itself where it makes sense, is creating a complexity bomb and preventing uptake. Very specifically, multi-cluster as a whole has a horrible adoption-to-utility value curve.

When I consider the landscape for HA/DR, I generally think along a few different scenarios:

  • Site Bypass - One or more data centres/locations/clusters are having issues, and we want to throw traffic out the back of the network inter-site instead of processing locally (a site might also lose its comms entirely and go dark, or it could just have minor degradations). This operator works where N=2.
  • Hub and Spoke - Some services specifically run in some places, and other places talk to those hubs. MC today really supports only this model intrinsically.
  • Site Balancing - Process the workload as efficiently as possible, making best use of all equipment. This is the single area where Linkerd drives the most value for people in the real world, and it is not serviced by this approach.

With site balancing, there's the cost metric too of inter-site transport not being free, so the ability to opt into "run locally, unless you can't" or "run wherever is fastest"/"balance the world" would be great. There's clearly a validated need to prefer locality if available - which is why stuff like EndpointSlices/topology awareness is making its way through the K8s committees and gears, though obviously slowly.

I guess in short:

  • The core is already aware of endpoint slices, as it were.
  • What I proposed above (provide a metadata way to marry services up on either side of the gateway, and then add them to the balancing pools for a service) removes this operator's reason for existence, and provides support for all of the above scenarios.

In terms of end-user experience here, today:

  • Someone has N environments. They install Linkerd, things go vroom vroom faster.
  • They install multi-cluster
  • They install this failover operator
  • Spend an eternity configuring TrafficSplits per service
  • Change all application configurations to reference the splits.

Versus:

  • Someone has N environments. They install Linkerd and again, things go faster.
  • They install multi-cluster
  • They set one annotation on their service objects and automatically they get either HA or seamless cross-site balanced traffic from the first-hop of the mesh to the last (or automatic failover, if the specific service component is down locally).

It's a far more graceful, intuitive experience if you don't have to reconfigure your applications to take advantage of the mesh's capabilities. Also, the operator/MC generally sucks in terms of experience today because down-detection/handling of a target not being there is flat out nowhere near as nuanced/good as the in-proxy handling of a given pod in a local service being down, etc.

steve-gray avatar May 04 '22 20:05 steve-gray

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 03 '22 00:08 stale[bot]