Locality Based Routing Support

Open tanujd11 opened this issue 2 years ago • 10 comments

Description: Implement locality-based routing support by default in EG. Now that we can have individual endpoints as backends to EG, can we support region/zone/subzone-based routing based on EndpointSlice information, node labels, etc.?

tanujd11 avatar Sep 27 '23 17:09 tanujd11

Hey @tanujd11, from a user perspective can you share what you would like to happen on the data plane (from the gateway to multiple backend endpoints with different topology info)?

arkodg avatar Sep 27 '23 18:09 arkodg

I understand this is very useful for optimizing east-west traffic within a cluster. Is that also the case for north-south?

arkodg avatar Sep 27 '23 18:09 arkodg

I think an Envoy gateway running in us-east-1/us-east-1a should prefer backends in the same zone to prevent cross-zone traffic. This behaviour could be made the default, as cross-zone communication is obviously costly. WDYT?

tanujd11 avatar Sep 27 '23 18:09 tanujd11

thanks, here's something more to think about

  • opt in or default - K8s Service has an opt-in annotation, service.kubernetes.io/topology-mode: https://kubernetes.io/docs/concepts/services-networking/topology-aware-routing/#enabling-topology-aware-routing
  • Is this a soft preference (weighted) or a hard preference (priority)?

arkodg avatar Sep 27 '23 21:09 arkodg

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Oct 28 '23 00:10 github-actions[bot]

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Dec 02 '23 16:12 github-actions[bot]

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Jan 22 '24 00:01 github-actions[bot]

there's a new field in the Service spec (trafficDistribution, with the value PreferClose) https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution that we could consider using to automate priority amongst endpoints within a Service

arkodg avatar May 23 '24 01:05 arkodg

there's a new field in the Service spec (trafficDistribution, with the value PreferClose) https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution that we could consider using to automate priority amongst endpoints within a Service

Could be an option once this new field is stable and the corresponding K8s version is widely adopted.

Before that, IMO it's better to do load balancing across endpoints in the cluster via Envoy's capabilities.

Currently EG has implemented locality weighted load balancing ^1; each BackendRef is translated to one LocalityLbEndpoints.

locality := &endpointv3.LocalityLbEndpoints{
	Locality: &corev3.Locality{
		Region: fmt.Sprintf("%s/backend/%d", clusterName, i),
	},
	LbEndpoints: endpoints,
	Priority:    0,
}
  
// Set locality weight
var weight uint32
if ds.Weight != nil {
	weight = *ds.Weight
} else {
	weight = 1
}

However, the endpoints inside one LocalityLbEndpoints may be running in different zones, so cross-zone cost can't be saved this way.


With Envoy's capabilities, priority levels ^2 or zone aware routing ^3 ^4 can achieve the goal of saving cross-zone cost.

priority levels

  1. Each backend endpoint should be set with the correct zone; it can be retrieved from the EndpointSlice, which inherits it from the Node's topology.kubernetes.io/zone label.
  2. Envoy should be started with the --service-zone command-line option, which tells Envoy which zone its Pod is running in.
  3. EG rearranges EDS resources for each Envoy: if Envoy and the backend endpoint are in the same zone, the priority is 0, otherwise 1.
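
As a rough illustration of step 3, here is a minimal Go sketch. The Endpoint type and the AssignPriorities function are hypothetical stand-ins, not EG or go-control-plane APIs. It also assumes that when no endpoint shares the proxy's zone, everything should be promoted to priority 0, since Envoy expects priorities to start at 0 without gaps.

```go
package main

import "fmt"

// Endpoint is a hypothetical, simplified stand-in for an EDS endpoint.
type Endpoint struct {
	Address string
	Zone    string // e.g. taken from the EndpointSlice zone field
}

// AssignPriorities splits endpoints into priority 0 (same zone as the proxy)
// and priority 1 (any other zone). If the proxy zone is unknown, or no
// endpoint is local, all endpoints end up at priority 0.
func AssignPriorities(proxyZone string, eps []Endpoint) map[uint32][]Endpoint {
	out := map[uint32][]Endpoint{}
	for _, ep := range eps {
		p := uint32(1)
		if proxyZone == "" || ep.Zone == proxyZone {
			p = 0
		}
		out[p] = append(out[p], ep)
	}
	// Avoid a priority set that starts at 1: promote remote endpoints
	// when there is nothing local to prefer.
	if len(out[0]) == 0 && len(out[1]) > 0 {
		out[0] = out[1]
		delete(out, 1)
	}
	return out
}

func main() {
	eps := []Endpoint{
		{Address: "10.0.0.1", Zone: "us-east-1a"},
		{Address: "10.0.0.2", Zone: "us-east-1b"},
	}
	for p, group := range AssignPriorities("us-east-1a", eps) {
		fmt.Println("priority", p, group)
	}
}
```

The result depends on the proxy's zone, which is why this approach needs a per-Envoy (or at least per-zone) view of EDS.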

zone aware routing

This approach is mutually exclusive with locality weighted load balancing, since in the case of locality aware LB, we rely on the management server to provide the locality weighting, rather than the Envoy-side heuristics used in zone aware routing.

  1. Each backend endpoint should be set with the correct zone; it can be retrieved from the EndpointSlice, which inherits it from the Node's topology.kubernetes.io/zone label.
  2. Envoy should be started with the --service-zone command-line option, whose value is the zone the Envoy Pod is running in.
  3. Envoy's bootstrap config should set cluster_manager.local_cluster_name, which identifies the fleet the Envoy Pod belongs to; it will be the irKey in the implementation.
  4. Add the cluster corresponding to cluster_manager.local_cluster_name to CDS resources.
  5. Design a mechanism to discover the Envoy Pods belonging to cluster_manager.local_cluster_name as endpoints and add them to EDS resources.
  6. Neither the Envoy (local) cluster nor the backend cluster is in panic mode ^5.
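
Steps 1 and 5 both come down to grouping endpoints by zone, once for the backend cluster and once for the Envoy fleet itself. A minimal Go sketch of that grouping follows; Endpoint, LocalityGroup and GroupByZone are hypothetical names for illustration, not existing EG code.

```go
package main

import (
	"fmt"
	"sort"
)

// Endpoint is a hypothetical stand-in for a backend endpoint or an Envoy Pod.
type Endpoint struct {
	Address string
	Zone    string
}

// LocalityGroup mirrors the idea of one LocalityLbEndpoints entry per zone.
type LocalityGroup struct {
	Zone      string
	Endpoints []Endpoint
}

// GroupByZone buckets endpoints by zone and returns the groups sorted by
// zone name so that the generated EDS output is stable across runs.
func GroupByZone(eps []Endpoint) []LocalityGroup {
	byZone := map[string][]Endpoint{}
	for _, ep := range eps {
		byZone[ep.Zone] = append(byZone[ep.Zone], ep)
	}
	zones := make([]string, 0, len(byZone))
	for z := range byZone {
		zones = append(zones, z)
	}
	sort.Strings(zones)

	groups := make([]LocalityGroup, 0, len(zones))
	for _, z := range zones {
		groups = append(groups, LocalityGroup{Zone: z, Endpoints: byZone[z]})
	}
	return groups
}

func main() {
	backends := []Endpoint{
		{Address: "10.0.1.1", Zone: "us-east-1b"},
		{Address: "10.0.0.1", Zone: "us-east-1a"},
		{Address: "10.0.0.2", Zone: "us-east-1a"},
	}
	for _, g := range GroupByZone(backends) {
		fmt.Println(g.Zone, len(g.Endpoints))
	}
}
```

The same helper could be applied to the Envoy Pods of the local cluster, since Envoy compares the zone distribution of both clusters when deciding how much traffic to keep local.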

personal preference

Steps 1 and 2 are required by both approaches. Priority levels can work with the already implemented locality weighted load balancing, but zone aware routing can't. Priority levels also appear easier to implement. However, they require EDS resources to be arranged per individual Envoy in the xds/cache module, whether EG does this itself or exposes a new xDS hook API, such as PostEndpointModify(ClusterLoadAssignment, Node), that allows an extension server to do it.

aoledk avatar May 23 '24 10:05 aoledk

thanks for outlining the steps @aoledk ! we currently have https://github.com/envoyproxy/gateway/issues/3055 open to get explicit priority per backendRef and program that into the xds cluster resource.

In the future, we can use this issue to track the auto-priority work; the K8s PreferClose field could be the knob for users to opt in to this feature.

arkodg avatar May 23 '24 21:05 arkodg

Hi @aoledk, regarding:

priority levels [...] EG rearranges EDS resources for each Envoy, if Envoy and Backend endpoint are in same zone, priority as 0, else 1.

Is this option viable? Can our XDS server produce different EDS for different envoy pods that are part of the same Envoy deployment?

guydc avatar Jun 06 '24 18:06 guydc

I think it's possible. xDS server can read the locality info of envoy node.

The cache will be keyed based on a pre-defined hash function whose keys are based on the Node information.

// Identifies a specific Envoy instance. Remote server may have per Envoy configuration.
message Node {
  // An opaque node identifier for the Envoy node. This must be set.
  string id = 1;
  // The cluster that the Envoy node belongs to. This must be set.
  string cluster = 2;
  google.protobuf.Struct metadata = 3;
  Locality locality = 4;
  // This is motivated by informing a management server during canary which
  // version of Envoy is being tested in a heterogeneous fleet.
  string build_version = 5;
}
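
A minimal Go sketch of such a hash function, assuming the cache key is built from the node's cluster plus its zone. Node and Locality here are local stand-ins for the protobuf messages, and SnapshotKey is a hypothetical name; in go-control-plane this logic would sit behind the snapshot cache's NodeHash interface.

```go
package main

import "fmt"

// Locality and Node are simplified local stand-ins for the xDS messages.
type Locality struct {
	Region, Zone, SubZone string
}

type Node struct {
	ID       string
	Cluster  string
	Locality *Locality
}

// SnapshotKey returns the cache key for a node. Envoys in the same cluster
// and zone share one snapshot; Envoys in different zones get different ones.
func SnapshotKey(n *Node) string {
	if n == nil {
		return ""
	}
	zone := ""
	if n.Locality != nil {
		zone = n.Locality.Zone
	}
	return n.Cluster + "/" + zone
}

func main() {
	a := &Node{ID: "envoy-1", Cluster: "default/eg", Locality: &Locality{Zone: "us-east-1a"}}
	b := &Node{ID: "envoy-2", Cluster: "default/eg", Locality: &Locality{Zone: "us-east-1b"}}
	fmt.Println(SnapshotKey(a))
	fmt.Println(SnapshotKey(b))
}
```

Keying on the zone rather than the node ID keeps the number of snapshots bounded by the number of zones instead of the number of Envoy Pods.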

modatwork avatar Jun 07 '24 07:06 modatwork

Thanks for pointing that out @modatwork. My other concerns with this approach are:

  • Conflict with general-purpose use cases of priorities (e.g. to support things like active/passive failover)
  • Possible impact on memory consumption if we have to maintain a copy of the cache for each locality. Not sure if that's already the situation today. @arkodg - do you know?

In general:

  • I'm +1 to supporting zone-aware routing in EG.
  • I would avoid using priorities for this feature in EG's built-in feature set.
  • The extension-server approach to EP priority manipulation could work. If we don't have a per-locality cache, maybe that should be an opt-in feature.

Is there a reason to prefer the Priority-based approach? I'm not sure that it's significantly simpler than enabling zone-aware routing.

guydc avatar Jun 07 '24 18:06 guydc

is @modatwork the same person as @aoledk :) ?

Possible impact on memory consumption if we have to maintain a copy of the cache for each locality. Not sure if that's already the situation today. @arkodg - do you know?

@guydc we are demuxing on gateway/IR; with locality it would add another lookup dimension and would increase memory by num localities * total (xds per gateway * gateway resources)
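
To make that multiplication concrete, a tiny Go sketch follows. EstimateCacheBytes is a hypothetical helper, and the numbers in main are made up for illustration, not measurements.

```go
package main

import "fmt"

// EstimateCacheBytes multiplies the per-gateway xDS footprint by the number
// of gateways and by the number of localities that each need their own copy.
func EstimateCacheBytes(numLocalities, numGateways int, bytesPerGateway int64) int64 {
	if numLocalities < 1 {
		numLocalities = 1 // no locality dimension: one copy per gateway
	}
	return int64(numLocalities) * int64(numGateways) * bytesPerGateway
}

func main() {
	const perGateway = 2 << 20 // assumed 2 MiB of xDS resources per gateway
	fmt.Println("without locality:", EstimateCacheBytes(1, 10, perGateway))
	fmt.Println("with 3 zones:    ", EstimateCacheBytes(3, 10, perGateway))
}
```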

arkodg avatar Jun 07 '24 19:06 arkodg

@arkodg I work together with @modatwork

aoledk avatar Jun 08 '24 15:06 aoledk

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Jul 08 '24 16:07 github-actions[bot]

hey @aoledk , adding this issue to the v1.2 milestone, is this something you can help with ?

  1. Let's configure zone aware routing in Envoy by default https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/zone_aware_routing
  2. If a Service has TrafficDistribution set to PreferClose https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution, let's rearrange the EDS endpoints (so the Service opts in)
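
The opt-in check in item 2 could look roughly like the Go sketch below. ServiceSpec is a local stand-in that only models the optional trafficDistribution string of the Kubernetes Service spec, and PrefersClose is a hypothetical helper name.

```go
package main

import "fmt"

// ServiceSpec is a simplified stand-in for the Kubernetes Service spec,
// keeping only the optional trafficDistribution field.
type ServiceSpec struct {
	TrafficDistribution *string
}

// PrefersClose reports whether the Service has opted in to topology-aware
// traffic distribution by setting trafficDistribution to "PreferClose".
func PrefersClose(spec ServiceSpec) bool {
	return spec.TrafficDistribution != nil && *spec.TrafficDistribution == "PreferClose"
}

func main() {
	preferClose := "PreferClose"
	optedIn := ServiceSpec{TrafficDistribution: &preferClose}
	optedOut := ServiceSpec{}

	fmt.Println(PrefersClose(optedIn))
	fmt.Println(PrefersClose(optedOut))
}
```

Only Services for which this returns true would have their EDS endpoints rearranged; all others would keep today's behaviour.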

arkodg avatar Jul 31 '24 19:07 arkodg

hey @aoledk , adding this issue to the v1.2 milestone, is this something you can help with ?

  1. Let's configure zone aware routing in Envoy by default https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/zone_aware_routing
  2. If a Service has TrafficDistribution set to PreferClose https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution, let's rearrange the EDS endpoints (so the Service opts in)

@arkodg I can help.

aoledk avatar Aug 01 '24 06:08 aoledk

awesome thanks @aoledk !

arkodg avatar Aug 01 '24 16:08 arkodg

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Aug 31 '24 20:08 github-actions[bot]

hey @aoledk still planning on working on this one for v1.2 ?

arkodg avatar Sep 19 '24 19:09 arkodg

hey @aoledk still planning on working on this one for v1.2 ?

Hi @arkodg, I'm currently working on bringing in EG v1.1. Next month I will continue on this feature, but I'm not sure whether it can be merged into v1.2 (due by October 30, 2024); maybe v1.3.

aoledk avatar Sep 20 '24 11:09 aoledk

thanks for the update @aoledk, let us know if you hit any issues while running EG v1.1. Moving this issue into the backlog.

arkodg avatar Sep 20 '24 17:09 arkodg

@arkodg LGTM.

aoledk avatar Sep 23 '24 02:09 aoledk

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Oct 23 '24 04:10 github-actions[bot]

Hi @aoledk! Are you still looking into implementing it yourself? If not, I’m interested in this feature and can work on bringing it to life.

flyik avatar Dec 04 '24 20:12 flyik

@flyik recently I've been busy with bringing in EG, so you can go ahead.

aoledk avatar Dec 04 '24 21:12 aoledk

@flyik I've unassigned myself, you can assign to yourself.

aoledk avatar Dec 04 '24 23:12 aoledk

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Jan 04 '25 04:01 github-actions[bot]

keep

kahirokunn avatar Jan 09 '25 07:01 kahirokunn