Updating LbEndpoint metadata causes connection churn
Description: When Envoy receives a ClusterLoadAssignment with the same endpoints but updated metadata, it tears down the existing connections and re-establishes new ones to all the endpoints. Is this expected? Or is there some other configuration that interacts with this and causes the connection churn?
Is this LbEndpoint metadata or LocalityLbEndpoints?
@htuch It's LbEndpoint metadata - https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/endpoint/v3/endpoint_components.proto#config-endpoint-v3-lbendpoint
I'm updating the filter_metadata in the metadata structure, which causes connection churn.
The metadata value is similar to this:
old
"metadata": {
  "filter_metadata": {
    "envoy.lb": {
      "field1": "value1"
    }
  }
}
The updated metadata looks like this:
"metadata": {
  "filter_metadata": {
    "envoy.lb": {
      "field1": "value1",
      "field2": "value2"
    }
  }
}
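For context, here is a minimal sketch of where this metadata sits inside the EDS ClusterLoadAssignment (the cluster name, address, and port below are placeholders, not taken from the actual setup):

"cluster_name": "example-cluster",
"endpoints": [
  {
    "lb_endpoints": [
      {
        "endpoint": {
          "address": { "socket_address": { "address": "10.0.0.1", "port_value": 8080 } }
        },
        "metadata": {
          "filter_metadata": {
            "envoy.lb": { "field1": "value1" }
          }
        }
      }
    ]
  }
]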
@cpakulski @adisuissa
Envoy version: 1.28.3
I removed the modification to metadata in the EDS update, but that did not resolve the issue.
Digging more into this, it looks like the issue is caused by a CDS update. Below are some Envoy trace logs:
[993365][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:774] add/update cluster echo-srv starting warming
[993365][debug][upstream] [external/envoy/source/common/upstream/cds_api_helper.cc:51] cds: add/update cluster 'echo-srv'
[993365][debug][upstream] [external/envoy/source/common/upstream/upstream_impl.cc:1579] initializing Secondary cluster echo-srv completed
[993931][trace][upstream] [external/envoy/source/common/upstream/upstream_impl.cc:1454] Schedule destroy cluster info echo-srv
There is a CDS update in this time window.
The only change between the previous CDS and the new CDS update for the echo-srv cluster is shown below.
old CDS
lb_subset_config: {
fallback_policy: ANY_ENDPOINT
subset_selectors: { keys: "field1" }
}
The new CDS update has additional subset selectors:
lb_subset_config: {
fallback_policy: ANY_ENDPOINT
subset_selectors: [
{ keys: "field1" },
{ keys: "field2", fallback_policy: NO_FALLBACK },
{ keys: ["field1", "field2"], fallback_policy: NO_FALLBACK },
{ keys: "field3", fallback_policy: NO_FALLBACK },
{ keys: ["field1", "field3"], fallback_policy: NO_FALLBACK }
]
}
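For reference, lb_subset_config is a field on the Cluster resource delivered via CDS, not part of the EDS ClusterLoadAssignment, so changing the selectors changes the Cluster itself. A rough sketch, assuming echo-srv is an EDS cluster (type, eds_cluster_config, and lb_policy are illustrative assumptions, not taken from the actual setup):

name: "echo-srv"
type: EDS
eds_cluster_config: {
  eds_config: { ads: {} }
}
lb_policy: ROUND_ROBIN
lb_subset_config: {
  fallback_policy: ANY_ENDPOINT
  subset_selectors: [
    { keys: "field1" },
    { keys: "field2", fallback_policy: NO_FALLBACK }
    # remaining selectors from the update above omitted
  ]
}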
This causes the creation of a new cluster instance for echo-srv. The new cluster performs a separate EDS query to the control plane and establishes new connections, and the teardown of the old cluster destroys the old connections.
Do we expect the creation of a new cluster and the teardown of the old one when there is a change to lb_subset_config?
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.