Updating LbEndpoint metadata causes connection churn
Description: When Envoy receives a ClusterLoadAssignment with the same endpoints but updated metadata, it tears down the existing connections and re-establishes new ones to all the endpoints. Is this expected? Or is there some other configuration that interacts with this and causes the connection churn?
Is this LbEndpoint metadata or LocalityLbEndpoints?
@htuch It's LbEndpoint metadata - https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/endpoint/v3/endpoint_components.proto#config-endpoint-v3-lbendpoint
I'm updating the filter_metadata in the metadata structure, which causes connection churn.
The metadata value is similar to this:
old
"metadata": {
  "filter_metadata": {
    "envoy.lb": {
      "field1": "value1"
    }
  }
}
The updated metadata looks like this:
"metadata": {
  "filter_metadata": {
    "envoy.lb": {
      "field1": "value1",
      "field2": "value2"
    }
  }
}
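For context, here is a minimal sketch of where this metadata sits inside the EDS ClusterLoadAssignment (the cluster name, address, and port below are placeholders, not taken from the actual setup):

"cluster_name": "example-cluster",
"endpoints": [
  {
    "lb_endpoints": [
      {
        "endpoint": {
          "address": { "socket_address": { "address": "10.0.0.1", "port_value": 8080 } }
        },
        "metadata": {
          "filter_metadata": {
            "envoy.lb": { "field1": "value1" }
          }
        }
      }
    ]
  }
]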
@cpakulski @adisuissa
Envoy version: 1.28.3
I removed the modification to metadata in the EDS update, but that did not resolve the issue.
Digging more into this, it looks like the issue is caused by a CDS update. Below are some Envoy trace logs:
[993365][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:774] add/update cluster echo-srv starting warming
[993365][debug][upstream] [external/envoy/source/common/upstream/cds_api_helper.cc:51] cds: add/update cluster 'echo-srv'
[993365][debug][upstream] [external/envoy/source/common/upstream/upstream_impl.cc:1579] initializing Secondary cluster echo-srv completed
[993931][trace][upstream] [external/envoy/source/common/upstream/upstream_impl.cc:1454] Schedule destroy cluster info echo-srv
There is a CDS update in this time window.
The only change between the previous CDS and the new CDS update for the echo-srv cluster is shown below.
old CDS
lb_subset_config: {
fallback_policy: ANY_ENDPOINT
subset_selectors: { keys: "field1" }
}
The new CDS update has additional subset selectors:
lb_subset_config: {
fallback_policy: ANY_ENDPOINT
subset_selectors: [
{ keys: "field1" },
{ keys: "field2", fallback_policy: NO_FALLBACK },
{ keys: ["field1", "field2"], fallback_policy: NO_FALLBACK },
{ keys: "field3", fallback_policy: NO_FALLBACK },
{ keys: ["field1", "field3"], fallback_policy: NO_FALLBACK }
]
}
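For reference, lb_subset_config is a field on the Cluster resource delivered via CDS, not part of the EDS ClusterLoadAssignment, so changing the selectors changes the Cluster itself. A rough sketch, assuming echo-srv is an EDS cluster (type, eds_cluster_config, and lb_policy are illustrative assumptions, not taken from the actual setup):

name: "echo-srv"
type: EDS
eds_cluster_config: {
  eds_config: { ads: {} }
}
lb_policy: ROUND_ROBIN
lb_subset_config: {
  fallback_policy: ANY_ENDPOINT
  subset_selectors: [
    { keys: "field1" },
    { keys: "field2", fallback_policy: NO_FALLBACK }
    # remaining selectors from the update above omitted
  ]
}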
This causes the creation of a new cluster instance for echo-srv. The new cluster performs a separate EDS query to the control plane and establishes new connections, and the teardown of the old cluster destroys the old connections.
Do we expect the creation of a new cluster and the teardown of the old one when there is a change to lb_subset_config?
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.