Title: How can existing WebSocket connections be kept alive when listeners are updated without an Envoy hot restart?

Description:

We use file-system-based LDS for dynamic resource updates, and Envoy also acts as a WebSocket proxy. When an LDS change occurs (IP, socket options, or TLS config), the existing listener is drained and a new listener is created. However, the WebSocket connections on the old listener are broken during this update. Is there any method or solution that lets the existing connections switch smoothly to the new listener?

Jul 18 '24 13:07 wufanqqfsc

AFAIK, LDS will update in place for some filter chain changes but otherwise we will drain the existing listener which will drain the existing websocket connection as you've seen. AFAIK there's no mechanism to otherwise get around this.

Jul 18 '24 14:07 KBaichoo

@KBaichoo if the Envoy listener can't do this, how does Envoy handle old and new connections and traffic smoothly during a control plane configuration update?

Jul 19 '24 13:07 wufanqqfsc

See https://github.com/envoyproxy/envoy/blob/1abf5e106fd15d7636e306b02c08ca55ec4bbd27/source/common/listener_manager/listener_manager_impl.cc#L800 for how the in-place filter chain update works, and look at its callers to see the conditions under which it applies.

I don't think it's a good idea to expand those criteria to other fields such as IP, socket options, etc.

See also https://www.envoyproxy.io/docs/envoy/latest/operations/cli#cmdoption-drain-time-s if you want to increase your drain timeout so that draining WS connections live longer.

Jul 19 '24 14:07 KBaichoo

@KBaichoo what happens if drain-time is set to -1? It seems the old listener will no longer be drained, so the old connections remain usable, and the new listener is also bound to the workers.

After all the legacy connections on the old listener's filter chains have closed, will the old listener continue draining or not?

Jul 23 '24 14:07 wufanqqfsc

I think it'll set the value to uint32_t::max which will effectively disable draining.
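To put "effectively disable" in numbers, here is a small arithmetic sketch. It assumes, as the comment above suggests but does not confirm, that -1 ends up as the maximum 32-bit unsigned value and that the value is interpreted in seconds (the unit of `--drain-time-s`).

```python
# Assumption: -1 is stored as uint32 max and interpreted as seconds.
UINT32_MAX = 2**32 - 1
SECONDS_PER_YEAR = 365.25 * 24 * 3600

years = UINT32_MAX / SECONDS_PER_YEAR
print(f"{UINT32_MAX} s is about {years:.1f} years")
```

Under those assumptions the drain window is well over a century, so the drain timer would never fire in practice.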

Jul 24 '24 14:07 KBaichoo

Yes, we have done some tests. uint32_t::max or a large value may work here, but our concern is what happens to these draining listener objects if listener resources are updated many times. Is there a memory leak risk, since these objects may never be destroyed because the drain timer is never triggered?

Jul 25 '24 08:07 wufanqqfsc

> Is there a memory leak risk, since these objects may never be destroyed because the drain timer is never triggered?

I'd think so, since you're preventing cleanup. You should measure it yourself to see whether it's appropriate for your use case. It's a tradeoff between the drain timeout and how long resources stay held. Maybe 1h, 3h, 6h, 12h, or 24h would be a sufficient drain timeout, rather than "never drain".
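A rough way to reason about that tradeoff is to estimate how many old listeners could be draining at the same time. This is only a back-of-the-envelope sketch: the function name and the 10-minute update interval are made up for illustration, and it assumes every update replaces the listener and that each old listener is held for the full drain time. Real memory use should still be measured.

```python
def max_concurrent_draining(update_interval_s: int, drain_time_s: int) -> int:
    """Upper bound on old listeners held at once (hypothetical model).

    Assumes one listener replacement every update_interval_s seconds and
    that each old listener is released only after drain_time_s seconds.
    """
    # Integer ceiling division.
    return -(-drain_time_s // update_interval_s)


# Assumed workload: one listener update every 10 minutes.
for hours in (1, 3, 6, 12, 24):
    print(hours, "h ->", max_concurrent_draining(600, hours * 3600))
```

With an unbounded drain time this number grows with every update, which is the leak-like behavior described above.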

Jul 25 '24 13:07 KBaichoo

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

Aug 24 '24 16:08 github-actions[bot]

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

Aug 31 '24 20:08 github-actions[bot]

What happens if multiple old filter chains are not drained? Will it increase the memory requirements? Is there a limit on the maximum number of filter chains that Envoy can manage?

Nov 06 '24 14:11 AmitKatyal-Sophos

> What happens if multiple old filter chains are not drained?

Nothing happens, but some objects will not be released.

> Will it increase the memory requirements?

Yes.

> Is there a limit on the maximum number of filter chains that Envoy can manage?

There seems to be no such limit. At least I can't find such a parameter.

We chose the following solution to enhance Envoy for our case.

We do not use the drain-time control logic and instead add a new "drain-check-interval" setting. If drain-check-interval is configured:

1. We do not drain the listener and its filter chains right away.
2. We start a timer, based on the "drain-check-interval" value, that checks on all workers the existing connections of the listener and filter chains that should be drained.

Once the connections on all workers are closed, the listener and filter chains are drained, so existing WebSocket connections are not affected when a listener update and drain happens.
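The check described above can be sketched as a small simulation. This is not Envoy code and not the actual patch: `DrainingListener`, `drain_check`, and the per-worker connection map are hypothetical names used only to illustrate the idea of draining once every worker reports zero connections.

```python
from dataclasses import dataclass, field


@dataclass
class DrainingListener:
    """Hypothetical stand-in for an old listener kept alive after an update."""
    name: str
    # Active connection count per worker (worker id -> count).
    connections_per_worker: dict = field(default_factory=dict)
    drained: bool = False

    def all_workers_idle(self) -> bool:
        return all(n == 0 for n in self.connections_per_worker.values())


def drain_check(listener: DrainingListener) -> bool:
    """One tick of the hypothetical drain-check-interval timer.

    Marks the listener as drained only when no worker has connections left.
    Returns True once the listener has been drained.
    """
    if not listener.drained and listener.all_workers_idle():
        listener.drained = True
    return listener.drained


old = DrainingListener("listener_v1", {0: 2, 1: 0})
print(drain_check(old))  # False: worker 0 still has open WebSocket connections
old.connections_per_worker[0] = 0
print(drain_check(old))  # True: all workers idle, so the listener is drained
```

In a real implementation the timer would re-arm itself at the configured interval until the check succeeds.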

Nov 07 '24 09:11 wufanqqfsc

@wufanqqfsc Thanks for the quick response!

Has this custom solution been ported back to the public Envoy repository? If so, could you please share the PR details?

Nov 07 '24 10:11 AmitKatyal-Sophos

Not yet. @KBaichoo, any comments on this solution? I can provide the patch if that's OK.

Nov 07 '24 13:11 wufanqqfsc

Also, one more question: we update the listener TLS context using SDS. Will a TLS context update via SDS affect WebSocket connections? According to the Envoy documentation, a listener filter chain update drains the corresponding listener.

In our testing, we don't see connections getting terminated.

Nov 07 '24 13:11 AmitKatyal-Sophos