envoy icon indicating copy to clipboard operation
envoy copied to clipboard

Extend the functionality of max_connection_duration + drain_timeout

Open Winbobob opened this issue 1 year ago • 9 comments

Title: Extend the functionality of max_connection_duration and drain_timeout by introducing drain_timeout_jitter and drain_percentage.

Description: We are looking into ways to balance long-lived HTTP downstream connections. max_connection_duration + drain_timeout can be used for this purpose. However, we think its functionality could be extended and be more flexible. Assume this scenario: one client establishes many connections almost at the same time, once the connection duration limit reaches and there are no active streams, all connections will be drained in a short period, which may cause high latency for later requests since establishing connections need some time.

Proposal:

  • Introduce drain_timeout_jitter, default to 0. If it is set to > 0, not all connections will be drained at the same time and if the jitter is long enough new connections have already been established and the impact of client could be minimal.

  • Introduce drain_percentage, default to 1. If it is set to (0,1), only a certain percentage of connections will be drained this time. For other connections, the connection duration will be reset and wait for the next round of drain.

Relevant Links: https://github.com/envoyproxy/envoy/issues/15283

Winbobob avatar Jul 23 '24 23:07 Winbobob

cc @alyssawilk and @mattklein123, any thoughts?

Winbobob avatar Jul 29 '24 01:07 Winbobob

So the problem here is you have a high traffic service, you have a set connection drain time, and all your connections drain around the same time resulting in a bunch of lag all at the same time instead of spread out lag? I think it's OK to add a jitter function but I'd encourage you to instead experiment with preconnect since if you have a preconnect ratio set you'd have a preestablished connection when each drained, and you'd theoretically have no slow start latency at all, instead taking your batched latency and smearing it out a bit.

alyssawilk avatar Jul 29 '24 15:07 alyssawilk

experiment with preconnect

Thank you for the pointer! I went through the doc, looks like per_upstream_preconnect_ratio will only work for non-multiplexed connections?

Winbobob avatar Jul 29 '24 17:07 Winbobob

I think for H2 it still works, it just may not establish many connections. e.g. if you have 100 streams and want 10% prefetch ratio you need to prepare for 10 extra streams and will have 1 extra connection. If you have 5000 streams you'll have 5 extra connections

alyssawilk avatar Jul 29 '24 19:07 alyssawilk

So the problem here is you have a high traffic service, you have a set connection drain time, and all your connections drain around the same time resulting in a bunch of lag all at the same time instead of spread out lag?

Not exactly. The problem is that we want to balance long-lived HTTP downstream connections. And the PreconnectPolicy only works for upstream connections. So looks like we still need to implement the jitter function.

Winbobob avatar Jul 29 '24 19:07 Winbobob

oh sorry misread and mis-assumed, yeah preconnect is for upstream and you're looking at downstream. jitter SGTM

alyssawilk avatar Jul 29 '24 19:07 alyssawilk

Thank you for the confirmation! No problem at all! I really should highlight the downstream in the description.

Winbobob avatar Jul 29 '24 19:07 Winbobob

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 28 '24 20:08 github-actions[bot]

This issue is not stale, I will try to work on the implementation.

Winbobob avatar Aug 28 '24 20:08 Winbobob

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Sep 28 '24 00:09 github-actions[bot]

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

github-actions[bot] avatar Oct 05 '24 04:10 github-actions[bot]

@Winbobob, @alyssawilk was there any workaround or solution built for adding a jitter to max connection duration?

dragonmushu avatar Mar 03 '25 17:03 dragonmushu