Configure envoy's idle_timeout for TCP
Feature Description
Provide a way or an example to set envoy's idle_timeout
Use Case(s)
I’m having a TCP keep-alive issue: idle connections get disconnected by the proxy after an hour, and I want to override idle_timeout either at the service definition or globally.
I saw some examples like the one below, but it’s not clear to me whether this envoy_public_listener_json will override the service name/port that was defined at the sidecar_service level.
{
"service": {
"name": "test",
"connect": {
"sidecar_service": {
"port": "xxxx",
"proxy": {
"upstreams": [
{
"destination_name": "xxxxx",
"local_bind_port": "xxxxx",
"config": {
envoy_public_listener_json= <<EOL
{
"name": "test",
"address": {
"socket_address": {
"address": "0.0.0.0",
"port_value": "xxxx"
}
},
"filter_chains": [
{
"filters": [
{
"name": "envoy.tcp_proxy",
"config": {
"idle_timeout": "7200s"
}
}
]
}
]
}
EOL
}
}
]
}
}
}
}
}
@mhomaid1 Thanks for using Consul Service Mesh. Though several Envoy features are configurable directly through Consul, there are other less common ones, like the idle_timeout option you mention in this issue, that are not exposed directly.
There are multiple types of listeners that Envoy exposes. The envoy_public_listener_json config option overrides the single public listener that accepts inbound connections. Each upstream you define is also associated with a corresponding Envoy listener. The configuration for those is overridden through a different [envoy_listener_json](https://www.consul.io/docs/connect/proxies/envoy#envoy_listener_json) option.
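To make the distinction concrete, here is a minimal Python sketch of where each escape hatch would sit in a service definition. The helper name `build_service_definition`, the upstream name `db`, and the port `5432` are hypothetical placeholders, not values from this issue. It also assumes the definition is written as JSON, in which case the listener has to be embedded as a JSON-encoded string.

```python
import json


def build_service_definition(upstream_listener, public_listener=None):
    """Build a sidecar service definition with listener escape hatches.

    upstream_listener: dict holding the full Envoy listener for one upstream.
    public_listener:   optional dict holding the full public (inbound) listener.
    Both are embedded as JSON strings, since the escape hatch values are strings.
    """
    proxy = {
        "upstreams": [
            {
                "destination_name": "db",   # hypothetical upstream name
                "local_bind_port": 5432,    # hypothetical local port
                # Per-upstream override: lives in the upstream's own config.
                "config": {"envoy_listener_json": json.dumps(upstream_listener)},
            }
        ]
    }
    if public_listener is not None:
        # Public listener override: lives in the proxy-level config.
        proxy["config"] = {"envoy_public_listener_json": json.dumps(public_listener)}
    return {
        "service": {
            "name": "test",
            "connect": {"sidecar_service": {"proxy": proxy}},
        }
    }
```

Calling `json.dumps(build_service_definition(listener), indent=2)` would produce a definition in which only the upstream listener is overridden and the public listener is left to Consul.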
We recommend the following steps to use escape hatch functionality correctly:
- Determine which listener you want to override using the existing escape_hatch mechanism. Given your example above, it looks like you are trying to override the idle_timeout for an upstream listener rather than the public listener for inbound requests.
- Configure Consul to set up the listener without an escape hatch first
- Copy the generated listener from the Envoy admin API (http://localhost:19000/config_dump). You should be able to identify the listener in the overall config dump: its name is prefixed with the name of the upstream service.
- Edit the JSON you copied to add the missing flag (idle_timeout in this case). Note that if you are updating the public listener, you'll need to remove the TLS context and rbac/authz filters (this doesn't apply if you are updating the listener config for an upstream).
- Drop that into the appropriate escape hatch override.
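The "copy the generated listener" step can be scripted. The sketch below is one possible approach, not an official tool. It walks the config dump recursively instead of relying on a fixed layout, because the exact nesting of listeners in the dump differs between Envoy versions. The function names are hypothetical, and the admin URL is the one quoted in the steps above.

```python
import json
from urllib.request import urlopen

# Admin endpoint quoted in the steps above; adjust if your admin bind differs.
ADMIN_URL = "http://localhost:19000/config_dump"


def fetch_config_dump(url=ADMIN_URL):
    """Download and parse the Envoy config dump."""
    with urlopen(url) as resp:
        return json.load(resp)


def find_listeners(node, name_prefix):
    """Collect every dict that looks like a listener for the given prefix.

    A dict counts as a listener here if it has a "filter_chains" key and a
    string "name" that starts with name_prefix (an assumption of this sketch).
    """
    found = []
    if isinstance(node, dict):
        name = node.get("name")
        if isinstance(name, str) and name.startswith(name_prefix) and "filter_chains" in node:
            found.append(node)
        for value in node.values():
            found.extend(find_listeners(value, name_prefix))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_listeners(item, name_prefix))
    return found
```

For an upstream named `db` (a placeholder), `find_listeners(fetch_config_dump(), "db")` would return the candidate listeners to copy and edit.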
We continue to add first-class configuration support for more commonly used Envoy features, but the above set of steps should help you override fields like idle_timeout.
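For the editing step, a small helper can add the timeout to a copied listener. This is a sketch under a few assumptions. The function name is hypothetical. The TCP proxy filter may appear under its older name `envoy.tcp_proxy` or the newer `envoy.filters.network.tcp_proxy`, with its settings under `config` or `typed_config` depending on the Envoy version. The duration is written in the protobuf JSON form, in seconds with an `s` suffix, such as `"7200s"` for two hours.

```python
import copy

# Both the older and the newer name of Envoy's TCP proxy network filter.
TCP_PROXY_NAMES = {"envoy.tcp_proxy", "envoy.filters.network.tcp_proxy"}


def with_tcp_idle_timeout(listener, timeout="7200s"):
    """Return a copy of the listener with idle_timeout set on TCP proxy filters.

    The input dict is left untouched so the original dump can be kept for diffing.
    """
    patched = copy.deepcopy(listener)
    for chain in patched.get("filter_chains", []):
        for flt in chain.get("filters", []):
            if flt.get("name") in TCP_PROXY_NAMES:
                # Newer dumps use "typed_config"; older ones use "config".
                key = "typed_config" if "typed_config" in flt else "config"
                flt.setdefault(key, {})["idle_timeout"] = timeout
    return patched
```

The result can then be serialized with `json.dumps` and placed in the matching escape hatch option.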
Hope this helps.
@preetapan Thank you, I will give it a try.
It's working for me, thank you.
I'm facing keep-alive issues with Elasticsearch (it seems that they are changing keep-alive config behavior in master), RabbitMQ, and possibly any application that opens long-lived connections without data moving all the time. I'm sure there is a reason behind Envoy's idle_timeout default value (1 hour), but it causes a problem because applications usually respect the kernel's tcp_keepalive_time parameter, which defaults to 2 hours.
I tried to lower tcp_keepalive_time but it didn't help.
I prefer to fix the problem in Envoy/Consul rather than fixing it everywhere else because everything was working fine before.
Using the escape hatch functionality works fine, but it would be great if this were supported by Consul. If you are planning to add support for idle_timeout, I would like to contribute.
I could use this too. Looks like the latest version of Consul implements idle_timeout for HTTP requests, but there is still no way to set the idle_timeout for TCP.
I'm also facing connection drops with all my TCP services through the mesh (postgres, redis). Most of the time, the client will re-establish the connection right away, but it can cause transient issues (e.g., a health check failing, the app disappearing from Traefik, or being restarted by Nomad when using check_restart).