ingress-nginx
Doc: AWS NLB idle timeout
There is an unclear statement in the documentation regarding the AWS NLB idle timeout. It says that keepalive_timeout must be less than the NLB idle timeout (350s).
With the default value of 75s, my assumption is that the controller silently closes the TCP session after that period. Once a request then hits the NLB, the session is already "dead" and the client gets a connection timeout after 5 seconds.
Don't we need to set keepalive_timeout higher than 350s?
https://github.com/kubernetes/ingress-nginx/issues/6036 https://github.com/kubernetes/ingress-nginx/issues/5548
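For reference, this is how I checked what the controller actually renders; a rough sketch, assuming a default install (the ingress-nginx namespace and the controller label are assumptions and may differ in your setup):

```shell
# Pick an ingress-nginx controller pod (namespace and label are assumptions).
POD=$(kubectl -n ingress-nginx get pods \
  -l app.kubernetes.io/component=controller \
  -o jsonpath='{.items[0].metadata.name}')

# Dump the rendered nginx configuration and look at the effective value.
# With the defaults this prints "keepalive_timeout 75s;".
kubectl -n ingress-nginx exec "$POD" -- nginx -T | grep keepalive_timeout
```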
@iusergii: This issue is currently awaiting triage.
If Ingress contributors determine that this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-kind bug /area docs
@aledbf can you check if your fix might have been the wrong way around?
According to the AWS NLB documentation, the NLB uses TCP keep-alive to maintain TCP connections to its targets. TCP keep-alive is not the same as HTTP/1.1 keep-alive.
If you use an AWS CLB (L7), it speaks HTTP/1.1 keep-alive to nginx, so nginx's keep-alive timeout must be higher than the CLB's, and every upstream's keep-alive timeout must in turn be higher than nginx's.
But if you use an AWS NLB (L4), it uses TCP keep-alive toward nginx. Keep in mind that the NLB silently drops connections that have been idle for more than 350s; nginx only finds out when it picks up one of those idle connections and receives a TCP RST from the NLB telling it the picked connection is no longer valid. To prevent nginx from picking up a dead idle connection, nginx's keep-alive timeout must be lower than the NLB's. Therefore, the ingress-nginx documentation is right.
In summary, for your case: nginx's keep-alive timeout should be less than the NLB's 350s, and the upstreams' keep-alive should be higher than nginx's.
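If it helps, on the ingress-nginx side these two knobs live in the controller ConfigMap: keep-alive controls the client-facing keepalive_timeout (the NLB side) and upstream-keepalive-timeout controls how long idle connections to the backends are kept. A minimal sketch, assuming the default ConfigMap name and namespace; the values are only illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller   # name and namespace are assumptions
  namespace: ingress-nginx
data:
  # Client-facing keepalive_timeout; keep it below the NLB's fixed 350s idle timeout.
  keep-alive: "300"
  # Idle timeout for connections from nginx to the backends; the backend
  # application's own keep-alive should be higher than this.
  upstream-keepalive-timeout: "60"
```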
@HsinHeng thank you for the details. I'm raising this because I'm following the recommendation for the keep-alive timeout (300s) and have periodically been getting connection timeouts since we migrated to NLB. It seems the connection flow has a bunch of moving parts (a quick inspection sketch follows the list):
- NLB (350s)
- Linux tcp_keepalive (2h)
- kube-proxy/IPVS timeouts (900s)
- nginx
There is also an NLB limitation when an "internal" request is made with client IP preservation enabled.
I'm a bit puzzled and would love to see more documentation/recommendations regarding AWS NLB.
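For what it's worth, this is roughly how I have been checking each layer from a node; just a sketch, and the kube-proxy flags in the comment are the upstream IPVS timeout flags rather than anything ingress-nginx specific:

```shell
# Linux TCP keep-alive defaults (first probe only after 2h of idle time).
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes

# IPVS connection timeouts when kube-proxy runs in IPVS mode;
# typically prints "Timeout (tcp tcpfin udp): 900 120 300".
ipvsadm -L --timeout

# The matching kube-proxy flags, should these need tuning:
#   --ipvs-tcp-timeout, --ipvs-tcpfin-timeout, --ipvs-udp-timeout
```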
@iusergii I found the discussion at https://github.com/Kong/kong/issues/9169. Maybe enabling nginx's listen so_keepalive will resolve your timeout problem. Actually, I am a little confused about the nginx keepalive (default 75s); maybe it is related to HTTP/1.1 keep-alive.
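For context, so_keepalive is a parameter of nginx's listen directive that enables TCP keep-alive probes on accepted client sockets, which is what that Kong thread relies on; AWS notes that TCP keepalive packets can reset the NLB's 350s idle timer. A raw nginx sketch of the mechanism (I am not claiming ingress-nginx exposes this as a ConfigMap option; the values and paths are placeholders):

```nginx
server {
    # Send TCP keep-alive probes on idle client connections: first probe after
    # 5 minutes of idle time, OS-default interval, give up after 3 failed probes.
    listen 443 ssl so_keepalive=5m::3;
    server_name example.com;                 # placeholder
    ssl_certificate     /etc/nginx/tls.crt;  # placeholder
    ssl_certificate_key /etc/nginx/tls.key;  # placeholder

    location / {
        proxy_pass http://127.0.0.1:8080;    # placeholder backend
    }
}
```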
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hello @iusergii,
We are facing exactly the same behaviour. Have you found any solution to this?
With the default value of 75s, my assumption is that the controller silently closes the TCP session after that period.
We see a high number of "Target Reset Count" in the NLB monitoring, which I assume means the NLB is aware of the closed connections.
the client gets a connection timeout after 5 seconds.
We also see 5s but could not find an explanation for it. Why exactly 5s?