
Doc: AWS NLB idle timeout

Open iusergii opened this issue 3 years ago • 6 comments

There is an unclear statement in the documentation regarding the AWS NLB idle timeout. It says that keepalive_timeout must be less than the NLB idle timeout (350s).
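
For concreteness, a minimal sketch of what that recommendation looks like in plain nginx terms (300s is only an illustrative value below 350s; in ingress-nginx this directive is normally driven by the `keep-alive` ConfigMap option):

```nginx
events {}

http {
    # Keep idle client connections open for less time than the NLB's 350s
    # idle timeout, so that nginx (not the NLB) times the connection out first.
    keepalive_timeout 300s;

    server {
        listen 80;
        return 200 "ok\n";
    }
}
```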

With the default value of 75s, my assumption is that the controller silently closes the TCP session after that interval. Once a request then hits the NLB, the session is "dead" and the client gets a connection timeout after 5 seconds.

Don't we need to set keepalive_timeout higher than 350s?

https://github.com/kubernetes/ingress-nginx/issues/6036 https://github.com/kubernetes/ingress-nginx/issues/5548

iusergii avatar Jun 27 '22 20:06 iusergii

@iusergii: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 27 '22 20:06 k8s-ci-robot

/remove-kind bug
/area docs

longwuyuan avatar Jun 27 '22 23:06 longwuyuan

@aledbf can you check if your fix might have been the wrong way around?

thomaschaaf avatar Jul 11 '22 13:07 thomaschaaf

According to the AWS NLB documentation, the NLB uses TCP keep-alive to keep the TCP connection to the upstream open. TCP keep-alive is not the same as HTTP/1.1 keep-alive.

If you use an AWS CLB (L7), it speaks HTTP/1.1 keep-alive to nginx, so nginx's keep-alive timeout must be higher than the CLB's, and in turn every upstream's keep-alive timeout must be higher than nginx's.

But if you use an AWS NLB (L4), it speaks TCP keep-alive to nginx. Keep in mind that the NLB does not notify anyone when it drops a flow that has been idle for more than 350s; nginx only finds out when it picks up one of those idle connections and receives a TCP RST from the NLB telling it the connection is no longer valid. To prevent nginx from picking up a dead idle connection, nginx's keep-alive timeout must be lower than the NLB's. Therefore, the ingress-nginx documentation is right.

In summary, for your case: nginx's keep-alive timeout should be less than the AWS NLB's 350s, and the upstreams' keep-alive timeouts should be higher than nginx's.
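
A rough sketch of that ordering in plain nginx terms (all values and the backend address are illustrative; in ingress-nginx the front side is typically set via the `keep-alive` ConfigMap option and the back side via `upstream-keepalive-timeout`):

```nginx
events {}

http {
    # Front side: lower than the NLB idle timeout (350s), so nginx closes
    # idle client connections before the NLB silently drops the flow.
    keepalive_timeout 300s;

    upstream backend {
        server 10.0.0.10:8080;   # hypothetical backend address

        # Back side: nginx reuses these connections, so its idle timeout here
        # must be lower than the backend's own keep-alive timeout (i.e. the
        # backend should keep idle connections open for longer than 60s).
        keepalive 32;
        keepalive_timeout 60s;
    }

    server {
        listen 80;
        location / {
            # Required for upstream connection reuse to take effect.
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_pass http://backend;
        }
    }
}
```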

HsinHeng avatar Jul 12 '22 16:07 HsinHeng

@HsinHeng thank you for the details. I'm raising this because I'm following the recommended keep-alive timeout (300s) and have periodically been getting connection timeouts since we migrated to the NLB. It seems the connection flow has a bunch of moving parts:

  • NLB (350s)
  • Linux tcp_keep_alive (2h)
  • kube-proxy/IPVS timeouts (900s)
  • nginx

There is also an NLB limitation when an "internal" request is made with client IP preservation enabled.

I'm a bit puzzled and would love to see more documentation/recommendations regarding AWS NLB.

iusergii avatar Jul 17 '22 16:07 iusergii


@iusergii I found the discussion at https://github.com/Kong/kong/issues/9169. Maybe enabling so_keepalive on the nginx listen directive will resolve your timeout problem. Actually, I am a little confused about the nginx keepalive timeout (default 75s); maybe it is related to HTTP/1.1 keep-alive.
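
For reference, so_keepalive is a parameter of the nginx listen directive that enables TCP keep-alive probes on accepted client sockets. A minimal sketch of the idea from that Kong thread (the probe values are illustrative, and how to wire this into the ingress-nginx template is not covered here):

```nginx
events {}

http {
    server {
        # so_keepalive=keepidle:keepintvl:keepcnt turns on TCP keep-alive
        # probes for accepted client sockets: start probing after 300s of
        # idle time, probe every 60s, give up after 3 failed probes. The
        # probes keep the NLB flow entry alive even with no HTTP traffic.
        listen 80 so_keepalive=300s:60s:3;

        return 200 "ok\n";
    }
}
```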

HsinHeng avatar Aug 10 '22 17:08 HsinHeng

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 09 '22 02:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Dec 09 '22 02:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jan 08 '23 03:01 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 08 '23 03:01 k8s-ci-robot

Hello @iusergii,

We are facing exactly the same behaviour. Have you found any solution to this?

With a default param of 75s, my assumption is that the controller silently closes the TCP session after it.

In the NLB monitoring we see a high "Target Reset Count", which I assume means the NLB is aware of the closed connections.

client gets a connection timeout after 5 seconds.

We also see the 5s timeout but could not find an explanation for it. Why exactly 5 seconds?

KhasDenis avatar Mar 20 '24 09:03 KhasDenis