
Investigate increasing open file limit

Open MattIPv4 opened this issue 2 years ago • 12 comments

As a follow-on from https://github.com/nodejs/nodejs.org/issues/5184 and https://github.com/nodejs/nodejs.org/issues/5149, we should look at increasing the number of files that NGINX can have open at any one time.

This could either be done by increasing the system ulimit, or by setting worker_rlimit_nofile in NGINX.
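For reference, the two knobs look roughly like this (the numbers are purely illustrative, not a concrete proposal):

nginx.conf:

# raise the open-file limit for worker processes
worker_rlimit_nofile 65535;

events {
    # each client connection can hold more than one descriptor open (socket + static file)
    worker_connections 4096;
}

System limits, as seen from a shell on the host:

ulimit -Sn   # current soft limit
ulimit -Hn   # current hard limit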

MattIPv4 avatar Mar 27 '23 19:03 MattIPv4

> or by setting worker_rlimit_nofile in NGINX.

Are we allowed to do that without increasing the system limit?

targos avatar Mar 28 '23 14:03 targos

We're starting to see a lot more errors that look like this today, which I think is related.

HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR

I found this ticket from https://github.com/nodejs/nodejs.org/issues/5184

blimmer avatar Mar 29 '23 20:03 blimmer

> Are we allowed to do that without increasing the system limit?

I've been testing this locally with NGINX in Docker Compose...
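For anyone wanting to reproduce this, the Compose setup was roughly the following shape (service name, image tag, port, and the mounted static content are placeholders/assumptions; the mounted nginx.conf and the ulimits block are the parts each scenario below varies):

docker-compose.yml:

services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      # main config, so worker_rlimit_nofile can be toggled
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      # static copy of the site, including en/blog.html
      - ./site:/usr/share/nginx/html:ro
    ulimits:
      nofile:
        soft: 50     # varied per scenario
        hard: 2048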

Soft limit set low, no NGINX limit: errors ❌

docker-compose.yml:

  ulimits:
    nofile:
      soft: 50
      hard: 2048

nginx.conf:

# worker_rlimit_nofile 65535;

Getting back errors:

ab -n 100 -c 50 -v 2 localhost/en/blog.html | grep "Non-2xx"

Non-2xx responses:      2

Soft limit set low, NGINX limit set high: errors ❌

docker-compose.yml:

  ulimits:
    nofile:
      soft: 50
      hard: 2048

nginx.conf:

worker_rlimit_nofile 65535;

Getting back errors:

ab -n 100 -c 50 -v 2 localhost/en/blog.html | grep "Non-2xx"

Non-2xx responses:      2

Soft limit set high, no NGINX limit set: works ✅

docker-compose.yml:

  ulimits:
    nofile:
      soft: 200
      hard: 2048

nginx.conf:

# worker_rlimit_nofile 65535;

No errors:

ab -n 100 -c 50 -v 2 localhost/en/blog.html | grep "Non-2xx"

[no output]

Soft limit set high, NGINX limit set high: works ✅

docker-compose.yml:

  ulimits:
    nofile:
      soft: 200
      hard: 2048

nginx.conf:

worker_rlimit_nofile 65535;

No errors:

ab -n 100 -c 50 -v 2 localhost/en/blog.html | grep "Non-2xx"

[no output]

Soft limit set high, NGINX limit set low: errors ❌

docker-compose.yml:

  ulimits:
    nofile:
      soft: 200
      hard: 2048

nginx.conf:

worker_rlimit_nofile 50;

Getting back errors:

ab -n 100 -c 50 -v 2 localhost/en/blog.html | grep "Non-2xx"

Non-2xx responses:      2

So, it would seem that we would need to bump the system soft limit for it to have an impact on NGINX?
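If we do end up bumping the soft limit on the server itself, and assuming NGINX there is managed by systemd (which I haven't checked), the usual route would be a unit drop-in rather than /etc/security/limits.conf, since pam_limits doesn't apply to systemd services:

/etc/systemd/system/nginx.service.d/override.conf:

[Service]
LimitNOFILE=65535

Then reload and verify what the workers actually ended up with:

systemctl daemon-reload && systemctl restart nginx
grep 'Max open files' /proc/$(pgrep -o -f 'nginx: worker')/limits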

MattIPv4 avatar Mar 31 '23 21:03 MattIPv4

I've raised the file limit in nginx.conf. I've attempted to Ansible the task in https://github.com/nodejs/build/pull/3406 (although we don't run the Ansible scripts for the web server against the live server).

richardlau avatar Jul 03 '23 16:07 richardlau

Since raising the open file limit we've not had any load balancing alerts from Cloudflare -- the last event was from before the limit was raised: [screenshot of Cloudflare load balancing events]

We have, however, since had several reports of slow downloads leading to timeouts

  • https://github.com/nodejs/build/issues/3408
  • https://github.com/nodejs/nodejs.org/issues/5472
  • https://github.com/nodejs/nodejs.org/issues/5471

so we may have just swapped one issue for another.

richardlau avatar Jul 05 '23 12:07 richardlau

@richardlau if it's helpful input, my colleagues and I are experiencing the issue described in https://github.com/nodejs/nodejs.org/issues/5472. We just started noticing it this week. It seemed better for us even just last week.

markandrus avatar Jul 05 '23 15:07 markandrus

I've reverted the file limit raising from https://github.com/nodejs/build/issues/3259#issuecomment-1618883633.

richardlau avatar Jul 05 '23 15:07 richardlau

~~With that file limit increase in place, were any stats recorded from the server that'd indicate what the new bottleneck was? Given that this fixed Cloudflare failing over, the bottleneck before was definitely the file limit, but what is it now? Perhaps CPU?~~

Edit: From discussion in the OpenJSF Slack, the bottleneck appeared to be network throughput saturation with the file limit increased.
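For anyone wanting to double-check that on the host (assuming the sysstat tools are installed there), per-interface throughput is visible with:

sar -n DEV 1 5   # rx/tx kB/s per interface, five one-second samples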

MattIPv4 avatar Jul 05 '23 16:07 MattIPv4

It seems like this fixed things 🤔 Thank you!

markandrus avatar Jul 07 '23 10:07 markandrus

We are experiencing issues that I'm guessing are related to this: curl: (18) HTTP/2 stream 1 was reset when attempting to download Node.

narwold avatar Sep 26 '23 13:09 narwold

Are we still going to investigate this given that we intend to move to Cloudflare R2?

ovflowd avatar Dec 03 '23 17:12 ovflowd

While I doubt this will ever get investigated, it's probably worth keeping open until NGINX is no longer in the path of any request? Once NGINX isn't serving anything, we can probably close this and a few other issues.

MattIPv4 avatar Dec 04 '23 14:12 MattIPv4