gotham
"Too many open files"
With Slowloris stress I'm getting:
thread 'gotham-worker-3' panicked at 'socket error = Os { code: 24, kind: Other, message: "Too many open files" }', gotham/src/lib.rs:128:22
With ab, httperf and siege the benchmarks are fantastic, congrats. Thanks
@CogumelosMaravilha this might be the file descriptor limit on your machine. What does ulimit -n give you?
In a previous experiment of mine, where I created a simple web server on top of hyper, I studied this issue quite a bit. The problem is that if someone connects and requests a file larger than a few TCP chunks, but never reads the response while leaving the socket open, they will hold a file descriptor on your end indefinitely. While stress testing my server in this way, I found that this makes the server completely unresponsive once all of its file descriptors are used up.
Increasing the file descriptor limits only postpones this problem. You have to actively close connections if you run out of file descriptors, and you cannot have infinitely many.
Note that if your server socket returns this error, the incoming connection is not removed from the stream of incoming connections. This means that if you immediately retry after receiving the error, you will have a loop using 100% CPU attempting to retrieve the same connection again and again.
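To make the retry problem concrete, here is a minimal sketch of how an accept loop could classify the error and decide to pause instead of retrying immediately. It uses only the standard library. The function names are hypothetical, the errno values assume Linux, and the 120 ms pause simply mirrors the sleep mentioned in this comment.

```rust
use std::io;
use std::time::Duration;

// Linux errno values, hard-coded here as an assumption (the libc crate
// exposes the same constants portably).
const ENFILE: i32 = 23; // system-wide file table is full
const EMFILE: i32 = 24; // per-process descriptor limit reached

/// True when an accept error means we are out of file descriptors.
pub fn is_fd_exhaustion(err: &io::Error) -> bool {
    matches!(err.raw_os_error(), Some(EMFILE) | Some(ENFILE))
}

/// How long the accept loop should pause before calling accept again.
/// The pending connection stays queued, so retrying at once would spin.
pub fn accept_backoff(err: &io::Error) -> Option<Duration> {
    if is_fd_exhaustion(err) {
        Some(Duration::from_millis(120))
    } else {
        None
    }
}
```

An accept loop would call accept_backoff on every error and sleep for the returned duration before asking for the next connection.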
The way I fixed this was the following:
- Wrap the connection future in something that can be killed when file descriptors are used up. I used this code to implement the wrapper in my project.
- When you receive the "Too many open files" error, call kill on the KillSwitch from the wrapper above.
- After calling kill, sleep for 120 ms before fetching the next new connection.
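As a rough illustration of the kill switch idea in the steps above, here is a std-only sketch. It is not the wrapper code linked above: the KillHandle type and the choice to swap in a fresh flag after each kill are assumptions made for this example, so that connections accepted after the kill are unaffected.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Held by the accept loop. Each kill marks the current generation of
/// connections and starts a new generation with a fresh flag.
pub struct KillSwitch {
    current: Mutex<Arc<AtomicBool>>,
}

/// Held by one connection wrapper (hypothetical type for this sketch).
pub struct KillHandle {
    killed: Arc<AtomicBool>,
}

impl KillSwitch {
    pub fn new() -> Self {
        KillSwitch { current: Mutex::new(Arc::new(AtomicBool::new(false))) }
    }

    /// Hand out a handle tied to the current generation.
    pub fn handle(&self) -> KillHandle {
        let current = self.current.lock().unwrap();
        KillHandle { killed: Arc::clone(&*current) }
    }

    /// Mark every connection created so far, then start a new generation.
    pub fn kill(&self) {
        let mut current = self.current.lock().unwrap();
        current.store(true, Ordering::SeqCst);
        *current = Arc::new(AtomicBool::new(false));
    }
}

impl KillHandle {
    pub fn is_killed(&self) -> bool {
        self.killed.load(Ordering::SeqCst)
    }
}
```

The accept loop would keep the KillSwitch, give each new connection a handle, and call kill when it sees the "Too many open files" error.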
An important thing to notice here is that with my implementation, kill doesn't immediately kill the threads. Instead, the graceful_shutdown method is called on the connection (this just disables keep-alive for the connection), and then a tokio timer is used to allow the connection 100 ms to finish.
The importance of this timeout is that regular users won't have their connections suddenly killed, avoiding any inconvenience beyond the extra 120 ms sleep in the incoming connections loop.
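The grace period can be pictured as a small per-connection state machine. The sketch below is an assumption about how one might model it with plain std types. The Shutdown enum is hypothetical, and a real wrapper would also call graceful_shutdown on the connection at the point where begin is called.

```rust
use std::time::{Duration, Instant};

/// Per-connection shutdown state (hypothetical type for illustration).
pub enum Shutdown {
    Running,
    Draining { deadline: Instant },
}

impl Shutdown {
    /// Called when the kill switch fires. Starts the grace period once;
    /// repeated calls do not extend an existing deadline.
    pub fn begin(&mut self, now: Instant, grace: Duration) {
        if let Shutdown::Running = self {
            *self = Shutdown::Draining { deadline: now + grace };
        }
    }

    /// True once the grace period has expired and the connection
    /// should be dropped even if it has not finished.
    pub fn must_drop(&self, now: Instant) -> bool {
        match self {
            Shutdown::Running => false,
            Shutdown::Draining { deadline } => now >= *deadline,
        }
    }
}
```

A request that completes inside the grace window finishes normally, so only connections that stay open past the deadline are cut off.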
A few comments:
The code currently includes a tokio Interval, which is active even when the kill switch isn't activated. This is necessary because the problematic threads have no activity going on, meaning the future won't be polled and thus won't notice the kill switch. The Interval fixes this by adding another source of polls.
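To show why the extra polls matter, here is a std-only sketch of a wrapper future that only looks at the kill flag inside poll. The Killable type and the poll_once helper are hypothetical. poll_once stands in for whatever would wake the task, such as a timer tick, and tokio itself is left out so the example stays self-contained.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Wraps a connection future; resolves to None once the flag is set.
pub struct Killable<F> {
    inner: Pin<Box<F>>,
    killed: Arc<AtomicBool>,
}

impl<F: Future> Killable<F> {
    pub fn new(inner: F, killed: Arc<AtomicBool>) -> Self {
        Killable { inner: Box::pin(inner), killed }
    }
}

impl<F: Future> Future for Killable<F> {
    type Output = Option<F::Output>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // The flag is only checked here, so a connection that is never
        // polled never notices it. A periodic timer supplies those polls.
        if self.killed.load(Ordering::SeqCst) {
            return Poll::Ready(None);
        }
        self.inner.as_mut().poll(cx).map(Some)
    }
}

/// Waker that does nothing; enough for polling by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls a future once by hand, standing in for a timer-driven wakeup.
pub fn poll_once<F: Future + Unpin>(fut: &mut F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    Pin::new(fut).poll(&mut cx)
}
```

Setting the flag changes nothing until the next poll happens, which is the gap the Interval closes for idle connections.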
You could probably enhance the code to only kill threads which have no activity, meaning that long but active downloads won't be killed. This snippet just has a timeout of 1000 seconds on all connections.
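One possible shape for that enhancement is to record when each connection last made progress and only select the idle ones. The sketch below assumes a hypothetical ConnInfo record that the wrapper would update on every read or write. None of this is in the linked snippet.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-connection bookkeeping, updated on each read/write.
pub struct ConnInfo {
    pub id: u64,
    pub last_activity: Instant,
}

/// Returns the ids of connections that have been idle for at least
/// `max_idle`, leaving long but active downloads alone.
pub fn idle_connections(conns: &[ConnInfo], now: Instant, max_idle: Duration) -> Vec<u64> {
    conns
        .iter()
        .filter(|c| now.saturating_duration_since(c.last_activity) >= max_idle)
        .map(|c| c.id)
        .collect()
}
```

When file descriptors run out, the accept loop could kill only the connections returned here instead of all of them.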