Passenger should not process requests from clients that have disconnected before the request is routed to an application process

Open felixbuenemann opened this issue 9 years ago • 20 comments

Passenger 5.0.4 stays blocked for minutes after a request storm while the log fills up with:

Disconnecting client with error: client socket write error: Broken pipe (errno=32)

This can easily be exploited for a denial of service:

# baseline response time (simulated slow endpoint)
curl -sw ' time_total: %{time_total}s\n' localhost:3000
YdxtDssIUc9zMl6UeGMncA== time_total: 10,007s

# hit passenger with lots of requests that time out
wrk -c100 -d10s http://localhost:3000/
Running 10s test @ http://localhost:3000/
  2 threads and 100 connections
  1 requests in 10.02s, 286.00B read
  Socket errors: connect 0, read 0, write 0, timeout 374

# now try another request
curl -sw ' time_total: %{time_total}s\n' localhost:3000
7ypBhdFvn1Wpwo6Q6mtaOA== time_total: 992,797s

By looking at the passenger log, it is evident that passenger is blocked processing all the requests made by clients that are no longer connected:

passenger start -e production --min-instances 1 --max-pool-size 1 --engine nginx
=============== Phusion Passenger Standalone web server started ===============
PID file: /home/felix/rack-hello/passenger.3000.pid
Log file: /home/felix/rack-hello/passenger.3000.log
Environment: production
Accessible via: http://0.0.0.0:3000/

You can stop Phusion Passenger Standalone by pressing Ctrl-C.
Problems? Check https://www.phusionpassenger.com/documentation/Users%20guide%20Standalone.html#troubleshooting
===============================================================================
App 62016 stdout:
App 62032 stdout:
[ 2015-03-13 10:10:16.8570 61985/7fa8b06c0700 Ser/Server.h:894 ]: [Client 1-1] Disconnecting client with error: client socket write error: Broken pipe (errno=32)
[ 2015-03-13 10:11:24.1558 61985/7fa8b06c0700 Ser/Server.h:894 ]: [Client 1-5] Disconnecting client with error: client socket write error: Broken pipe (errno=32)
[ 2015-03-13 10:11:34.1575 61985/7fa8b06c0700 Ser/Server.h:894 ]: [Client 1-6] Disconnecting client with error: client socket write error: Broken pipe (errno=32)
[ 2015-03-13 10:11:44.1600 61985/7fa8b06c0700 Ser/Server.h:894 ]: [Client 1-7] Disconnecting client with error: client socket write error: Broken pipe (errno=32)
…
[ 2015-03-13 10:27:34.4088 61985/7fa8b06c0700 Ser/Server.h:894 ]: [Client 1-101] Disconnecting client with error: client socket write error: Broken pipe (errno=32)
[ 2015-03-13 10:27:44.4121 61985/7fa8b06c0700 Ser/Server.h:894 ]: [Client 1-100] Disconnecting client with error: client socket write error: Broken pipe (errno=32)
[ 2015-03-13 10:28:04.4202 61985/7fa8b06c0700 Ser/Server.h:894 ]: [Client 1-105] Disconnecting client with error: client socket write error: Broken pipe (errno=32)

The problem here is that passenger only notices that a client has gone away after the app has finished processing the response, even though the client may have disconnected before the request ever left the request queue.


Todo list:

  • [ ] Implement request queuing timeout
  • [ ] Rename max_request_time to max_response_time
  • [ ] Documentation

felixbuenemann avatar Mar 13 '15 17:03 felixbuenemann

For what it's worth, thin, unicorn and puma share similar issues, while webrick is not affected, probably because it lacks a request queue.

@FooBarWidget Do you consider this a bug, or is it expected behavior?

felixbuenemann avatar Mar 13 '15 18:03 felixbuenemann

Note: The problem is exaggerated by choosing a 10-second response time. For an average Rails app the response time without caching is usually in the range of 100–400 ms, so passenger should recover much faster. The only real protection against such an attack is probably to employ request limiting in a reverse proxy like nginx or haproxy, which can throttle individual misbehaving clients.

felixbuenemann avatar Mar 13 '15 19:03 felixbuenemann

I would call this neither a bug, nor expected behavior. I'm not really sure how to categorize this. It's definitely unwanted, but it's not caused by anything that's Passenger's fault.

What's happening here is that wrk creates a lot of connections. For each connection, Passenger waits until there is a free application process, then forwards the request there.

But the only way to detect whether a client has disconnected is by reading from or writing to its socket. When Passenger has read the HTTP header and its corresponding body, it stops reading from the socket because any further data may belong to the next pipelined request. Similarly, there's not much Passenger can send to the client until an application process has become available for the request and the application has generated a response.

The Unicorn author implemented a hack for this problem. There was an exchange on the Unicorn mailing list between Shopify and the Unicorn author, which documents the problem pretty well. Unicorn writes the string 'HTTP' to the socket before request processing actually begins. Then, when writing the response, it simply omits the 'HTTP' part of the response because it has already been written. Sometimes, writing 'HTTP' early on allows it to detect a disconnection. But as you have observed, it doesn't always work.
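
For illustration, here is a minimal C sketch of that early-write trick (a simplification under stated assumptions, not Unicorn's or Passenger's actual code; MSG_NOSIGNAL is Linux-specific):

    #include <errno.h>
    #include <stdbool.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    /* Probe the client by writing the first bytes of the status line early.
     * The real response writer must later omit this "HTTP" prefix. */
    static bool client_looks_alive(int fd)
    {
        /* A single small write often succeeds even when the peer is gone,
         * because it merely lands in the kernel send buffer. That is why
         * this trick only catches some disconnections. */
        ssize_t n = send(fd, "HTTP", 4, MSG_NOSIGNAL);
        if (n == 4)
            return true;
        return !(errno == EPIPE || errno == ECONNRESET);
    }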

FooBarWidget avatar Mar 13 '15 19:03 FooBarWidget

Hmm, this seems like a hard problem to solve. Maybe it can be tackled from another angle.

If I understand correctly, passenger uses a request queue with a default depth of 100 entries. If that queue fills up, it starts throwing 503 errors.

How about also setting a maximum queue time (as opposed to a maximum request time) and throwing a 504 to clients that have been enqueued longer than that time?

I'm not familiar with passenger enterprise, which might already implement that.

felixbuenemann avatar Mar 13 '15 21:03 felixbuenemann

Passenger Enterprise indeed already implements that, through the passenger_max_request_time option.
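
For reference, a minimal, hedged nginx sketch of how that option is typically used (Enterprise only; the server block and the 10-second value are placeholders, not a recommendation):

    server {
        listen 80;
        server_name example.com;         # placeholder
        root /var/www/myapp/public;      # placeholder app path
        passenger_enabled on;

        # Abort requests that take longer than 10 seconds to process
        # (value in seconds; illustrative only).
        passenger_max_request_time 10;
    }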

FooBarWidget avatar Mar 14 '15 06:03 FooBarWidget

From the description of passenger_max_request_time, it seems that it only applies to requests that are currently being processed by the app, not to queued requests.

felixbuenemann avatar Mar 14 '15 08:03 felixbuenemann

Hm, you are right. The passenger_max_request_time timer starts after a process has been selected for routing, not before.

Now that I think about it, 'max request time' may not be the right terminology. The current implementation is more like a 'max response time'.

We should have another timeout that also applies to the queuing.

FooBarWidget avatar Mar 16 '15 08:03 FooBarWidget

We've recently had a discussion with an Enterprise customer about a similar problem. They are on Heroku, and whenever they deploy a new version, Heroku would spin up an instance, start Passenger and immediately direct traffic to that instance. This customer gets so much traffic, and spawning their app takes so long, that during the first minute or so their request queue would grow beyond its configured size.

Assuming that their clients disconnect early because of the long response times, being able to check for a disconnection would partially solve their problem.

On a side note: I suppose that, while the first process is being spawned, it is acceptable for the queue to grow over its configured size. But special casing this is not easy.

FooBarWidget avatar Apr 10 '15 21:04 FooBarWidget

I've been researching the poll() system call. It might be possible to use POLLHUP to detect early disconnections. On OS X, POLLHUP seems to do what I want. On Linux, POLLHUP doesn't, but POLLRDHUP does.

http://www.greenend.org.uk/rjk/tech/poll.html describes the behaviors on various operating systems, although it is wrong about Linux behavior.
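
For illustration, a minimal sketch of that approach in C (assuming Linux and POLLRDHUP; this is not Passenger's actual implementation):

    #define _GNU_SOURCE   /* needed for POLLRDHUP on glibc */
    #include <poll.h>
    #include <stdbool.h>

    /* Non-blocking probe: has the peer half-closed or reset the connection? */
    static bool client_has_disconnected(int fd)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLRDHUP };
        int n = poll(&pfd, 1, 0);   /* 0 ms timeout, so this never blocks */
        if (n <= 0)
            return false;           /* no events (or poll error): assume still connected */
        return (pfd.revents & (POLLRDHUP | POLLHUP | POLLERR)) != 0;
    }

One caveat: POLLRDHUP only reports that the peer has shut down its write side; a client that merely stops reading cannot be distinguished this way.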

FooBarWidget avatar Apr 10 '15 21:04 FooBarWidget

We have received a vote from an Enterprise customer for this feature. His use case is that they're getting traffic too quickly, so requests are piling up in the queue. But his concern is not so much that clients abort the connection (causing Passenger to do work for clients that are already gone), but rather that he doesn't want clients to wait too long in the queue. I've split his issue into a separate ticket: #1688

FooBarWidget avatar Dec 01 '15 09:12 FooBarWidget

Some progress has been made towards this today. The goal is to be able to detect early client disconnections by 5.0.26 so that we can forward half-closing events to Node.js apps. It won't actually abort anything until later versions though.

FooBarWidget avatar Feb 26 '16 17:02 FooBarWidget

@FooBarWidget Will this fix be for Enterprise only?

thomasdavis avatar Nov 22 '16 08:11 thomasdavis

The change described by @FooBarWidget is in OSS as well as Enterprise.

CamJN avatar Aug 03 '17 13:08 CamJN

In my app I have some data collection paths like /api/log/client_jserr/ where I'd prefer to get all requests, even after the client has given up waiting, so I'd like an option that makes my app receive the request with some annotation that the client is gone. In case poll didn't fully solve the detection problem, at least for the subset of requests that arrive over a TCP connection, we could check how long they have been waiting, and if it's very long, spend some extra CPU cycles to ask the network stack about its opinion on the associated TCP connection.
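
For what it's worth, a hedged sketch of that "ask the network stack" idea on Linux (illustrative only, not something Passenger currently does), using the TCP_INFO socket option:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdbool.h>
    #include <sys/socket.h>

    /* Ask the kernel for the TCP state of a long-queued request's socket. */
    static bool peer_probably_gone(int fd)
    {
        struct tcp_info info;
        socklen_t len = sizeof(info);
        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) != 0)
            return false;   /* cannot tell (e.g. not a TCP socket): assume alive */
        return info.tcpi_state == TCP_CLOSE_WAIT   /* peer already sent FIN */
            || info.tcpi_state == TCP_CLOSE;
    }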

mk-pmb avatar Aug 17 '17 18:08 mk-pmb

@mk-pmb thanks for the input.

CamJN avatar Aug 18 '17 09:08 CamJN

What is the best workaround for this issue? I'm facing it on Nginx + Phusion Passenger 6.0.4. All instances are stuck for hours, all with the same error (from passenger-status --show=server): The client connection is closed before the request is done processing (errno=-1020)

izenk avatar Apr 08 '20 22:04 izenk

What is the best workaround for this issue? I'm facing it on Nginx + Phusion Passenger 6.0.4. All instances are stuck for hours, all with the same error (from passenger-status --show=server): The client connection is closed before the request is done processing (errno=-1020)

Based on the documentation, there is a "Passenger Enterprise only" configuration option to work around this issue in your application, called passenger_max_request_time (https://www.phusionpassenger.com/docs/references/config_reference/nginx/#passenger_max_request_time).

If you are on the free version, good luck. Apparently there is no workaround for this; you have to fix your code.

henriquekraemer avatar May 11 '20 13:05 henriquekraemer

Our workaround was to play with the nginx/Passenger options proxy_ignore_client_abort and passenger_ignore_client_abort, setting them to on.
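
For illustration, a hedged sketch of where those directives might go in an nginx vhost (placement only; both are valid at the http, server, or location level):

    server {
        # ... existing Passenger vhost configuration ...

        # Passenger keeps reading (and discarding) the app's response after a
        # client abort instead of failing with "Broken pipe" (EPIPE):
        passenger_ignore_client_abort on;

        # nginx keeps the upstream connection open for proxied locations even
        # if the client disconnects before the response is ready:
        proxy_ignore_client_abort on;
    }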

izenk avatar May 11 '20 13:05 izenk

Our workaround was to play with the nginx/Passenger options proxy_ignore_client_abort and passenger_ignore_client_abort, setting them to on.

Hmmm... But with these configs set to "on", the request will still be processed by passenger, right?

passenger_ignore_client_abort

  • If this option is turned on then, upon a client abort, Passenger will continue to read the application process's response while discarding all the read data. This prevents EPIPE errors but it will also mean the application process will be unavailable for new requests until it is done sending its response.

https://www.phusionpassenger.com/docs/references/config_reference/nginx/#passenger_ignore_client_abort

proxy_ignore_client_abort

  • Determines whether the connection with a proxied server should be closed when a client closes the connection without waiting for a response.

http://nginx.org/en/docs/http/ngx_http_proxy_module.html

I ended up here because today one of our API routes got stuck on the database (CPU usage went to 100% on this specific route). I was trying to find a solution to minimize the impact when this happens again.

We are running a multi-tenant configuration with the apartment gem, and the database is kind of our "weak spot".

henriquekraemer avatar May 11 '20 14:05 henriquekraemer

Yes, the request will still be processed, but our problem was that passenger processes got stuck with the error "Disconnecting client with error: client socket write error: Broken pipe (errno=32)" and no new requests were served. Now "aborted" requests are still processed, but their output is silently discarded (if I understand the logic of these options correctly). As I understand it, your problem is quite different: you need some way to refuse such "aborted" requests entirely, to reduce the load on the database.

izenk avatar May 11 '20 14:05 izenk