
Client hangs when gevent worker uses long-running threads

Open sevein opened this issue 6 years ago • 3 comments

I'm troubleshooting a deployment of a small Django app that runs on Python 2.7 and Gunicorn v19.9.0 (config here) with a single gevent worker. The problem described below is not reproducible when I use the sync worker class.

The application implements background jobs within the worker - a compromise until we can afford a better solution. It has a request handler that runs certain long-running tasks (namely, tools executed via subprocess that can take several minutes to complete) - the handler basically creates a new thread via threading.Thread and returns the response as soon as possible. FWIW, when the worker is initialized the app also starts a separate watcher thread that loops over all our background jobs (using time.sleep to yield control) and maintains a list of the Thread instances created by the application. My understanding is that the standard library is monkey-patched, so these threads end up cooperating through gevent.subprocess, gevent.sleep, etc., always via the patched standard library. The app does not patch anything explicitly; it assumes that workers/ggevent.py does that for us.
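For reference, here is a minimal sketch of the pattern described above, assuming gevent has already monkey-patched the stdlib for us (as gunicorn's workers/ggevent.py does). The names (`run_long_task`, `JOBS`, `start_watcher`, `async_location_view`, the tool being executed) are hypothetical, not the actual app's:

```python
import subprocess
import threading
import time

from django.http import HttpResponse

JOBS = []  # Thread instances created by the handlers


def run_long_task():
    # Under the patched stdlib this goes through gevent.subprocess,
    # so it should yield to other greenlets while the tool runs.
    subprocess.call(["some-long-running-tool"])


def async_location_view(request, uuid):
    # Spawn the job and return as soon as possible.
    t = threading.Thread(target=run_long_task)
    t.daemon = True
    t.start()
    JOBS.append(t)
    return HttpResponse(status=202)


def start_watcher():
    # Called once when the worker is initialized.
    def watch():
        while True:
            JOBS[:] = [t for t in JOBS if t.is_alive()]
            time.sleep(5)  # yields control via gevent's patched time.sleep
    threading.Thread(target=watch).start()
```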

This setup has been working in most cases. However, in one deployment where Gunicorn sits behind Nginx, requests to the endpoint described above hang until the HTTP client's read timeout is exceeded. Gunicorn does not log anything suspicious:

[2018-10-26 07:24:05 +0000] [16129] [DEBUG] POST /api/v2/location/a2cb908c-6642-4083-b1f6-58e7851ffd6b/async/
[2018-10-26 07:24:05 +0000] [16129] [DEBUG] Closing connection.

I can reproduce it using just cURL:

$ curl -v --request POST --header "Authorization: ApiKey test:test" --header "Content-Type: application/json" --data @request.json \
    http://127.0.0.1:8000/api/v2/location/a2cb908c-6642-4083-b1f6-58e7851ffd6b/async/

* Connected to 127.0.0.1 (127.0.0.1) port 8000 (#0)
> POST /api/v2/location/a2cb908c-6642-4083-b1f6-58e7851ffd6b/async/ HTTP/1.1
> Host: 127.0.0.1:8000
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Type: application/json
> Authorization: ApiKey test:test
> Content-Length: 343
> 
* upload completely sent off: 343 out of 343 bytes
< HTTP/1.1 202 ACCEPTED
< Server: nginx
< Date: Fri, 26 Oct 2018 02:00:45 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Vary: Accept, Accept-Language, Cookie
< X-Frame-Options: SAMEORIGIN
< Location: http://127.0.0.1:8000/api/v2/async/57/
< Content-Language: en
< 
[******hangs********]

cURL hangs until Nginx's proxy_read_timeout is exceeded, at which point it exits with the following message: curl: (18) transfer closed with outstanding read data remaining.

If Nginx's proxy_buffering is enabled, cURL does not receive the response until the background job in my application completes!
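For context, the proxy settings involved look roughly like this in that deployment (a sketch, not the exact config; the upstream address and the values shown are assumptions/Nginx defaults):

```nginx
location / {
    proxy_pass http://127.0.0.1:8000;

    # Nginx defaults unless overridden:
    # proxy_http_version 1.0;   (default HTTP version used for the upstream)
    proxy_read_timeout 60s;     # cURL's (18) error above appears when this expires
    proxy_buffering on;         # with this on, I don't get the response until the
                                # background job completes (see above)
}
```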

However, I can't reproduce the problem when I connect cURL to Gunicorn directly - the response from the application is received immediately and cURL exits, even though I know that the thread created by the handler is still running.

So I'm realizing that the threading solution the Django app implements may not be suitable, but I'd like to understand why; I don't fully get what's going on. I've confirmed that the problem is not reproducible when the threads are not created by the worker, or when they complete quickly.

Thank you in advance!

sevein avatar Oct 26 '18 15:10 sevein

The problem disappeared when proxy_http_version 1.1 was set in Nginx.
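For anyone hitting the same thing, this is roughly the change (a sketch of the relevant location block, not my exact config). My guess, not verified, is that over HTTP/1.0 Nginx relies on the upstream connection closing to detect the end of the response, and that close is what gets delayed until the thread completes:

```nginx
location / {
    proxy_pass http://127.0.0.1:8000;
    proxy_http_version 1.1;   # Nginx's default for upstream connections is 1.0
}
```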

I looked at the traffic between Nginx and Gunicorn with Wireshark. I didn't find anything interesting, other than confirming that the response is sent right away, but the final frame with no data is not sent until a while later:

[screenshot: Wireshark capture of the Nginx-Gunicorn traffic]

That delay is about the same amount of time it took the application thread to complete.

Thanks for reading!

sevein avatar Oct 30 '18 02:10 sevein

Better to run long-running jobs outside of the web server.

ifduyue avatar Nov 05 '18 09:11 ifduyue

The problem is still not fixed!!!

psc0606 avatar Jun 01 '22 02:06 psc0606