requests.get inside gevent.spawn is getting stuck
- gevent version: 24.2.1, installed using pip
- Python version: CPython 3.11.4, downloaded from https://github.com/python/cpython/releases/tag/v3.11.4
- Operating System: Debian 11, kernel version 5.10.0-29-amd64
- requests version: 2.32.3, installed using pip
Description:
Recently we have been seeing that requests.get inside a gevent.spawn waits for a lock that it is unable to acquire and gets stuck forever. After this, the thread is unable to process anything. Since this happens on the main thread, our main app goes into a stuck state and cannot process anything else.
Below is the stack trace that I got after running py-bt from gdb
<built-in method __enter__ of _thread.lock object at remote 0x7f3350025540>
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/_threading.py", line 188, in put
with self._mutex:
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/threadpool.py", line 545, in spawn
task_queue.put((func, args, kwargs, thread_result))
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/pool.py", line 161, in apply
return self.spawn(func, *args, **kwds).get()
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/resolver/thread.py", line 63, in getaddrinfo
return self.pool.apply(_socket.getaddrinfo, args, kwargs)
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/_socketcommon.py", line 225, in getaddrinfo
addrlist = get_hub().resolver.getaddrinfo(host, port, family, type, proto, flags)
File "/usr/local/rackman/lib/python3.11/site-packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/local/rackman/lib/python3.11/site-packages/urllib3/connection.py", line 196, in _new_conn
sock = connection.create_connection(
File "/usr/local/rackman/lib/python3.11/site-packages/urllib3/connection.py", line 615, in connect
self.sock = sock = self._new_conn()
File "/usr/local/rackman/lib/python3.11/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
conn.connect()
File "/usr/local/rackman/lib/python3.11/site-packages/urllib3/connectionpool.py", line 466, in _make_request
self._validate_conn(conn)
File "/usr/local/rackman/lib/python3.11/site-packages/urllib3/connectionpool.py", line 789, in urlopen
response = self._make_request(
File "/usr/local/rackman/lib/python3.11/site-packages/requests/adapters.py", line 667, in send
resp = conn.urlopen(
File "/usr/local/rackman/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/usr/local/rackman/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/rackman/lib/python3.11/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/rackman/lib/python3.11/site-packages/requests/api.py", line 73, in get
return request("get", url, params=params, **kwargs)
File "/usr/local/rackman/app/cumulus_switch_manager.py", line 26, in fetch_cumulus_api
response = requests.get(url,
File "/usr/local/rackman/app/cumulus_switch_manager.py", line 42, in get_bgp_data
bgp_data = self.fetch_cumulus_api('vrf/default/router/bgp/neighbor')
File "/usr/local/rackman/app/cumulus_switch_manager.py", line 63, in generate_heartbeat_data
bgp_data = self.get_bgp_data()
File "/usr/local/rackman/app/cumulus_switch_manager.py", line 101, in send_heartbeat
heartbeat_data = self.generate_heartbeat_data()
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/hub.py", line 647, in run
loop.run()
What I've run:
import gevent
import requests

network_devices = []  # added here so the simplified snippet runs on its own

def schedule_task(delay, func):
    gevent.spawn(func)
    gevent.spawn_later(delay, schedule_task, delay, func)

def send_heartbeat():
    response = requests.get('https://{ip}:{port}/some-path')

def send_network_device_heartbeat():
    for network_device in network_devices:
        gevent.spawn(send_heartbeat)

def update_network_devices():
    global network_devices
    new_network_devices = []
    # populate new_network_devices list ...
    network_devices = new_network_devices

schedule_task(60, send_network_device_heartbeat)
schedule_task(900, update_network_devices)
Any help on why this is getting stuck, or how to work around it, would be really appreciated. We have seen that if we replace gevent.spawn with Python threads, this issue doesn't occur.
I have the same problem
me too
Based on that backtrace, I theorize that you're filling up gevent's thread pool because you're spawning way too many greenlets, all of which are, essentially, waiting on each other. If so, the quick solution is to use a non-threaded DNS resolver implementation. (You could also try increasing the number of threads that the threadpool will use, which defaults to 10, but that's just a bandaid that doesn't directly address the symptom or the cause.) I'll suggest what I think is a more proper solution below.
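Before that, here is a minimal sketch of both quick options, assuming the documented gevent.config settings (resolver and threadpool_size; verify the names against your gevent version). They must take effect before gevent creates its hub and resolver, so do this at the very top of the entry point:

# Not part of the original report: a sketch of switching to a non-threaded DNS
# resolver and/or enlarging the threadpool, using gevent's config object.
import gevent
import gevent.monkey

# Option 1: use the c-ares resolver so getaddrinfo no longer needs the
# threadpool at all. Equivalent to exporting GEVENT_RESOLVER=ares.
gevent.config.resolver = 'ares'

# Option 2 (the bandaid): raise the threadpool size from its default of 10.
# Equivalent to exporting GEVENT_THREADPOOL_SIZE=20.
gevent.config.threadpool_size = 20

gevent.monkey.patch_all()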
The underlying problem comes from this code:
def schedule_task(delay, func):
    gevent.spawn(func)
    gevent.spawn_later(delay, schedule_task, delay, func)
Each time that is called, it immediately creates two greenlets, one which will be switched to the next time the event loop runs around and gets to invoking callbacks, and one which will be switched to after an event loop timer completes.
Note that timers effectively have higher priority than callbacks: timer code is always run on the iteration of the event loop in which it expired, but callbacks may have to wait longer than that, depending on the value of getswitchinterval() and the contents of the callback queue. Thus, spawning the delayed greenlet --- which itself spawns two more! --- is a higher priority than servicing an existing greenlet that may have results ready. (This suggests you can increase the switch interval, and that might be another bandaid, but it also doesn't solve the underlying issue.)
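If you do want to try that particular bandaid, the switch interval referred to above is (as I understand it) the standard interpreter setting, so a tiny sketch would be:

import sys

print(sys.getswitchinterval())  # CPython's default is 0.005 seconds
sys.setswitchinterval(0.05)     # give queued callbacks a larger slice per loop iteration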
If everything is working exactly perfectly, these greenlets will not build up; each one will finish before the next two are created. But what if things are going more slowly than anticipated? That is, what if executing func takes longer than delay? This could be due to an overloaded event loop, or network latency/timeouts, or blocking (non-cooperative) code, or... Let's walk through that scenario.
At time 0, two greenlets are created, call them "func greenlet 1" and "schedule greenlet 1". At the next iteration of the event loop, "func greenlet 1" begins running and continues to run until it hits a function using the gevent API; in this case, that's getaddrinfo, and that call has occupied one of the 10 available slots in the threadpool. At that time, some other greenlet --- possibly the hub --- is switched to.
At time 0 + delay, the timer expires and "schedule greenlet 1" runs, which now creates "func greenlet 2" and "schedule greenlet 2". At this time, there are three greenlets, one of which is "stuck" waiting on the results of getaddrinfo.
At the next iteration of the event loop, "func greenlet 2" runs, queues up a threadpool call for getaddrinfo and switches out. At time 0 + 2 x delay, "schedule greenlet 2" gets to run. On the next trip through the event loop, there are now "func greenlet 1" (waiting on getaddrinfo), "func greenlet 2" (ditto) and "func greenlet 3" (yup, ditto), plus the new "schedule greenlet 3".
Repeat this more than 10 times (or less if you're using the threadpool for other long-running tasks), and you'll find your threadpool occupied and unable to accept new tasks.
Now, the getaddrinfo calls, which were running on separate threads, should ultimately finish and release their slots to allow the main thread to submit new work, but apparently that's not happening. The reasons why are unclear to me without seeing what other threads are doing. I suggest using something like py-spy dump to get a complete list of what all threads are doing.
Ok, having thought through that exercise, I would rewrite schedule_task something like the following (untested, but I think the basic idea is correct). In this way, you never risk building up greenlets all trying to do the same thing, and you also don't flood the target (if a bunch of "func greenlet X" finally get their callbacks run at the same time), no matter how the event loop is being used.
import time

import gevent

def schedule_task(delay, func):
    # We want to run *func* every *delay* seconds, and it would be good to
    # do this as close to that time limit as possible. We compensate for the
    # time it takes to run in order to do this.
    def run_it():
        while True:
            begin_time = time.time()
            next_desired_run_time = begin_time + delay
            func()  # Add exception handling as desired.
            sleep_duration = next_desired_run_time - time.time()
            if sleep_duration < 0:
                # Uh-oh! We ran longer than *delay* seconds. Depending on our
                # goals, we could run again immediately, or we could go back to
                # deferring for *delay* seconds. Here, we're choosing the former
                # approach, but allowing gevent the chance to switch to other
                # greenlets first, but only if there is something immediately
                # runnable. Choosing a small, non-zero delay would instead force
                # a trip around the event loop, which can be kinder to other
                # greenlets doing IO.
                gevent.sleep(0)
            else:
                gevent.sleep(sleep_duration)
    return gevent.spawn(run_it)
Alternately, you could make func responsible for scheduling its next execution as the very last thing it does before returning. It could/should employ the same delay adjustment as above.
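A minimal sketch of that alternative (hypothetical helper name, same delay compensation as above):

import time

import gevent

def schedule_self_rescheduling(delay, func):
    # Hypothetical variant: *func* arms its own next run as the very last
    # thing it does, so at most one of these greenlets exists at a time.
    def run_once():
        begin = time.monotonic()
        try:
            func()  # Add exception handling as desired.
        finally:
            remaining = delay - (time.monotonic() - begin)
            # If func overran its budget, run again immediately (delay of 0),
            # mirroring the choice made in the looping version above.
            gevent.spawn_later(max(remaining, 0), run_once)

    return gevent.spawn(run_once)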
Thanks @jamadden for looking at it! There are a few things I want to point out here:
- We are actually passing a timeout to the requests.get call (sorry, I didn't mention it in the snippet just to keep it short). Here is how it is used: requests.get(url, headers=headers, verify=False, timeout=45, auth=(username, password)). So I don't see how the next greenlet can overlap with the previous ones, since the next run is at time 60 while the current one would have ended by time 45.
- The number of threads was < 10 every time, which I checked by running ps -Tf -p <pid>. I have seen the thread count varying between 1, 2, and 3.
- We are directly passing the IP address here and not using a DNS name, so I don't understand why DNS resolution would be required.
- Also, in schedule_task the code actually uses gevent.spawn_later(0, func) instead of gevent.spawn (sorry again, I thought spawn_later with a 0 delay was the same as spawn).
- There are other places in the application where we use schedule_task in this recursive manner that don't use requests.get, but they don't have this problem. We only see this issue in the specific scenario where there is a requests.get call inside gevent.spawn.
I will change schedule_task as you proposed and check if the problem goes away.
We are actually passing a timeout to the requests.get call
Somewhat immaterial, as that doesn't provide a timeout for the entire operation, only the part where you're reading data from the server. Connecting and address resolution are separate.
If you want a timeout for the entire operation, use the functional or context manager versions of gevent.Timeout.
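A short sketch of the context manager form, with placeholder values (the outer budget covers address resolution and connecting as well as reading):

import gevent
import requests

url = 'https://192.0.2.10:8443/some-path'  # placeholder

try:
    with gevent.Timeout(50):                       # bounds the whole operation
        response = requests.get(url, timeout=45)   # requests' own per-phase timeout
except gevent.Timeout:
    response = None  # the overall budget was exceeded; handle as appropriate

The functional form, gevent.with_timeout(50, requests.get, url, timeout=45, timeout_value=None), behaves the same but returns timeout_value instead of raising.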
We are directly passing the ip address here and not using DNS name.
Also somewhat immaterial. The same code path is used, resulting in a call to getaddrinfo all the time, and that's the part dealing with the threadpool.
There are other places in the application we are using schedule_task in this recursive manner which don't use requests.get but they don't have this problem.
As I said, I can only theorize based on one single stacktrace of a multi-threaded process and a very small code snippet that doesn't produce any problems for me.
Oh, one thing I neglected to mention, and which can only compound the problem, is spawning the "spawn two greenlets" function for every item in network_devices. So instead of spawning two greenlets, it's spawning 3 * len(network_devices). How many of those do you have?
For further debugging, instead of just the Python thread stack traces (with, e.g., py-spy) you probably also want to get the stack traces of all waiting greenlets. gevent provides functions for this, and those are commonly arranged to be invoked by a trivial signal handler function like signal.signal(signal.SIGPROF, lambda *_a, **_kw: print_run_info()). The signal should interrupt the lock waits and still run (I think) if you use the non monkey-patched signal.signal call (either do so before patching or use monkey.get_original). Alternately, enable gevent's monitoring features; the blocking monitor may get you everything you need, and if not, you can arrange for it to periodically call one of the _run_info() functions.
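For the monitoring route, a small sketch using the documented config settings (set them before the hub starts; names are as I recall from the gevent docs, so verify against your version):

import gevent

gevent.config.monitor_thread = True    # same as GEVENT_MONITOR_THREAD_ENABLE=true
gevent.config.max_blocking_time = 1.0  # report greenlets that block the loop for >1s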
Just to clarify, your sample code isn't monkey-patching, but the provided stack trace appears to be from a monkey-patched system. I assume you are actually monkey-patching? And hopefully not setting any special options, just calling monkey.patch_all()? (Only patching some parts of the system isn't tested.)
So instead of spawning two greenlets, its spawning 3 * len(network_devices). How many of those do you have?
I thought it was 2 + len(network_devices): 2 from schedule_task and 1 each for the network_devices. On average, len(network_devices) is 4.
I assume you are actually monkey-patching?
Yes, it's fully monkey-patched with gevent.monkey.patch_all().
@jamadden this is the link to the print_run_info output from when this issue happens. I am seeing >10 ThreadPoolWorker entries in the run info. Could this issue be because of that, then?
This is the code that I had used
import signal

import gevent.util

def gevent_run_info(*_, **__):
    with open('/tmp/gevent-run-info.log', 'a') as fd:
        gevent.util.print_run_info(file=fd)

signal.signal(signal.SIGPROF, gevent_run_info)

import gevent.monkey
gevent.monkey.patch_all()

# rest of the code then ...
After that I got the run info with kill -s SIGPROF <process-id>
The concerning thing is that all of the worker threads are dead:
<ThreadPoolWorker at 0x7f4ff1fa1d80 thread_ident=0x7f4ff46a2700 threadpool-hub=<missing>>; not running
: Parent: <greenlet.greenlet object at 0x7f4ff1fa16c0 (otid=0x(nil)) (thread exited) dead>
Something is killing them that isn't getting caught and handled. Figuring out what's going on there will be important.
I'm sure you noticed that you have 28 greenlets that died because of a programming error (finished with exception UnboundLocalError("cannot access local variable 'version_data' where it is not associated with a value")). I don't see how that can be responsible for this problem, but it would be good to fix it. (And as always, good to be sure you stop referencing dead greenlets as soon as possible.)
There are some places where we raise gevent.GreenletExit to terminate a greenlet. Could that explain a ThreadPoolWorker going into a dead state?
@jamadden after fixing the UnboundLocalError, I have this new log link for gevent print_run_info from when the process went into a hung state. In this case the ThreadPoolWorker count is only 1, so it seems the pool is not maxing out. Do you see anything interesting in the logs?
In this case the signal handler was added after monkey patching (since there were some challenges in adding it before gevent monkey patching). This is the code:
import gevent.monkey
import gevent.util

def gevent_run_info(*_, **__):
    with open('/tmp/gevent-run-info.log', 'a') as fd:
        gevent.util.print_run_info(file=fd)

signal, SIGPROF = gevent.monkey.get_original('signal', ('signal', 'SIGPROF'))
signal(SIGPROF, gevent_run_info)
We're having a similar issue. One of our greenlets gets stuck, in this case when a connection is returned to the pool, and ends up freezing the whole application. This is the stack trace of the relevant greenlet when it's hung (application-specific trace removed):
File [our_app], line 120, in [some_func]
response = requests.get(
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/requests/api.py", line 73, in get
return request("get", url, params=params, **kwargs)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/requests/sessions.py", line 746, in send
r.content
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/requests/models.py", line 902, in content
self._content = b"".join(self.iter_content(CONTENT_CHUNK_SIZE)) or b""
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/requests/models.py", line 820, in generate
yield from self.raw.stream(chunk_size, decode_content=True)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/urllib3/response.py", line 1043, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/urllib3/response.py", line 935, in read
data = self._raw_read(amt)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/urllib3/response.py", line 861, in _raw_read
with self._error_catcher():
File "/nix/store/z3qj3yqyk4gl5ywsp0b2q3xi18v7scv8-python3-3.10.14/lib/python3.10/contextlib.py", line 142, in __exit__
next(self.gen)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/urllib3/response.py", line 788, in _error_catcher
self.release_conn()
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/urllib3/response.py", line 628, in release_conn
self._pool._put_conn(self._connection)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 313, in _put_conn
self.pool.put(conn, block=False)
File "/nix/store/z3qj3yqyk4gl5ywsp0b2q3xi18v7scv8-python3-3.10.14/lib/python3.10/queue.py", line 152, in put
self.not_empty.notify()
File "/nix/store/z3qj3yqyk4gl5ywsp0b2q3xi18v7scv8-python3-3.10.14/lib/python3.10/threading.py", line 369, in notify
if not self._is_owned():
File "/nix/store/z3qj3yqyk4gl5ywsp0b2q3xi18v7scv8-python3-3.10.14/lib/python3.10/threading.py", line 282, in _is_owned
if self._lock.acquire(False):
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/gevent/thread.py", line 132, in acquire
sleep()
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/gevent/hub.py", line 159, in sleep
waiter.get()
Same here; the last error log is below:
Traceback (most recent call last):
File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
conn = connection.create_connection(
File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/util/connection.py", line 95, in create_connection
raise err
File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
sock.connect(sa)
File "/data/python_env/web/lib64/python3.9/site-packages/gevent/_socketcommon.py", line 613, in connect
self._wait(self._write_event)
File "src/gevent/_hub_primitives.py", line 317, in gevent._gevent_c_hub_primitives.wait_on_socket
File "src/gevent/_hub_primitives.py", line 322, in gevent._gevent_c_hub_primitives.wait_on_socket
File "src/gevent/_hub_primitives.py", line 313, in gevent._gevent_c_hub_primitives._primitive_wait
File "src/gevent/_hub_primitives.py", line 314, in gevent._gevent_c_hub_primitives._primitive_wait
File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
File "src/gevent/_hub_primitives.py", line 55, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
File "src/gevent/_waiter.py", line 154, in gevent._gevent_c_waiter.Waiter.get
File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 715, in urlopen
httplib_response = self._make_request(
File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 404, in _make_request
self._validate_conn(conn)
File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
conn.connect()
File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/connection.py", line 363, in connect
self.sock = conn = self._new_conn()
File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/connection.py", line 179, in _new_conn
raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7fbff871c250>, 'Connection to {IP} timed out. (connect timeout=0.1)')
During handling of the above exception, another exception occurred:
And other info, always in loop wait
(gdb) py-bt
Traceback (most recent call first):
File "/data/python_env/web/lib64/python3.9/site-packages/gevent/hub.py", line 647, in run
loop.run()
I'm using pool.spawn now; is gevent.spawn_later better?
@jamadden I think the threadpool is maxing out, since I see the trace below for greenlets that get stuck at self._mutex. The count of such occurrences in the run_info output is 11, which is greater than the default threadpool size of 10.
File "/usr/local/rackman/lib/python3.11/site-packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/_socketcommon.py", line 221, in getaddrinfo
addrlist = get_hub().resolver.getaddrinfo(host, port, family, type, proto, flags)
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/resolver/thread.py", line 63, in getaddrinfo
return self.pool.apply(_socket.getaddrinfo, args, kwargs)
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/pool.py", line 161, in apply
return self.spawn(func, *args, **kwds).get()
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/threadpool.py", line 545, in spawn
task_queue.put((func, args, kwargs, thread_result))
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/_threading.py", line 188, in put
with self._mutex:
After changing the DNS resolver to c-ares (a non-threaded gevent DNS resolver), we are no longer seeing this issue, further solidifying my hypothesis.
We're having a similar issue. One of our greenlets gets stuck, in this case when a connection is returned to the pool, and ends up freezing the whole application.
I am having exactly the same issue as this one; could I ask if this has been solved?
: data = requests.post(url, headers=headers, params=params, json=body, timeout=0.5).json()
: File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 115, in post
: return request("post", url, data=data, json=json, **kwargs)
: File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 59, in request
: return session.request(method=method, url=url, **kwargs)
: File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
: resp = self.send(prep, **send_kwargs)
: File "/usr/local/lib/python3.10/site-packages/opentelemetry/instrumentation/requests/__init__.py", line 221, in instrumented_send
: result = wrapped_send(
: File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
: r = adapter.send(request, **kwargs)
: File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 633, in send
: conn = self.get_connection_with_tls_context(
: File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 489, in get_connection_with_tls_context
: conn = self.poolmanager.connection_from_host(
: File "/usr/local/lib/python3.10/site-packages/urllib3/poolmanager.py", line 303, in connection_from_host
: return self.connection_from_context(request_context)
: File "/usr/local/lib/python3.10/site-packages/urllib3/poolmanager.py", line 328, in connection_from_context
: return self.connection_from_pool_key(pool_key, request_context=request_context)
: File "/usr/local/lib/python3.10/site-packages/urllib3/poolmanager.py", line 351, in connection_from_pool_key
: pool = self._new_pool(scheme, host, port, request_context=request_context)
: File "/usr/local/lib/python3.10/site-packages/urllib3/poolmanager.py", line 265, in _new_pool
: return pool_cls(host, port, **request_context)
: File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 210, in __init__
: self.pool.put(None)
: File "/usr/local/lib/python3.10/queue.py", line 152, in put
: self.not_empty.notify()
: File "/usr/local/lib/python3.10/threading.py", line 369, in notify
: if not self._is_owned():
: File "/usr/local/lib/python3.10/threading.py", line 282, in _is_owned
: if self._lock.acquire(False):
: File "/usr/local/lib/python3.10/site-packages/gevent/thread.py", line 141, in acquire
: sleep()
: File "/usr/local/lib/python3.10/site-packages/gevent/hub.py", line 159, in sleep
: waiter.get()
Thanks.
@weimzh if I may ask, what URLs are you running the requests against? Does it always happen for the same domain, or is it more random?
Same problem; here is the stack trace:
Traceback (most recent call first):
<built-in method __enter__ of _thread.lock object at remote 0x7fcbd7f465c0>
File "/usr/local/lib/python3.12/site-packages/gevent/_threading.py", line 188, in put
with self._mutex:
File "/usr/local/lib/python3.12/site-packages/gevent/threadpool.py", line 545, in spawn
task_queue.put((func, args, kwargs, thread_result))
File "/usr/local/lib/python3.12/site-packages/gevent/pool.py", line 161, in apply
return self.spawn(func, *args, **kwds).get()
File "/usr/local/lib/python3.12/site-packages/gevent/resolver/thread.py", line 63, in getaddrinfo
return self.pool.apply(_socket.getaddrinfo, args, kwargs)
File "/usr/local/lib/python3.12/site-packages/gevent/_socketcommon.py", line 221, in getaddrinfo
addrlist = get_hub().resolver.getaddrinfo(host, port, family, type, proto, flags)
File "/usr/local/lib/python3.12/site-packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/local/lib/python3.12/site-packages/urllib3/connection.py", line 198, in _new_conn
sock = connection.create_connection(
File "/usr/local/lib/python3.12/site-packages/urllib3/connection.py", line 704, in connect
self.sock = sock = self._new_conn()
File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 1093, in _validate_conn
conn.connect()
File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 464, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
response = self._make_request(
File "/usr/local/lib/python3.12/site-packages/requests/adapters.py", line 667, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
Here is the code logic:
session = requests.Session()
req = requests.Request(
    method=method,
    url=url,
    headers=headers,
    data=post_data,
)
prepared_req = session.prepare_request(req)
response = session.send(
    prepared_req,
    timeout=timeout,
    stream=True,  # Enable streaming
)
I'm getting the same issue as @agubelu and @weimzh sporadically --
Traceback (most recent call last):
File "[our_app]", line 382, in [some_func]
response = requests.request(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/opentelemetry/instrumentation/requests/__init__.py", line 333, in instrumented_send
raise exception.with_traceback(exception.__traceback__)
File "/usr/local/lib/python3.11/site-packages/opentelemetry/instrumentation/requests/__init__.py", line 255, in instrumented_send
result = wrapped_send(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 746, in send
r.content
File "/usr/local/lib/python3.11/site-packages/requests/models.py", line 902, in content
self._content = b"".join(self.iter_content(CONTENT_CHUNK_SIZE)) or b""
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/models.py", line 820, in generate
yield from self.raw.stream(chunk_size, decode_content=True)
File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1066, in stream
data = self.read(amt=amt, decode_content=decode_content)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 955, in read
data = self._raw_read(amt)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 878, in _raw_read
with self._error_catcher():
File "/usr/local/lib/python3.11/contextlib.py", line 144, in __exit__
next(self.gen)
File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 805, in _error_catcher
self.release_conn()
File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 645, in release_conn
self._pool._put_conn(self._connection)
File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 310, in _put_conn
self.pool.put(conn, block=False)
File "/usr/local/lib/python3.11/queue.py", line 152, in put
self.not_empty.notify()
File "/usr/local/lib/python3.11/threading.py", line 376, in notify
if not self._is_owned():
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/threading.py", line 289, in _is_owned
if self._lock.acquire(False):
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/gevent/thread.py", line 296, in acquire
sleep()
File "/usr/local/lib/python3.11/site-packages/gevent/hub.py", line 159, in sleep
waiter.get()
File "src/gevent/_waiter.py", line 143, in gevent._gevent_c_waiter.Waiter.get
File "src/gevent/_waiter.py", line 154, in gevent._gevent_c_waiter.Waiter.get
File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
[our_app]._OuterTimeoutError
Our code that calls requests.request is wrapped in gevent.Timeout, and _OuterTimeoutError is our custom exception that's passed to gevent.Timeout. The timeout passed to requests.request is 15 seconds and the timeout passed to gevent.Timeout is 16.5 seconds. But the exception happens after ~30 seconds (sometimes more, sometimes less).
It happens for many different domains as I can see in logs.
Also, I observe WORKER TIMEOUT (pid:<pid>) errors (indicating that gunicorn is killing an unresponsive gevent worker) in the logs at the same time as the exceptions mentioned above.
A small update on the issue -- I upgraded gunicorn and gevent to the latest versions and the "requests" issue mentioned above disappeared, though the "WORKER TIMEOUT" errors remained. So I started investigating further: I added a simple monitoring thread created specifically for detecting gunicorn issues of this kind (here is the thread monitor source code, in case it's helpful for someone else: https://gist.github.com/precious/852b1a74f16434bb05d44aede8a70774). All emails sent by the monitoring thread contained the following stack trace:
File "/usr/local/lib/python3.11/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
return handle(*args_tuple)
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/ggevent.py", line 123, in handle
super().handle(listener, client, addr)
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/base_async.py", line 54, in handle
self.handle_request(listener_name, req, client, addr)
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/ggevent.py", line 127, in handle_request
super().handle_request(listener_name, req, sock, addr)
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/base_async.py", line 107, in handle_request
respiter = self.wsgi(environ, resp.start_response)
File "/usr/local/lib/python3.11/site-packages/django/core/handlers/wsgi.py", line 124, in __call__
response = self.get_response(request)
File "/usr/local/lib/python3.11/site-packages/django/core/handlers/base.py", line 140, in get_response
response = self._middleware_chain(request)
<...>
File "/usr/local/lib/python3.11/site-packages/health_check/views.py", line 88, in get
health_check_has_error = self.check(subset)
File "/usr/local/lib/python3.11/site-packages/health_check/mixins.py", line 23, in check
return self.run_check(subset=subset)
File "/usr/local/lib/python3.11/site-packages/health_check/mixins.py", line 81, in run_check
for plugin in executor.map(_run, plugin_instances):
File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 608, in map
fs = [self.submit(fn, *args) for args in zip(*iterables)]
File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 608, in <listcomp>
fs = [self.submit(fn, *args) for args in zip(*iterables)]
File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 162, in submit
with self._shutdown_lock, _global_shutdown_lock:
So, it turned out the issue was happening inside the "django-health-check" package that we use, in a ThreadPoolExecutor-related call. After disabling threading in the django-health-check config, we stopped seeing the issue. Some time ago we faced a similar issue: we were using ThreadPoolExecutor in our own code and it was also sporadically resulting in "WORKER TIMEOUT". When we rewrote that part using greenlets explicitly, that issue also disappeared (so we added a condition: in a gevent environment use greenlets, otherwise use ThreadPoolExecutor, as sketched below).
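A rough sketch of that condition (hypothetical helper; gevent.monkey.is_module_patched is the documented way to check whether patching has happened):

from concurrent.futures import ThreadPoolExecutor

from gevent import monkey

def run_concurrently(funcs):
    # Hypothetical helper: use greenlets when gevent has patched threading,
    # otherwise fall back to a regular thread pool.
    if monkey.is_module_patched('threading'):
        import gevent
        jobs = [gevent.spawn(f) for f in funcs]
        gevent.joinall(jobs, raise_error=True)
        return [job.value for job in jobs]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(f) for f in funcs]
        return [future.result() for future in futures]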
After some extra investigation, I added a stack trace print to Python's standard library "threading" module (threading.py) and found out that we were importing threading in the gevent worker process before monkey patching. That was the root cause of the issue. BEFORE:
# part of gunicorn.conf.py
from psycogreen.gevent import patch_psycopg

try:
    import grpc.experimental.gevent as grpc_gevent
except ImportError:
    grpc_gevent = None

def post_fork(server, worker):
    patch_psycopg()
    if grpc_gevent is not None:
        grpc_gevent.init_gevent()
(Bad! post_fork() is called in the gevent worker process before monkey patching, whereas grpc_gevent.init_gevent() must be called after monkey patching according to the grpc_gevent.init_gevent docstring, because it imports threading and maybe something else.)
AFTER:
# part of gunicorn.conf.py
from gevent import events as gevent_events

def post_patch(event):
    # this function is only called for the gevent worker.
    # from gevent docs (https://www.gevent.org/api/gevent.monkey.html):
    # You can also subscribe to the events to provide additional patching beyond what gevent distributes, either for
    # additional standard library modules, or for third-party packages. The suggested time to do this patching is in
    # the subscriber for gevent.events.GeventDidPatchBuiltinModulesEvent.
    if isinstance(event, gevent_events.GeventDidPatchBuiltinModulesEvent):
        from psycogreen.gevent import patch_psycopg
        patch_psycopg()
        try:
            import grpc.experimental.gevent as grpc_gevent
            # according to the "init_gevent()" method docstring:
            # This must be called AFTER the python standard lib has been patched, but BEFORE creating and gRPC objects.
            # so we cannot call it in "post_fork"
            grpc_gevent.init_gevent()
        except ImportError:
            pass

gevent_events.subscribers.append(post_patch)
Now the issue is fixed, we do not observe the "WORKER TIMEOUT" errors anymore and we can use ThreadPoolExecutor from the standard library since it's patched properly now (no need to use greenlets explicitly).
For what it's worth, I observed the same symptom that agubelu did in the comment https://github.com/gevent/gevent/issues/2060#issuecomment-2428984089 .
I had many greenlets polling and waiting, and one greenlet with a stack ending in _is_owned -> acquire -> sleep. Also installed from nix for what it's worth. It would stay in this state indefinitely until killed.
Refactoring away from requests did not fix the issue.
Upgrading gevent (24.2.1 -> 25.4.2) + greenlet (3.1.1 -> 3.2.1) fixed the bug completely.
From the changelog notes for version 25.4.1:
Monkey-patching the queue module (done by default in patch_all) now patches Queue, PriorityQueue, and LifoQueue. In addition to the general benefits of making all those classes cooperative, this is known to solve a non-deterministic deadlock with urllib3.
So it looks like upgrading to >=25.4.1 does indeed solve the issue.