
requests.get inside gevent.spawn is getting stuck

Open anupamdialpad opened this issue 1 year ago • 20 comments

  • gevent version: version 24.2.1 . Installed using pip.
  • Python version: cPython 3.11.4 downloaded from https://github.com/python/cpython/releases/tag/v3.11.4
  • Operating System: Debian 11 kernel version 5.10.0-29-amd64
  • requests version: 2.32.3 installed using pip

Description:

Recently we have been seeing that a requests.get call inside a gevent.spawn waits on a lock that it is never able to acquire and gets stuck forever. After this, the thread is unable to process anything. Since this happens on the main thread, our main app ends up in a stuck state, unable to process anything else.

Below is the stack trace that I got by running py-bt in gdb:

  <built-in method __enter__ of _thread.lock object at remote 0x7f3350025540>
  File "/usr/local/rackman/lib/python3.11/site-packages/gevent/_threading.py", line 188, in put
    with self._mutex:
  File "/usr/local/rackman/lib/python3.11/site-packages/gevent/threadpool.py", line 545, in spawn
    task_queue.put((func, args, kwargs, thread_result))
  File "/usr/local/rackman/lib/python3.11/site-packages/gevent/pool.py", line 161, in apply
    return self.spawn(func, *args, **kwds).get()
  File "/usr/local/rackman/lib/python3.11/site-packages/gevent/resolver/thread.py", line 63, in getaddrinfo
    return self.pool.apply(_socket.getaddrinfo, args, kwargs)
  File "/usr/local/rackman/lib/python3.11/site-packages/gevent/_socketcommon.py", line 225, in getaddrinfo
    addrlist = get_hub().resolver.getaddrinfo(host, port, family, type, proto, flags)
  File "/usr/local/rackman/lib/python3.11/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/local/rackman/lib/python3.11/site-packages/urllib3/connection.py", line 196, in _new_conn
    sock = connection.create_connection(
  File "/usr/local/rackman/lib/python3.11/site-packages/urllib3/connection.py", line 615, in connect
    self.sock = sock = self._new_conn()
  File "/usr/local/rackman/lib/python3.11/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
    conn.connect()
  File "/usr/local/rackman/lib/python3.11/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    self._validate_conn(conn)
  File "/usr/local/rackman/lib/python3.11/site-packages/urllib3/connectionpool.py", line 789, in urlopen
    response = self._make_request(
  File "/usr/local/rackman/lib/python3.11/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/usr/local/rackman/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/rackman/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/rackman/lib/python3.11/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/rackman/lib/python3.11/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/usr/local/rackman/app/cumulus_switch_manager.py", line 26, in fetch_cumulus_api
    response = requests.get(url,
  File "/usr/local/rackman/app/cumulus_switch_manager.py", line 42, in get_bgp_data
    bgp_data = self.fetch_cumulus_api('vrf/default/router/bgp/neighbor')
  File "/usr/local/rackman/app/cumulus_switch_manager.py", line 63, in generate_heartbeat_data
    bgp_data = self.get_bgp_data()
  File "/usr/local/rackman/app/cumulus_switch_manager.py", line 101, in send_heartbeat
    heartbeat_data = self.generate_heartbeat_data()
  File "/usr/local/rackman/lib/python3.11/site-packages/gevent/hub.py", line 647, in run
    loop.run()

What I've run:

def schedule_task(delay, func):
  gevent.spawn(func)
  gevent.spawn_later(delay, schedule_task, delay, func)

def send_heartbeat():
  response = requests.get('https://{ip}:{port}/some-path')

def send_network_device_heartbeat():
  for network_device in network_devices:
    gevent.spawn(send_heartbeat)

def update_network_devices():
  new_network_devices = []
  # populate new_network_devices list ...
  network_devices = new_network_devices

schedule_task(60, send_network_device_heartbeat)
schedule_task(900, update_network_devices)

Any help on why this is getting stuck, or how to work around it, would be really appreciated. We have seen that if we replace gevent.spawn with Python threads, this issue doesn't occur.

anupamdialpad avatar Sep 24 '24 06:09 anupamdialpad

I have the same problem

dapanin avatar Sep 24 '24 08:09 dapanin

me too

Str1kez avatar Sep 24 '24 09:09 Str1kez

Based on that backtrace, I theorize that you're filling up gevent's thread pool because you're spawning way too many greenlets, all of which are, essentially, waiting on each other. If so, the quick solution is to use a non-threaded DNS resolver implementation. (You could also try increasing the number of threads that the threadpool will use, which defaults to 10, but that's just a bandaid that doesn't directly address the symptom or the cause.) I'll suggest what I think is a more proper solution below.
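
For reference, a minimal sketch (not from the original comment) of selecting a different resolver through gevent's configuration, assuming the c-ares extension is installed; the bandaid threadpool-size change is shown as well:

# Hypothetical sketch: choose the resolver via the environment (or
# gevent.config) before gevent is imported anywhere in the process.
import os
os.environ.setdefault('GEVENT_RESOLVER', 'ares')  # or 'dnspython' / 'block'

import gevent
import gevent.monkey
gevent.monkey.patch_all()

# The bandaid alternative mentioned above: allow more threadpool workers
# than the default of 10.
gevent.get_hub().threadpool.maxsize = 20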

The underlying problem comes from this code:

def schedule_task(delay, func):
  gevent.spawn(func)
  gevent.spawn_later(delay, schedule_task, delay, func)

Each time that is called, it immediately creates two greenlets, one which will be switched to the next time the event loop runs around and gets to invoking callbacks, and one which will be switched to after an event loop timer completes.

Note that timers effectively have higher priority than callbacks: timer code is always run on the iteration of the event loop in which it expired, but callbacks may have to wait longer than that, depending on the value of getswitchinterval() and the contents of the callback queue. Thus, spawning the delayed greenlet --- which itself spawns two more! --- is a higher priority than servicing an existing greenlet that may have results ready. (This suggests you can increase the switch interval, and that might be another bandaid, but it also doesn't solve the underlying issue.)

If everything is working exactly perfectly, these greenlets will not build up; each one will finish before the next two are created. But what if things are going more slowly than anticipated? That is, what if executing func takes longer than delay? This could be due to an overloaded event loop, or network latency/timeouts, or blocking (non-cooperative) code, or... Let's walk through that scenario.

At time 0, two greenlets are created, call them "func greenlet 1" and "schedule greenlet 1". At the next iteration of the event loop, "func greenlet 1" begins running and continues to run until it hits a function using the gevent API; in this case, that's getaddrinfo, and that call has occupied one of the 10 available slots in the threadpool. At that time, some other greenlet --- possibly the hub --- is switched to.

At time 0 + delay, the timer expires and "schedule greenlet 1" runs, which now creates "func greenlet 2" and "schedule greenlet 2". At this point, there are three greenlets, one of which is "stuck" waiting on the results of getaddrinfo.

At the next iteration of the event loop, "func greenlet 2" runs, queues up a threadpool call for getaddrinfo and switches out. At time 0 + 2 x delay, "schedule greenlet 2" gets to run. On the next trip through the event loop, there are now "func greenlet 1" (waiting on getaddrinfo), "func greenlet 2" (ditto) and "func greenlet 3" (yup, ditto), plus the new "schedule greenlet 3".

Repeat this more than 10 times (or less if you're using the threadpool for other long-running tasks), and you'll find your threadpool occupied and unable to accept new tasks.

Now, the getaddrinfo calls, which were running on separate threads, should ultimately finish and release their slots to allow the main thread to submit new work, but apparently that's not happening. The reasons why are unclear to me without seeing what other threads are doing. I suggest using something like py-spy dump to get a complete list of what all threads are doing.

Ok, having thought through that exercise, I would rewrite schedule_task something like the following (untested, but I think the basic idea is correct). In this way, you never risk building up greenlets all trying to do the same thing, and you also don't flood the target (if a bunch of "func greenlet X" finally get their callbacks run at the same time), no matter how the event loop is being used.

import time

import gevent


def schedule_task(delay, func):
    # We want to run *func* every *delay* seconds, and it would be good to
    # do this as close to that time limit as possible. We compensate for the
    # time it takes to run in order to do this.
    def run_it():
        while True:
            begin_time = time.time()
            next_desired_run_time = begin_time + delay
            func()  # Add exception handling as desired.
            sleep_duration = next_desired_run_time - time.time()
            if sleep_duration < 0:
                # Uh-oh! We ran longer than *delay* seconds. Depending on our
                # goals, we could run again immediately, or we could go back
                # to deferring for *delay* seconds. Here, we're choosing the
                # former approach, but allowing gevent the chance to switch
                # to other greenlets first, but only if there is something
                # immediately runnable. Choosing a small, non-zero delay
                # would force a trip around the event loop, which can be
                # kinder to other greenlets doing IO.
                gevent.sleep(0)
            else:
                gevent.sleep(sleep_duration)
    return gevent.spawn(run_it)

Alternately, you could make func responsible for scheduling its next execution as the very last thing it does before returning. It could/should employ the same delay adjustment as above.
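
For illustration, a minimal sketch (not from the original comment) of that alternative; run_and_reschedule is a hypothetical helper name:

import time

import gevent


def run_and_reschedule(delay, func):
    begin_time = time.time()
    try:
        func()  # Add exception handling as desired.
    finally:
        # Re-arm only after this run has finished, so runs can never overlap,
        # compensating for however long *func* took.
        remaining = (begin_time + delay) - time.time()
        gevent.spawn_later(max(remaining, 0), run_and_reschedule, delay, func)

# Kick it off once; it keeps rescheduling itself:
# gevent.spawn(run_and_reschedule, 60, send_network_device_heartbeat)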

jamadden avatar Sep 24 '24 14:09 jamadden

Thanks @jamadden for looking at it! There are a few things I want to point out here:

  1. We are actually passing a timeout to the requests.get call (sorry, I didn't mention it in the snippet just to keep it short). Here is how it is used: requests.get(url, headers=headers, verify=False, timeout=45, auth=(username, password)). So I don't see how the next greenlet can overlap with the previous ones, since the next run is at time 60 while the current one would have ended by time 45.
  2. The number of threads was < 10 every time, which I checked by running ps -Tf -p <pid>. I have seen the thread count vary between 1, 2, and 3.
  3. We are directly passing the IP address here and not using a DNS name, so I don't understand why DNS resolution would be required.
  4. Also, in schedule_task the code actually uses gevent.spawn_later(0, func) instead of gevent.spawn (sorry again, I thought spawn_later with a 0 delay is the same as spawn).
  5. There are other places in the application where we use schedule_task in this recursive manner without requests.get, and they don't have this problem. We are only seeing this issue in this specific scenario, where there is a requests.get call inside gevent.spawn.

I will change schedule_task as you proposed and check if the problem goes away.

anupamdialpad avatar Sep 25 '24 05:09 anupamdialpad

We are actually passing a timeout to the requests.get call

Somewhat immaterial, as that doesn't provide a timeout for the entire operation, only the part where you're reading data from the server. Connecting and address resolution are separate.

If you want a timeout for the entire operation, use the functional or context manager versions of gevent.Timeout
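
For illustration, a minimal sketch (not from the original comment) of the context-manager form, reusing the 45-second budget mentioned above; url is a placeholder:

import gevent
import requests

try:
    with gevent.Timeout(45):
        # The whole operation -- DNS resolution, connecting, and reading --
        # must finish within 45 seconds, unlike the requests timeout.
        response = requests.get(url, timeout=45)
except gevent.Timeout:
    response = None  # The overall 45-second budget was exceeded.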

We are directly passing the ip address here and not using DNS name.

Also somewhat immaterial. The same code path is used, resulting in a call to getaddrinfo all the time, and that's the part dealing with the threadpool.

There are other places in the application we are using schedule_task in this recursive manner which don't use requests.get but they don't have this problem.

As I said, I can only theorize based on one single stacktrace of a multi-threaded process and a very small code snippet that doesn't produce any problems for me.

Oh, one thing I neglected to mention, and which can only compound the problem, is spawning the "spawn two greenlets" function for every item in network_devices. So instead of spawning two greenlets, it's spawning 3 * len(network_devices). How many of those do you have?

For further debugging, instead of just the Python thread stack traces (with, e.g., py-spy) you probably also want to get the stack traces of all waiting greenlets. gevent provides functions for this, and those are commonly arranged to be invoked by a trivial signal handler function like signal.signal(signal.SIGPROF, lambda *_a, **_kw: print_run_info()). The signal should interrupt the lock waits and still run (I think) if you use the non monkey-patched signal.signal call (either do so before patching or use monkey.get_original). Alternately, enable gevent's monitoring features; the blocking monitor may get you everything you need, and if not, you can arrange for it to periodically call one of the _run_info() functions.
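
For reference, a minimal sketch (not from the original comment) of enabling those monitoring features through gevent's configuration; the 0.5-second threshold is an arbitrary example, and the settings need to be in place before the hub starts:

import gevent
import gevent.monkey

# Hypothetical example values; see gevent.config for the documented settings.
gevent.config.monitor_thread = True    # start the background monitor thread
gevent.config.max_blocking_time = 0.5  # report greenlets blocking the loop longer than this

gevent.monkey.patch_all()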

jamadden avatar Sep 25 '24 10:09 jamadden

Just to clarify, your sample code isn't monkey-patching, but the provided stack trace appears to be from a monkey-patched system. I assume you are actually monkey-patching? And hopefully not setting any special options, just calling monkey.patch_all()? (Only patching some parts of the system isn't tested.)

jamadden avatar Sep 25 '24 10:09 jamadden

So instead of spawning two greenlets, its spawning 3 * len(network_devices). How many of those do you have?

I thought it was 2 + len(network_devices): 2 from schedule_task and 1 for each of the network_devices. On average, len(network_devices) is 4.

I assume you are actually monkey-patching?

Yes it's fully monkey patched gevent.monkey.patch_all()

anupamdialpad avatar Sep 25 '24 11:09 anupamdialpad

@jamadden this is the link to the print_run_info output from when this issue happens. I am seeing >10 ThreadPoolWorker entries in the run info. Could this issue be because of that, then?

anupamdialpad avatar Oct 05 '24 04:10 anupamdialpad

This is the code that I had used

import signal
import gevent.util

def gevent_run_info(*_, **__):
  with open('/tmp/gevent-run-info.log', 'a') as fd:
    gevent.util.print_run_info(file=fd)

signal.signal(signal.SIGPROF, gevent_run_info)

import gevent.monkey
gevent.monkey.patch_all()

# rest of the code then ...

After that I got the run info with kill -s SIGPROF <process-id>

anupamdialpad avatar Oct 05 '24 04:10 anupamdialpad

The concerning thing is that all of the worker threads are dead:

<ThreadPoolWorker at 0x7f4ff1fa1d80 thread_ident=0x7f4ff46a2700 threadpool-hub=<missing>>; not running
 :    Parent: <greenlet.greenlet object at 0x7f4ff1fa16c0 (otid=0x(nil)) (thread exited) dead>

Something is killing them that isn't getting caught and handled. Figuring out what's going on there will be important.

I'm sure you noticed that you have 28 greenlets that died because of a programming error (finished with exception UnboundLocalError("cannot access local variable 'version_data' where it is not associated with a value")). I don't see how that can be responsible for this problem, but it would be good to fix it. (And as always, good to be sure you stop referencing dead greenlets as soon as possible.)

jamadden avatar Oct 09 '24 21:10 jamadden

There are some places where we raise gevent.GreenletExit to terminate a greenlet. Could that explain the ThreadPoolWorker going to a dead state?

anupamdialpad avatar Oct 15 '24 02:10 anupamdialpad

@jamadden after fixing the UnboundLocalError, I have this new log link for gevent print_run_info from when the process went to a hung state. In this case there is only 1 ThreadPoolWorker, so it seems the pool is not maxing out. Do you see anything interesting in the logs?

In this case the signal handler was added after monkey patching (since there were some challenges in adding it before gevent's monkey patching). This is the code:

import gevent.monkey
import gevent.util

def gevent_run_info(*_, **__):
  with open('/tmp/gevent-run-info.log', 'a') as fd:
    gevent.util.print_run_info(file=fd)

signal, SIGPROF = gevent.monkey.get_original('signal', ('signal', 'SIGPROF'))
signal(SIGPROF, gevent_run_info)

anupamdialpad avatar Oct 18 '24 15:10 anupamdialpad

We're having a similar issue. One of our greenlets gets stuck, in this case when a connection is returned to the pool, and ends up freezing the whole application. This is the stack trace of the relevant greenlet when it's hung (application-specific trace removed):

File [our_app], line 120, in [some_func]
  response = requests.get(
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/requests/api.py", line 73, in get
  return request("get", url, params=params, **kwargs)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/requests/api.py", line 59, in request
  return session.request(method=method, url=url, **kwargs)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
  resp = self.send(prep, **send_kwargs)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/requests/sessions.py", line 746, in send
  r.content
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/requests/models.py", line 902, in content
  self._content = b"".join(self.iter_content(CONTENT_CHUNK_SIZE)) or b""
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/requests/models.py", line 820, in generate
  yield from self.raw.stream(chunk_size, decode_content=True)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/urllib3/response.py", line 1043, in stream
  data = self.read(amt=amt, decode_content=decode_content)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/urllib3/response.py", line 935, in read
  data = self._raw_read(amt)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/urllib3/response.py", line 861, in _raw_read
  with self._error_catcher():
File "/nix/store/z3qj3yqyk4gl5ywsp0b2q3xi18v7scv8-python3-3.10.14/lib/python3.10/contextlib.py", line 142, in __exit__
  next(self.gen)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/urllib3/response.py", line 788, in _error_catcher
  self.release_conn()
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/urllib3/response.py", line 628, in release_conn
  self._pool._put_conn(self._connection)
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 313, in _put_conn
  self.pool.put(conn, block=False)
File "/nix/store/z3qj3yqyk4gl5ywsp0b2q3xi18v7scv8-python3-3.10.14/lib/python3.10/queue.py", line 152, in put
  self.not_empty.notify()
File "/nix/store/z3qj3yqyk4gl5ywsp0b2q3xi18v7scv8-python3-3.10.14/lib/python3.10/threading.py", line 369, in notify
  if not self._is_owned():
File "/nix/store/z3qj3yqyk4gl5ywsp0b2q3xi18v7scv8-python3-3.10.14/lib/python3.10/threading.py", line 282, in _is_owned
  if self._lock.acquire(False):
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/gevent/thread.py", line 132, in acquire
  sleep()
File "/nix/store/807ahcfhmwiahs987q7jsxf11af7066a-python3-3.10.14-env/lib/python3.10/site-packages/gevent/hub.py", line 159, in sleep
  waiter.get()

agubelu avatar Oct 22 '24 11:10 agubelu

Same here; the last error log is below:

Traceback (most recent call last):
  File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
  File "/data/python_env/web/lib64/python3.9/site-packages/gevent/_socketcommon.py", line 613, in connect
    self._wait(self._write_event)
  File "src/gevent/_hub_primitives.py", line 317, in gevent._gevent_c_hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 322, in gevent._gevent_c_hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 313, in gevent._gevent_c_hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 314, in gevent._gevent_c_hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 55, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_waiter.py", line 154, in gevent._gevent_c_waiter.Waiter.get
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/data/python_env/web/lib64/python3.9/site-packages/urllib3/connection.py", line 179, in _new_conn
    raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7fbff871c250>, 'Connection to {IP} timed out. (connect timeout=0.1)')

During handling of the above exception, another exception occurred:

And other info -- the backtrace always shows it waiting in the loop:

(gdb) py-bt
Traceback (most recent call first):
  File "/data/python_env/web/lib64/python3.9/site-packages/gevent/hub.py", line 647, in run
    loop.run()

I'm using pool.spawn now; is gevent.spawn_later better?

betterlch avatar Nov 01 '24 09:11 betterlch

@jamadden I think the threadpool is maxing out, since I see the below trace for greenlets which get stuck at self._mutex. The count of such occurrences in the run_info output is 11, which is greater than the default threadpool size of 10.

File "/usr/local/rackman/lib/python3.11/site-packages/urllib3/util/connection.py", line 60, in create_connection
  for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/_socketcommon.py", line 221, in getaddrinfo
  addrlist = get_hub().resolver.getaddrinfo(host, port, family, type, proto, flags)
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/resolver/thread.py", line 63, in getaddrinfo
  return self.pool.apply(_socket.getaddrinfo, args, kwargs)
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/pool.py", line 161, in apply
  return self.spawn(func, *args, **kwds).get()
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/threadpool.py", line 545, in spawn
  task_queue.put((func, args, kwargs, thread_result))
File "/usr/local/rackman/lib/python3.11/site-packages/gevent/_threading.py", line 188, in put
  with self._mutex:

After changing the DNS resolver to c-ares (a non-threaded gevent DNS resolver), we are not seeing this issue, further solidifying my hypothesis.

anupamdialpad avatar Nov 11 '24 08:11 anupamdialpad

We're having a similar issue. One of our greenlets gets stuck, in this case when a connection is returned to the pool, and ends up freezing the whole application. [agubelu's stack trace from Oct 22, quoted verbatim above, omitted here.]

I am having exactly the same issue as this one; could I ask if this has been solved?

 :                          data = requests.post(url, headers=headers, params=params, json=body, timeout=0.5).json()
 :                        File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 115, in post
 :                          return request("post", url, data=data, json=json, **kwargs)
 :                        File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 59, in request
 :                          return session.request(method=method, url=url, **kwargs)
 :                        File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
 :                          resp = self.send(prep, **send_kwargs)
 :                        File "/usr/local/lib/python3.10/site-packages/opentelemetry/instrumentation/requests/__init__.py", line 221, in instrumented_send
 :                          result = wrapped_send(
 :                        File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
 :                          r = adapter.send(request, **kwargs)
 :                        File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 633, in send
 :                          conn = self.get_connection_with_tls_context(
 :                        File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 489, in get_connection_with_tls_context
 :                          conn = self.poolmanager.connection_from_host(
 :                        File "/usr/local/lib/python3.10/site-packages/urllib3/poolmanager.py", line 303, in connection_from_host
 :                          return self.connection_from_context(request_context)
 :                        File "/usr/local/lib/python3.10/site-packages/urllib3/poolmanager.py", line 328, in connection_from_context
 :                          return self.connection_from_pool_key(pool_key, request_context=request_context)
 :                        File "/usr/local/lib/python3.10/site-packages/urllib3/poolmanager.py", line 351, in connection_from_pool_key
 :                          pool = self._new_pool(scheme, host, port, request_context=request_context)
 :                        File "/usr/local/lib/python3.10/site-packages/urllib3/poolmanager.py", line 265, in _new_pool
 :                          return pool_cls(host, port, **request_context)
 :                        File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 210, in __init__
 :                          self.pool.put(None)
 :                        File "/usr/local/lib/python3.10/queue.py", line 152, in put
 :                          self.not_empty.notify()
 :                        File "/usr/local/lib/python3.10/threading.py", line 369, in notify
 :                          if not self._is_owned():
 :                        File "/usr/local/lib/python3.10/threading.py", line 282, in _is_owned
 :                          if self._lock.acquire(False):
 :                        File "/usr/local/lib/python3.10/site-packages/gevent/thread.py", line 141, in acquire
 :                          sleep()
 :                        File "/usr/local/lib/python3.10/site-packages/gevent/hub.py", line 159, in sleep
 :                          waiter.get()

Thanks.

weimzh avatar Jan 03 '25 05:01 weimzh

@weimzh if I may ask, what URLs are you running the requests against? Does it always happen for the same domain, or is it more random?

agubelu avatar Jan 23 '25 16:01 agubelu

Same problem; here is the stack trace:

Traceback (most recent call first):
  <built-in method __enter__ of _thread.lock object at remote 0x7fcbd7f465c0>
  File "/usr/local/lib/python3.12/site-packages/gevent/_threading.py", line 188, in put
    with self._mutex:
  File "/usr/local/lib/python3.12/site-packages/gevent/threadpool.py", line 545, in spawn
    task_queue.put((func, args, kwargs, thread_result))
  File "/usr/local/lib/python3.12/site-packages/gevent/pool.py", line 161, in apply
    return self.spawn(func, *args, **kwds).get()
  File "/usr/local/lib/python3.12/site-packages/gevent/resolver/thread.py", line 63, in getaddrinfo
    return self.pool.apply(_socket.getaddrinfo, args, kwargs)
  File "/usr/local/lib/python3.12/site-packages/gevent/_socketcommon.py", line 221, in getaddrinfo
    addrlist = get_hub().resolver.getaddrinfo(host, port, family, type, proto, flags)
  File "/usr/local/lib/python3.12/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/local/lib/python3.12/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
  File "/usr/local/lib/python3.12/site-packages/urllib3/connection.py", line 704, in connect
    self.sock = sock = self._new_conn()
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 1093, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 464, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.12/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)

Here is the code logic:

import requests

session = requests.Session()
req = requests.Request(
    method=method,
    url=url,
    headers=headers,
    data=post_data,
)
prepared_req = session.prepare_request(req)
response = session.send(
    prepared_req,
    timeout=timeout,
    stream=True,  # Enable streaming
)

SmallPineApp1e avatar Mar 06 '25 05:03 SmallPineApp1e

I'm getting the same issue as @agubelu and @weimzh sporadically --

Traceback (most recent call last):
  File "[our_app]", line 382, in [some_func]
    response = requests.request(
               ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/opentelemetry/instrumentation/requests/__init__.py", line 333, in instrumented_send
    raise exception.with_traceback(exception.__traceback__)
  File "/usr/local/lib/python3.11/site-packages/opentelemetry/instrumentation/requests/__init__.py", line 255, in instrumented_send
    result = wrapped_send(
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 746, in send
    r.content
  File "/usr/local/lib/python3.11/site-packages/requests/models.py", line 902, in content
    self._content = b"".join(self.iter_content(CONTENT_CHUNK_SIZE)) or b""
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/models.py", line 820, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1066, in stream
    data = self.read(amt=amt, decode_content=decode_content)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 955, in read
    data = self._raw_read(amt)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 878, in _raw_read
    with self._error_catcher():
  File "/usr/local/lib/python3.11/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 805, in _error_catcher
    self.release_conn()
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 645, in release_conn
    self._pool._put_conn(self._connection)
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 310, in _put_conn
    self.pool.put(conn, block=False)
  File "/usr/local/lib/python3.11/queue.py", line 152, in put
    self.not_empty.notify()
  File "/usr/local/lib/python3.11/threading.py", line 376, in notify
    if not self._is_owned():
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/threading.py", line 289, in _is_owned
    if self._lock.acquire(False):
       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/gevent/thread.py", line 296, in acquire
    sleep()
  File "/usr/local/lib/python3.11/site-packages/gevent/hub.py", line 159, in sleep
    waiter.get()
  File "src/gevent/_waiter.py", line 143, in gevent._gevent_c_waiter.Waiter.get
  File "src/gevent/_waiter.py", line 154, in gevent._gevent_c_waiter.Waiter.get
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
[our_app]._OuterTimeoutError

Our code that calls requests.request is wrapped in gevent.Timeout, and _OuterTimeoutError is our custom exception that's passed to gevent.Timeout. The timeout passed to requests.request is 15 seconds and the timeout passed to gevent.Timeout is 16.5 seconds, but the exception happens after ~30 seconds (sometimes more or less). It happens for many different domains, as I can see in the logs.

precious avatar Jun 13 '25 12:06 precious

I also observe WORKER TIMEOUT (pid:<pid>) errors (indicating that gunicorn is killing an unresponsive gevent worker) in the logs at the same time as the exceptions mentioned above.

precious avatar Jun 13 '25 12:06 precious

A small update on the issue -- I've upgraded gunicorn and gevent to the latest versions and the "requests" issue mentioned above disappeared, though the "WORKER TIMEOUT" errors remained. So I started investigating further: I added a simple monitoring thread created specifically for detecting gunicorn issues of this kind (here is the thread monitor source code, maybe it can be helpful for someone else - https://gist.github.com/precious/852b1a74f16434bb05d44aede8a70774). All emails sent by the monitoring thread contained the following stack trace:

  File "/usr/local/lib/python3.11/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
    return handle(*args_tuple)
  File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/ggevent.py", line 123, in handle
    super().handle(listener, client, addr)
  File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/base_async.py", line 54, in handle
    self.handle_request(listener_name, req, client, addr)
  File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/ggevent.py", line 127, in handle_request
    super().handle_request(listener_name, req, sock, addr)
  File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/base_async.py", line 107, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.11/site-packages/django/core/handlers/wsgi.py", line 124, in __call__
    response = self.get_response(request)
  File "/usr/local/lib/python3.11/site-packages/django/core/handlers/base.py", line 140, in get_response
    response = self._middleware_chain(request)
  <...>
  File "/usr/local/lib/python3.11/site-packages/health_check/views.py", line 88, in get
    health_check_has_error = self.check(subset)
  File "/usr/local/lib/python3.11/site-packages/health_check/mixins.py", line 23, in check
    return self.run_check(subset=subset)
  File "/usr/local/lib/python3.11/site-packages/health_check/mixins.py", line 81, in run_check
    for plugin in executor.map(_run, plugin_instances):
  File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 608, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 608, in <listcomp>
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 162, in submit
    with self._shutdown_lock, _global_shutdown_lock:

So, it turned out the issue was happening inside the "django-health-check" package that we use -- in a ThreadPoolExecutor-related call. After disabling threading in the django-health-check config, we stopped seeing the issue. Some time ago we faced a similar issue: we were using ThreadPoolExecutor in our own code and it was also sporadically resulting in "WORKER TIMEOUT". When we rewrote that part to use greenlets explicitly, that issue also disappeared (so we added a condition -- in a gevent environment use greenlets, otherwise use ThreadPoolExecutor; see the sketch below).
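
For illustration, a minimal sketch (not part of the original comment) of that condition; run_all is a hypothetical helper, and gevent.monkey.is_module_patched is gevent's documented way to check whether patching happened:

from concurrent.futures import ThreadPoolExecutor

import gevent
import gevent.monkey


def run_all(funcs):
    # Under gevent's monkey patching, spawn greenlets; otherwise fall back
    # to a regular thread pool.
    if gevent.monkey.is_module_patched('threading'):
        greenlets = [gevent.spawn(f) for f in funcs]
        gevent.joinall(greenlets, raise_error=True)
        return [g.value for g in greenlets]
    with ThreadPoolExecutor() as executor:
        return list(executor.map(lambda f: f(), funcs))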

After some extra investigation, I added a stack trace print to Python's standard library "threading" module (threading.py) and found out that we were importing threading in the gevent worker process before monkey patching. That was the root cause of the issue. BEFORE:

# part of gunicorn.conf.py
from psycogreen.gevent import patch_psycopg

try:
    import grpc.experimental.gevent as grpc_gevent
except ImportError:
    grpc_gevent = None


def post_fork(server, worker):
    patch_psycopg()

    if grpc_gevent is not None:
        grpc_gevent.init_gevent()

(Bad! "post_fork()" is called in the gevent worker process before monkey patching, whereas "grpc_gevent.init_gevent()" must be called after monkey patching according to the grpc_gevent.init_gevent docstring -- because it imports threading and maybe something else.)

AFTER:

# part of gunicorn.conf.py
from gevent import events as gevent_events

def post_patch(event):
    # this function is only called for gevent worker.
    # from gevent docs (https://www.gevent.org/api/gevent.monkey.html):
    # You can also subscribe to the events to provide additional patching beyond what gevent distributes, either for
    # additional standard library modules, or for third-party packages. The suggested time to do this patching is in
    # the subscriber for gevent.events.GeventDidPatchBuiltinModulesEvent.
    if isinstance(event, gevent_events.GeventDidPatchBuiltinModulesEvent):
        from psycogreen.gevent import patch_psycopg
        patch_psycopg()

        try:
            import grpc.experimental.gevent as grpc_gevent
            # according to the "init_gevent()" method docstring:
            # This must be called AFTER the python standard lib has been patched, but BEFORE creating and gRPC objects.
            # so we cannot call it in "post_fork"
            grpc_gevent.init_gevent()
        except ImportError:
            pass


gevent_events.subscribers.append(post_patch)

Now the issue is fixed, we do not observe the "WORKER TIMEOUT" errors anymore and we can use ThreadPoolExecutor from the standard library since it's patched properly now (no need to use greenlets explicitly).

precious avatar Jun 29 '25 14:06 precious

For what it's worth, I observed the same symptom that agubelu did in the comment https://github.com/gevent/gevent/issues/2060#issuecomment-2428984089 .

I had many greenlets polling and waiting, and one greenlet with a stack ending in _is_owned -> acquire -> sleep. Also installed from nix for what it's worth. It would stay in this state indefinitely until killed.

Refactoring away from requests did not fix the issue.

Upgrading gevent (24.2.1 -> 25.4.2) + greenlet (3.1.1 -> 3.2.1) fixed the bug completely.

smilliken avatar Jul 14 '25 22:07 smilliken

From the changelog notes for version 25.4.1:

Monkey-patching the queue module (done by default in patch_all) now patches Queue, PriorityQueue, and LifoQueue. In addition to the general benefits of making all those classes cooperative, this is known to solve a non-deterministic deadlock with urllib3.

So it looks like upgrading to >=25.4.1 does indeed solve the issue.

agubelu avatar Jul 16 '25 09:07 agubelu