gunicorn icon indicating copy to clipboard operation
gunicorn copied to clipboard

workers/gthread: Remove locks + one event queue + general cleanup

Open sylt opened this issue 1 year ago • 10 comments

The main purpose is to remove complexity from gthread by:

  • Removing the lock for handling self._keep and self.poller. This is possible since we now do all such manipulation on the main thread instead. When a connection is done, it posts a callback through the PollableMethodCaller which gets executed on the main thread.

  • Having a single event queue (self.poller), as opposed to also managing a set of futures. This fixes #3146 (although there are more minimal ways of doing it).

There are other more minor things as well:

  • Renaming some variables, e.g. self._keep to self.keepalived_conns.
  • Remove self-explanatory comments (what the code does, not why).
  • Remove fiddling with connection socket block/not blocked.

Some complexity has been added to the shutdown sequence, but hopefully for good reason: it's to make sure that all already accepted connections are served within the grace period.

sylt avatar Feb 17 '24 23:02 sylt

If it is so, then that doesn't sound good @javiertejero -- I would have thought the issue to be fixed! I can't reproduce any hanging behavior with while ab -n1000 -c100 http://0.0.0.0:8080/; do :; done running for a few minutes with the suggested pull request. However, on master I have no problem getting the error.

I'm testing the standalone app using the options:

    options = {
        'bind': '%s:%s' % ('127.0.0.1', '8080'),
        'workers': 1,
        'worker_class': 'gthread',
        'threads': 3,
        'worker_connections': 4,
    }

But now I have tested using Linux. Are you running MacOS?

sylt avatar Jul 02 '24 12:07 sylt

@sylt oh yes, apologies, I was testing with MacOS

when trying on linux it works just fine as you mentioned with a huge concurrency :)

however in MacOS I see this with "just" 100 concurrent clients

 ab -n100 -c100 http://0.0.0.0:8080/
This is ApacheBench, Version 2.3 <$Revision: 1903618 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 0.0.0.0 (be patient)...apr_socket_recv: Connection reset by peer (54)
Total of 1 requests completed

on the server I see this:

[2024-07-02 17:09:20 +0200] [45416] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/.../gunicorn/gunicorn/arbiter.py", line 609, in spawn_worker
    worker.init_process()
  File "/.../gunicorn/gunicorn/workers/gthread.py", line 115, in init_process
    super().init_process()
  File "/.../gunicorn/gunicorn/workers/base.py", line 142, in init_process
    self.run()
  File "/.../gunicorn/gunicorn/workers/gthread.py", line 205, in run
    self.set_accept_enabled(new_connections_still_accepted)
  File "/.../gunicorn/gunicorn/workers/gthread.py", line 133, in set_accept_enabled
    method(sock, event, self.accept)
  File ".../.pyenv/versions/3.11.7/lib/python3.11/selectors.py", line 261, in modify
    key = self.register(fileobj, events, data)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../.pyenv/versions/3.11.7/lib/python3.11/selectors.py", line 518, in register
    key = super().register(fileobj, events, data)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../.pyenv/versions/3.11.7/lib/python3.11/selectors.py", line 236, in register
    raise ValueError("Invalid events: {!r}".format(events))
ValueError: Invalid events: 0

UPDATE: with less concurrency like 10 it works fine

let me know if I can help further with this

javiertejero avatar Jul 02 '24 15:07 javiertejero

Thanks for confirming @javiertejero ! Based on your stack trace (thank you very much!) I think I have corrected the error. If you would have the ability to test the latest patch set, I would be more than grateful!

The error was my usage of DefaultSelector().register, which has different behavior depending on if one is running Linux (which will use the EpollSelector) or Mac (which uses KqueueSelector). The former accepts to have events as set as "0" (meaning no events), while the latter doesn't.

The fix is to try and not be smart, but simply use DefaultSelector().unregister when we're not interested in accepting new connections any more :) This way, we won't rely on any implementation specific behavior. Perhaps even the code got a bit clearer...

sylt avatar Jul 03 '24 20:07 sylt

@sylt I don't think we are allowed to call DefaultSelector.unregister twice in a row (1. bottom of loop 2. immediately after)

pajod avatar Jul 03 '24 21:07 pajod

@pajod Ah, yes, you're right! We could get in that situation if the server is terminated right at the time when it's very busy. I've tried to address it in the latest patch. Thanks!

sylt avatar Jul 03 '24 21:07 sylt

Sorry, does not merge cleanly because of my #3189 - fix is to apply those changes like this:

diff --git a/gunicorn/workers/gthread.py b/gunicorn/workers/gthread.py
index 49946d77..196759b8 100644
--- a/gunicorn/workers/gthread.py
+++ b/gunicorn/workers/gthread.py
@@ -1 +0,0 @@
-# -*- coding: utf-8 -
@@ -34 +33 @@ from ..http import wsgi
-class TConn(object):
+class TConn:
@@ -145 +144 @@ class ThreadWorker(base.Worker):
-        except EnvironmentError as e:
+        except OSError as e:
@@ -264 +263 @@ class ThreadWorker(base.Worker):
-        except EnvironmentError as e:
+        except OSError as e:
@@ -318 +317 @@ class ThreadWorker(base.Worker):
-        except EnvironmentError:
+        except OSError:
@@ -329 +328 @@ class ThreadWorker(base.Worker):
-                except EnvironmentError:
+                except OSError:

pajod avatar Aug 14 '24 20:08 pajod

I think I have corrected the error

sorry for the delay, just tested in macosx again and it works now, thanks @sylt

javiertejero avatar Aug 22 '24 23:08 javiertejero

having this merged and included in a published version would be awesome. @benoitc Any chance it happens in a (somewhat) near future?

Thanks

douardda avatar Jan 09 '25 11:01 douardda

having this merged and included in a published version would be awesome. @benoitc Any chance it happens in a (somewhat) near future?

Thanks

@benoitc , putting this here to get this into your review list

cc: @sylt @javiertejero @pajod

GoelJatin avatar Sep 03 '25 07:09 GoelJatin

I'm infrequently, occasionally, getting deadlocks with "gevent".

This pull request is aimed at improving "gthread". It only modifies gunicorn/workers/gthread.py

Could any of the improvements/techniques here also apply to other worker_classes?

sdarwin avatar Nov 16 '25 21:11 sdarwin