workers/gthread: Remove locks + one event queue + general cleanup
The main purpose is to remove complexity from gthread by:
- Removing the lock for handling self._keep and self.poller. This is possible because we now do all such manipulation on the main thread instead. When a connection is done, it posts a callback through the PollableMethodCaller, which gets executed on the main thread (see the sketch after this list).
- Having a single event queue (self.poller), as opposed to also managing a set of futures. This fixes #3146 (although there are more minimal ways of doing it).
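To illustrate the first point, here is a minimal sketch of the wake-the-main-thread pattern the description refers to (the method names below are illustrative, not the actual PollableMethodCaller API): worker threads enqueue a callback and write a byte to a pipe watched by the main thread's poller, so the callback ends up running on the main thread.

import os
import queue
import selectors

class PollableMethodCaller:
    # Minimal sketch, not the actual gunicorn class: worker threads enqueue
    # a callback and wake the main thread's poller by writing one byte to a
    # pipe; the main thread then runs the queued callbacks.
    def __init__(self):
        self.read_fd, self.write_fd = os.pipe()
        self.calls = queue.SimpleQueue()

    def register(self, poller):
        # The main thread polls the read end like any other fd.
        poller.register(self.read_fd, selectors.EVENT_READ, self.run_callbacks)

    def defer(self, callback, *args):
        # Called from worker threads: enqueue the work and wake the poller.
        self.calls.put((callback, args))
        os.write(self.write_fd, b"\x00")

    def run_callbacks(self):
        # Runs on the main thread once the read end becomes readable.
        os.read(self.read_fd, 4096)  # consume pending wake-up bytes
        while not self.calls.empty():
            callback, args = self.calls.get()
            callback(*args)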
There are other, more minor changes as well:
- Renaming some variables, e.g. self._keep to self.keepalived_conns.
- Removing self-explanatory comments (those describing what the code does, not why).
- Removing the fiddling with setting the connection socket to blocking/non-blocking.
Some complexity has been added to the shutdown sequence, but hopefully for good reason: it's to make sure that all already accepted connections are served within the grace period.
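To make that concrete, here is a rough, hedged sketch of the shutdown idea (set_accept_enabled, nr_conns and cfg.graceful_timeout exist in the worker; drain_on_shutdown and poll_and_handle_events are made-up names, not the PR's actual code): stop accepting, then keep serving the connections that were already accepted until they finish or the graceful timeout expires.

import time

def drain_on_shutdown(self):
    # Hypothetical helper, for illustration only: stop accepting new
    # connections, then serve what was already accepted until done or
    # until the graceful timeout runs out.
    self.set_accept_enabled(False)
    deadline = time.monotonic() + self.cfg.graceful_timeout
    while self.nr_conns > 0:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        self.poll_and_handle_events(timeout=remaining)  # made-up name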
If that is the case, then it doesn't sound good @javiertejero -- I would have thought the issue was fixed! I can't reproduce any hanging behavior with while ab -n1000 -c100 http://0.0.0.0:8080/; do :; done running for a few minutes against the suggested pull request. On master, however, I have no problem triggering the error.
I'm testing the standalone app using the options:
options = {
    'bind': '%s:%s' % ('127.0.0.1', '8080'),
    'workers': 1,
    'worker_class': 'gthread',
    'threads': 3,
    'worker_connections': 4,
}
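(For reference, options like these can be fed to a standalone app via gunicorn's documented custom-application pattern; below is a minimal sketch with an illustrative WSGI handler, not the exact script used for the test above.)

import gunicorn.app.base

def handler_app(environ, start_response):
    # Trivial WSGI app used only for load testing.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello, world!\n']

class StandaloneApplication(gunicorn.app.base.BaseApplication):
    def __init__(self, app, options=None):
        self.options = options or {}
        self.application = app
        super().__init__()

    def load_config(self):
        # Copy known settings from the options dict into gunicorn's config.
        for key, value in self.options.items():
            if key in self.cfg.settings and value is not None:
                self.cfg.set(key.lower(), value)

    def load(self):
        return self.application

if __name__ == '__main__':
    StandaloneApplication(handler_app, options).run()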
But I have now tested using Linux. Are you running macOS?
@sylt oh yes, apologies, I was testing on macOS.
When trying on Linux it works just fine, as you mentioned, even with huge concurrency :)
However, on macOS I see this with "just" 100 concurrent clients:
ab -n100 -c100 http://0.0.0.0:8080/
This is ApacheBench, Version 2.3 <$Revision: 1903618 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 0.0.0.0 (be patient)...apr_socket_recv: Connection reset by peer (54)
Total of 1 requests completed
on the server I see this:
[2024-07-02 17:09:20 +0200] [45416] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/.../gunicorn/gunicorn/arbiter.py", line 609, in spawn_worker
worker.init_process()
File "/.../gunicorn/gunicorn/workers/gthread.py", line 115, in init_process
super().init_process()
File "/.../gunicorn/gunicorn/workers/base.py", line 142, in init_process
self.run()
File "/.../gunicorn/gunicorn/workers/gthread.py", line 205, in run
self.set_accept_enabled(new_connections_still_accepted)
File "/.../gunicorn/gunicorn/workers/gthread.py", line 133, in set_accept_enabled
method(sock, event, self.accept)
File ".../.pyenv/versions/3.11.7/lib/python3.11/selectors.py", line 261, in modify
key = self.register(fileobj, events, data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../.pyenv/versions/3.11.7/lib/python3.11/selectors.py", line 518, in register
key = super().register(fileobj, events, data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../.pyenv/versions/3.11.7/lib/python3.11/selectors.py", line 236, in register
raise ValueError("Invalid events: {!r}".format(events))
ValueError: Invalid events: 0
UPDATE: with less concurrency, e.g. 10, it works fine.
let me know if I can help further with this
Thanks for confirming @javiertejero ! Based on your stack trace (thank you very much!) I think I have corrected the error. If you have the ability to test the latest patch set, I would be more than grateful!
The error was my usage of DefaultSelector().register, which behaves differently depending on whether one is running Linux (which uses EpollSelector) or Mac (which uses KqueueSelector). The former accepts events set to 0 (meaning no events), while the latter doesn't.
The fix is to not try to be smart, but to simply use DefaultSelector().unregister when we're no longer interested in accepting new connections :) This way, we won't rely on any implementation-specific behavior. Perhaps the code even got a bit clearer...
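As a minimal sketch of that fix (illustrative only; set_accept_enabled appears in the traceback above, but the body here is a guess at the shape rather than the actual patch):

import selectors

def set_accept_enabled(self, enabled):
    # Toggle accepting by registering/unregistering the listener sockets,
    # instead of modifying their registration with an empty event mask.
    for sock in self.sockets:
        if enabled:
            self.poller.register(sock, selectors.EVENT_READ, self.accept)
        else:
            self.poller.unregister(sock)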
@sylt I don't think we are allowed to call DefaultSelector.unregister twice in a row (1. at the bottom of the loop, 2. immediately after).
@pajod Ah, yes, you're right! We could get in that situation if the server is terminated right at the time when it's very busy. I've tried to address it in the latest patch. Thanks!
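For anyone curious why calling it twice is a problem: selectors raise KeyError when unregistering a file object that is not currently registered, as a tiny standalone snippet shows:

import selectors
import socket

sel = selectors.DefaultSelector()
a, b = socket.socketpair()
sel.register(a, selectors.EVENT_READ)
sel.unregister(a)
sel.unregister(a)  # raises KeyError: the socket is no longer registered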
Sorry, this does not merge cleanly because of my #3189 -- the fix is to apply those changes like this:
diff --git a/gunicorn/workers/gthread.py b/gunicorn/workers/gthread.py
index 49946d77..196759b8 100644
--- a/gunicorn/workers/gthread.py
+++ b/gunicorn/workers/gthread.py
@@ -1 +0,0 @@
-# -*- coding: utf-8 -
@@ -34 +33 @@ from ..http import wsgi
-class TConn(object):
+class TConn:
@@ -145 +144 @@ class ThreadWorker(base.Worker):
- except EnvironmentError as e:
+ except OSError as e:
@@ -264 +263 @@ class ThreadWorker(base.Worker):
- except EnvironmentError as e:
+ except OSError as e:
@@ -318 +317 @@ class ThreadWorker(base.Worker):
- except EnvironmentError:
+ except OSError:
@@ -329 +328 @@ class ThreadWorker(base.Worker):
- except EnvironmentError:
+ except OSError:
> I think I have corrected the error
Sorry for the delay, just tested on macOS again and it works now, thanks @sylt.
Having this merged and included in a published version would be awesome. @benoitc any chance it happens in the (somewhat) near future?
Thanks
@benoitc , putting this here to get this into your review list
cc: @sylt @javiertejero @pajod
I'm occasionally (though infrequently) getting deadlocks with "gevent".
This pull request is aimed at improving "gthread". It only modifies gunicorn/workers/gthread.py
Could any of the improvements/techniques here also apply to other worker_classes?