
100% CPU in main thread due to not closing properly? (channel.connected == False)

Open djay opened this issue 11 months ago • 93 comments

Following on from debugging in this issue - https://github.com/collective/haufe.requestmonitoring/issues/15

What we see is waitress switching to 100% CPU and staying there. It happens randomly in production (within a week) and we haven't traced it back to a particular request.

Using a sampling profiler on waitress with 2 threads (in prod), we identified the thread using the CPU as the main thread (via top -H), and this is the profile. Note that since this is prod there are other requests, so not all activity is related to the looping causing this bug.

 Austin  TUI   Wall Time Profile                                                                                             CPU  99% ▇█▇█▇▇▇▇   MEM 263M ████████    5/5
   _________   Command /app/bin/python3.9 /app/parts/instance/bin/interpreter /app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py /app/parts/instance/etc/wsgi.ini
   ⎝__⎠ ⎝__⎠   Python 3.9.0     PID 3351466     PID:TID 3351466:11          
               Samples 1451365  ⏲️   3'16"       Threshold 0%   
  OWN    TOTAL    %OWN   %TOTAL  FUNCTION                                                                                                                                     
  00"    3'16"     0.0%  100.0%  └─ <module> (/app/parts/instance/bin/interpreter:326)                                                                                       ▒
  00"    3'16"     0.0%  100.0%     └─ <module> (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:255)                                                                    ▒
  00"    3'16"     0.0%  100.0%        └─ main (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:251)                                                                     ▒
  00"    3'16"     0.0%  100.0%           └─ run (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:217)                                                                   │
  00"    3'16"     0.0%  100.0%              └─ serve (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:203)                                                              │
  00"    3'16"     0.0%  100.0%                 └─ serve (/app/eggs/plone.recipe.zope2instance-6.11.0-py3.9.egg/plone/recipe/zope2instance/ctl.py:942)                       │
  00"    3'16"     0.0%  100.0%                    └─ serve_paste (/app/eggs/plone.recipe.zope2instance-6.11.0-py3.9.egg/plone/recipe/zope2instance/ctl.py:917)              │
  00"    3'16"     0.0%  100.0%                       └─ serve (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/__init__.py:19)                                                  │
  00"    3'16"     0.0%  100.0%                          └─ run (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:322)                                                  │
  05"    3'16"     2.5%   99.9%                             ├─ loop (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:245)                                           │
  36"     44"     18.3%   22.4%                             │  ├─ poll (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:158)                                        │
  05"     05"      2.4%    2.4%                             │  │  ├─ readable (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:290)                                    │
  01"     02"      0.4%    0.9%                             │  │  ├─ write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:117)                                    │
  01"     01"      0.4%    0.5%                             │  │  │  ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:517)                    │
  00"     00"      0.0%    0.0%                             │  │  │  │  ├─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)                          │
  00"     00"      0.0%    0.0%                             │  │  │  │  └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:95)                          │
  00"     00"      0.0%    0.0%                             │  │  │  ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:514)                    │
  00"     00"      0.0%    0.0%                             │  │  │  ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:515)                    │
  00"     00"      0.0%    0.0%                             │  │  │  │  └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)                          │
  00"     00"      0.0%    0.0%                             │  │  │  ├─ poll (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:150)                                  │
  00"     00"      0.0%    0.0%                             │  │  │  │  └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)                          │
  00"     00"      0.0%    0.0%                             │  │  │  └─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:509)                    │
  00"     00"      0.0%    0.0%                             │  │  │     └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)                          │
  00"     00"      0.0%    0.1%                             │  │  ├─ write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:113)                                    │
  00"     00"      0.1%    0.1%                             │  │  │  ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:517)                    │
  00"     00"      0.0%    0.0%                             │  │  │  │  └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)                          │
  00"     00"      0.0%    0.0%                             │  │  │  └─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:509)                    │
  00"     00"      0.1%    0.1%                             │  │  ├─ readable (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:154)   

From the profiling it looks like the channel is writable but channel.connected == False. So it goes into a loop without writing or closing, since it never actually does anything to the socket. https://github.com/Pylons/waitress/blob/main/src/waitress/channel.py#L98

EDIT: My suspicion would be that what started this was a client that shut down (half-close) very quickly after a connect, and this happened before the dispatcher finished being set up. This causes getpeername to fail with EINVAL and connected = False.

https://github.com/Pylons/waitress/blob/4f6789b035610e0552738cdc4b35ca809a592d48/src/waitress/wasyncore.py#L310

            try:
                self.addr = sock.getpeername()
            except OSError as err:
                if err.args[0] in (ENOTCONN, EINVAL):
                    # To handle the case where we got an unconnected
                    # socket.
                    self.connected = False
                else:
                    # The socket is broken in some unknown way, alert
                    # the user and remove it from the map (to prevent
                    # polling of broken sockets).
                    self.del_channel(map)
                    raise

Could be same issue as https://github.com/Pylons/waitress/issues/411 but hard to tell.

One fix is in #419, but there could be better ways?

djay avatar Sep 11 '23 10:09 djay

@d-maurer

The error is where self.connected was set to False. There, it should have been ensured that the corresponding "fileno" is removed from socket_map and that it will not be put there again (as long as self.connected remains False).

Something exceptional must have brought waitress into this state (otherwise, we would have lots of 100 % CPU usage reports). I assume that some bad client has used the system call shutdown to close only part of the socket connection and that waitress does not anticipate something like that.

Waitress does seem to properly close if shutdown is received (empty data). See https://github.com/Pylons/waitress/blob/main/src/waitress/wasyncore.py#L449

So I have to keep looking for a way connected can be False while it can still be trying to write. Yes, it is most likely bad actors; we get hit by this a lot in our line of business.

djay avatar Sep 11 '23 14:09 djay

Dylan Jay wrote at 2023-9-11 07:56 -0700:

@d-maurer

The error is where self.connected was set to False. There, it should have been ensured that the corresponding "fileno" is removed from socket_map and that it will not be put there again (as long as self.connected remains False).

Something exceptional must have brought waitress into this state (otherwise, we would have lots of 100 % CPU usage reports). I assume that some bad client has used the system call shutdown to close only part of the socket connection and that waitress does not anticipate something like that.

Waitress does seem to properly close if shutdown is received (empty data). See https://github.com/Pylons/waitress/blob/main/src/waitress/wasyncore.py#L449

I would expect an empty data read, if the sending part of the connection is shut down. However, the client might shut down the receiving part; for me, this does not suggest an empty data read.

d-maurer avatar Sep 11 '23 14:09 d-maurer

Dylan Jay wrote at 2023-9-11 07:56 -0700:

@d-maurer

The error is where self.connected was set to False. There, it should have been ensured that the corresponding "fileno" is removed from socket_map and that it will not be put there again (as long as self.connected remains False).

Something exceptional must have brought waitress into this state (otherwise, we would have lots of 100 % CPU usage reports). I assume that some bad client has used the system call shutdown to close only part of the socket connection and that waitress does not anticipate something like that.

Waitress does seem to properly close if shutdown is received (empty data). See https://github.com/Pylons/waitress/blob/main/src/waitress/wasyncore.py#L449

I have an idea of how to fix the inconsistency without understanding how it came into being: handle_write_event is (likely) only called by the loop. Thus, if it is called and the connection is closed, it can detect an inconsistency (a closed connection should not get events). It can resolve the inconsistency by deregistering from the loop.

d-maurer avatar Sep 11 '23 16:09 d-maurer

@d-maurer Yes, it could work to insert a self.del_channel() call at https://github.com/Pylons/waitress/blob/84360df4c5b4da7c72439bdbe919a84b3c619075/src/waitress/channel.py#L97
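A minimal sketch of that idea (illustrative only, not the exact diff in #419): the existing guard at the top of HTTPChannel.handle_write simply returns when the channel is not connected; the change deregisters the channel instead, so the loop stops selecting on the dead fd.

# Sketch only; the real handle_write in waitress/channel.py continues with
# the full flush logic below this guard.
def handle_write(self):
    if not self.connected:
        # A write event for a dead connection: remove the channel from the
        # socket map so the main loop stops reporting this fd forever.
        self.del_channel()
        return
    # ... existing flush / close_when_flushed handling continues here ...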

djay avatar Sep 11 '23 16:09 djay

@d-maurer I created a PR; it passes the current tests and they hit that line, but it's hard to know how to write a test for this scenario...

https://github.com/Pylons/waitress/pull/419

djay avatar Sep 11 '23 17:09 djay

Dylan Jay wrote at 2023-9-11 10:10 -0700:

@d-maurer I created a PR; it passes the current tests and they hit that line, but it's hard to know how to write a test for this scenario...

https://github.com/Pylons/waitress/pull/419

A test should not be that difficult: you register a write handler for a connection and then set connected to False. In the next loop run, there will be a write event and you can check that it does not end in an infinite loop.

Of course, the test only verifies that a non-connected connection does not lead to an infinite loop. It does not try to set up a realistic case for setting connected to False.
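A rough sketch of such a test (an assumption-laden illustration, not the PR's test): it uses waitress's vendored wasyncore with a plain dispatcher over a socketpair, and bakes the proposed "deregister on write while unconnected" behavior into the test dispatcher itself so the sketch is self-contained.

import socket

from waitress import wasyncore


class FixedDispatcher(wasyncore.dispatcher):
    """Dispatcher that applies the proposed fix locally for the test."""

    def writable(self):
        return True  # pretend output is always pending

    def handle_write(self):
        if not self.connected:
            self.del_channel()  # deregister instead of spinning forever


def test_unconnected_but_writable_channel_is_deregistered():
    sock_map = {}
    a, b = socket.socketpair()
    disp = FixedDispatcher(a, map=sock_map)
    fd = disp._fileno
    disp.connected = False  # force the inconsistent state
    # One pass of the loop: select() reports the fd writable and the fix
    # must drop the channel from the map rather than loop on it forever.
    wasyncore.loop(timeout=0.1, map=sock_map, count=1)
    assert fd not in sock_map
    a.close()
    b.close()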

d-maurer avatar Sep 11 '23 17:09 d-maurer

@mcdonc Another solution, instead of #419, might be the one below. Is that preferable?

def poll(timeout=0.0, map=None):
    if map is None:  # pragma: no cover
        map = socket_map
    if map:
        r = []
        w = []
        e = []
        for fd, obj in list(map.items()):  # list() call FBO py3
            # prevent getting into a loop for sockets disconnected but not properly closed.
            if obj.check_client_disconnected():
                obj.del_channel()
                continue

perhaps you have a better idea on how it could have got into this knot and the best way to test?

djay avatar Sep 12 '23 01:09 djay

@mcdonc one code path that could perhaps lead to this is

https://github.com/Pylons/waitress/blob/4f6789b035610e0552738cdc4b35ca809a592d48/src/waitress/wasyncore.py#L313

since connecting == False as well, there doesn't seem to be a way for it to write data out or close?

EDIT: One scenario could be that the client half-disconnected very quickly, before the dispatcher was set up, so getpeername fails? But somehow the socket can still be written to?

djay avatar Sep 12 '23 01:09 djay

Looks like it is possible for a connection that has been broken before getpeername to then not show any error in select, in the case where there is nothing to read (since reading would result in a close). https://stackoverflow.com/questions/13257047/python-select-does-not-catch-broken-socket. Not sure how it has something to write in that case. Maybe a shutdown for read only, very quickly?

EDIT: https://man7.org/linux/man-pages/man3/getpeername.3p.html

"EINVAL The socket has been shut down." <- so looks like shutdown for read very quickly seems possible to create this tight loop.

djay avatar Sep 12 '23 04:09 djay

Or somehow the getpeername is invalid and that results in an OSError, and there is nothing to read but something to write. But I'm not sure if that results in the EINVAL or not.

djay avatar Sep 12 '23 04:09 djay

Dylan Jay wrote at 2023-9-11 18:35 -0700:

@mcdonc Another solution, instead of #419, might be the one below. Is that preferable?

def poll(timeout=0.0, map=None):
    if map is None:  # pragma: no cover
        map = socket_map
    if map:
        r = []
        w = []
        e = []
        for fd, obj in list(map.items()):  # list() call FBO py3
            # prevent getting into a loop for sockets disconnected but not properly closed.
            if obj.check_client_disconnected():
                obj.del_channel()
                continue

Not sure. It, too, uses del_channel, suggesting "delete channel". While "delete channel" likely removes the channel from the socket map, it likely does more than that, and maybe things not best placed into poll.

Another point: the map may contain objects without del_channel (e.g. non-channels) and without check_client_disconnected. In those cases, the code above would raise an exception and bring your service down.

Currently, I favor the following reasoning:

You have detected that a write event for a "non connected" connection leads to a busy loop. We must do something about it (either prevent the case from happening or clean up locally).

For the local cleanup, I proposed to unregister the "write event handler". You have translated this into a call to del_channel, likely because the loop does not support "write event handlers". The loop API only supports the registration of objects (with fileno, readable and writable, ... methods); those objects' writable method indicates whether the object is interested in write event notifications.

I propose to implement "unregister the write event handler" not by a del_channel call but by a modification of writable: ensure it returns false for a non connected connection.
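A minimal sketch of that workaround (illustrative, not the actual patch), using the writable definition from channel.py quoted later in this thread:

def writable(self):
    if not self.connected:
        # A channel that has lost its connection can never flush its
        # output, so stop asking the loop for write events.
        return False
    return self.total_outbufs_len or self.will_close or self.close_when_flushed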

In the long run waitress should likely change its "connected" concept. HTTP is based on TCP, which implements bidirectional communication channels. The shutdown system call allows applications to shut down individual directions. This is not compatible with a boolean "connected"; instead we have four connection states: (fully) disconnected, read connected, write connected and (fully) connected.

I do not know whether important HTTP clients exist which use shutdown. Some might want to signal "no more input" by shutting down the outgoing communication channel, e.g. for a long download.

perhaps you have a better idea on how it could have got into this knot

My primary thought: It must be something exceptional which caused the situation. Otherwise, lots of 100 % CPU usage reports should have been seen.

Initially, I thought of a use of shutdown to (only) partially shut down the socket. Likely, such a use is unanticipated and may reveal bugs.

Meanwhile, I can think of other causes: Maybe, when the connection is closed while the response is being streamed back, an inconsistency can creep in. The "async" world should be quite robust against such types of problems (because all IO logic is serialized) but the system is connected to a network, and important network events (e.g. the closing of a communication channel) can happen at any time and cause unanticipated exceptions changing the normal control flow.

and the best way to test?

We have here the case that we cannot reproduce the real situation (because we do not know how the inconsistency was introduced). We therefore only implement a workaround.

My favorite workaround (ensure "writable" returns "False" when "not connected") is so simple that no test is necessary to convince us that the busy loop is avoided.

The workaround may not solve all problems. For example, it may keep a socket in use which should in principle have been closed.

d-maurer avatar Sep 12 '23 06:09 d-maurer

Dylan Jay wrote at 2023-9-11 21:59 -0700:

or somethow the getpeername is invalid and that results in a oserror. and there is nothing to read but something to write. but I'm not sure if that results in the EINVAL or not.

I do not think that this goes in the right direction: waitress is an HTTP server and (unlike e.g. a telnet server) it produces output only after it has received input. Thus, I expect that the problematic connection has once been in a connected state (to read something and then to produce output).

d-maurer avatar Sep 12 '23 06:09 d-maurer

@d-maurer I'm fairly sure I have one solid explanation of how this could occur.

Outlined in this test - https://github.com/Pylons/waitress/pull/419/files#diff-5938662f28fcbb376792258701d0b6c21ec8a1232dada6ad2ca0ea97d4043d96R775

NOTE: I haven't worked out a way to detect the looping in a test yet. So the assert at the end is not correct.

It is as you say. There is a shutdown of the read side only, but this is a race condition: it has to happen before the dispatcher is created, so right after the connect. I've confirmed this results in getpeername returning OSError EINVAL and thus connected = False, and the select still thinks it can write, so the loop will be infinite. Or maybe until the bad actor properly closes the connection; not sure on that one.

In the long run waitress should likely change its "connected" concept. HTTP is based on TCP which implements bidirectional communication channels. The shutdown system call allows applications to shut down individual directions. This is not compatible with a boolean "connected", instead we have 4 connection states: (fully) disconnected, read connected, write connected and (fully) connected.

True, but if I'm right about the cause of this, the socket would never have connected = False with most shutdowns, only when it happens too quickly. That flag is mostly used to indicate not yet connected or in the process of closing.

My favorite workaround (ensure "writable" returns "False" when "not connected") is so simple that no test is necessary to convince us that the busy loop is avoided.

Yes, that will also work; I'll switch it to that. There is a system to remove inactive sockets, so I guess that would get them closed eventually. I'm not really sure of the pros and cons of having sockets left open vs. the consequences of just closing them in this case (I tried this; it also worked in terms of the tests).

djay avatar Sep 12 '23 10:09 djay

@d-maurer I pushed new code that uses writable instead.

djay avatar Sep 12 '23 11:09 djay

Dylan Jay wrote at 2023-9-12 03:59 -0700:

... It is as you say. There is a shutdown of the read side only, but this is a race condition: it has to happen before the dispatcher is created, so right after the connect. I've confirmed this results in getpeername returning OSError EINVAL and thus connected = False, and the select still thinks it can write, so the loop will be infinite.

I do not know waitress implementation details BUT in general, write notifications are called for only AFTER output has been generated (i.e. writable will only return True once data to be written has been generated).

As explained earlier, an HTTP server first reads data from a connection before it writes to the connection.

If you are right with your assumption above, then reading has been possible (despite a "not connected") and output was generated based on this input.

Or maybe until the bad actor properly closes the connection; not sure on that one.

The connection's writable must also return True (otherwise, the corresponding fd will not be included in writefds). Usually, this would happen if it is known that there is data to be output.

d-maurer avatar Sep 12 '23 12:09 d-maurer

@d-maurer maybe a core contributor can step in and advise the best solution and test. @digitalresistor @kgaughan ?

djay avatar Sep 13 '23 03:09 djay

Dylan Jay wrote at 2023-9-12 20:02 -0700:

@d-maurer maybe a core contributor can step in and advise the best solution and test. @digitalresistor @kgaughan ?

I had a closer look at the code and I think I found a realistic scenario to enter the busy loop state: If HTTPChannel.read reads empty data, it sets connected to False; if there is pending output at that time we are in the busy loop state.

We can force HTTPChannel.read to read empty data by letting the HTTP client shut down its sending direction. Once all data has been read by the receiving side, its next recv will return empty data. A normal close (rather than shutdown) might have a similar effect.

The hypothesis can be checked in the following way: Design an HTTP service to produce sufficient output to saturate the output channel. Design an HTTP client: it sends a request to the service (but does not read the response), waits sufficiently long that the service has produced its output, then shuts down the writing direction of its HTTP connection (maybe just closes its HTTP connection). Check whether this brings waitress into the busy loop state.
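For illustration, the client side of that check might look roughly like this (path, port and timings are placeholders):

import socket
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("localhost", 8080))
s.sendall(b"GET /large-response HTTP/1.1\r\nHost: localhost\r\n\r\n")
time.sleep(5)               # let the service fill waitress's output buffers
s.shutdown(socket.SHUT_WR)  # or s.close(); the response is never read
time.sleep(120)             # watch the server's CPU while the socket lingers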

d-maurer avatar Sep 13 '23 05:09 d-maurer

@d-maurer that was my initial thought but as I pointed out in https://github.com/Pylons/waitress/issues/418#issuecomment-1714057512, recv in wasyncore will do handle_close on getting empty data and take it out of the map, so I couldn't see any way for no bytes being sent to cause this loop.

    def recv(self, buffer_size):
        try:
            data = self.socket.recv(buffer_size)
            if not data:
                # a closed connection is indicated by signaling
                # a read condition, and having recv() return 0.
                self.handle_close()
                return b""
            else:
                return data
        except OSError as why:
            # winsock sometimes raises ENOTCONN
            if why.args[0] in _DISCONNECTED:
                self.handle_close()
                return b""
            else:
                raise

djay avatar Sep 13 '23 07:09 djay

Also when I did some testing it did seem like the select would indicate a write was possible even without the back end producing any data. So there is no read needed. Just a connect and very quick shutdown. But I do have to work out a proper test for that.

djay avatar Sep 13 '23 07:09 djay

Dylan Jay wrote at 2023-9-13 00:07 -0700:

@d-maurer that was my initial thought but as I pointed out in https://github.com/Pylons/waitress/issues/418#issuecomment-1714057512, recv in wasyncore will do handle_close on getting empty data and take it out of the map, so I couldn't see any way for no bytes being sent to cause this loop.

You are right! I missed (had forgotten) that.

d-maurer avatar Sep 13 '23 07:09 d-maurer

Dylan Jay wrote at 2023-9-13 00:11 -0700:

Also when I did some testing it did seem like the select would indicate a write was possible even without the back end producing any data.

A select will almost always report a possible write. For a "normal" socket, the only exception is when the write buffer is saturated. That's why writable must return False unless there is data to write (or the handle_write will be able to clean up the state).

So there is no read needed. Just a connect and very quick shutdown. But I do have to work out a proper test for that.

Only if waitress defines its writable in a non-standard way: typically, writable would only return True if output was pending.

In channel.py writable is defined as:

        return self.total_outbufs_len or self.will_close or self.close_when_flushed

Thus, it is not completely standard. However, as far as I could see, will_close and close_when_flushed can only be set during request processing, i.e. after input has been received.

d-maurer avatar Sep 13 '23 07:09 d-maurer

Dylan Jay wrote at 2023-9-13 00:07 -0700:

@d-maurer that was my initial thought but as I pointed out in https://github.com/Pylons/waitress/issues/418#issuecomment-1714057512, recv in wasyncore will do handle_close on getting empty data and take it out of the map, so I couldn't see any way for no bytes being sent to cause this loop.

I have meanwhile read the Python socket HOWTO (--> "https://docs.python.org/3/howto/sockets.html#socket-howto"). It recommends (in the "Disconnecting" section) to operate in an HTTP-like exchange: send the request and then use shutdown(1) to indicate "I (the client) will produce no more output but am still ready for input".
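The recommended exchange looks roughly like this (illustrative only; host and port are placeholders):

import socket

s = socket.create_connection(("localhost", 8080))
s.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n")
s.shutdown(socket.SHUT_WR)      # "I will produce no more output"
response = b""
while chunk := s.recv(4096):    # "... but am still ready for input"
    response += chunk
s.close()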

The behavior of waitress you point out above (close as soon as there is no more input) will not play well with this recommendation.

d-maurer avatar Sep 13 '23 08:09 d-maurer

@d-maurer that would be a different bug in waitress.

My problem is I run out of CPU on my servers if I don't restart them often, due to these weird requests we are receiving. That no one else in the world seems to get :(

djay avatar Sep 13 '23 11:09 djay

Dylan Jay wrote at 2023-9-13 04:26 -0700:

... My problem is I run out of CPU on my servers if I don't restart them often, due to these weird requests we are receiving. That no one else in the world seems to get :(

Would you share the version of waitress with which you are observing the behavior?

d-maurer avatar Sep 13 '23 11:09 d-maurer

Dieter Maurer wrote at 2023-9-13 09:57 +0200:

Dylan Jay wrote at 2023-9-13 00:11 -0700:

Also when I did some testing it did seem like the select would indicate a write was possible even without the back end producing any data.

A select will almost always report a possible write. For a "normal" socket, the only exception is when the write buffer is saturated. That's why writable must return False unless there is data to write (or the handle_write will be able to clean up the state).

So there is no read needed. Just a connect and very quick shutdown. But I do have to work out a proper test for that.

Only if waitress defines its writable in a non-standard way: typically, writable would only return True if output was pending.

In channel.py writable is defined as:

       return self.total_outbufs_len or self.will_close or self.close_when_flushed

Thus, it is not completely standard. However, as far as I could see, will_close and close_when_flushed can only be set during request processing, i.e. after input has been received.

will_close can be set by server.BaseWSGIServer.maintenance, i.e. independent of a task/request. Thus, you might be right with your hypothesis: connected set to false in wasyncore.dispatcher.__init__ due to a connection race condition; later busy loop due to not connected and writable.

You could verify this as follows:

Add a sufficiently large sleep into wasyncore.dispatcher.__init__ before the getpeername call.

Open a connection to the server and immediately close it again.

The sleep should ensure that at the time of the getpeername call, the remote socket end is already closed (maybe, getpeername then fails with an exception and connected is set to False).

Wait sufficiently long (somewhat longer than cleanup_interval) to let maintenance set will_close.

If you are right, this will result in a busy loop.
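A sketch of the first step, assuming it is done by monkey-patching waitress's dispatcher rather than editing the source (the two-second delay and the helper name are illustrative):

import time

from waitress import wasyncore

_orig_init = wasyncore.dispatcher.__init__

def slow_init(self, sock=None, map=None):
    # Delay channel setup so the client has time to close its end before
    # the getpeername call inside dispatcher.__init__ runs.
    if sock is not None:
        time.sleep(2)
    _orig_init(self, sock, map)

wasyncore.dispatcher.__init__ = slow_init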

d-maurer avatar Sep 13 '23 12:09 d-maurer

Dieter Maurer wrote at 2023-9-13 14:26 +0200:

... will_close can be set by server.BaseWSGIServer.maintenance, i.e. independent of a task/request. Thus, you might be right with your hypothesis: connected set to false in wasyncore.dispatcher.__init__ due to a connection race condition; later busy loop due to not connected and writable.

You could verify this as follows:

Add a sufficiently large sleep into wasyncore.dispatcher.__init__ before the getpeername call.

Open a connection to the server and immediately close it again.

The sleep should ensure that at the time of the getpeername call, the remote socket end is already closed (maybe, getpeername then fails with an exception and connected is set to False).

Wait sufficiently long (somewhat longer than cleanup_interval) to let maintenance set will_close.

If you are right, this will result in a busy loop.

On my system (Linux, kernel 5.4.0), getpeername returns the peer address even after the socket's remote end has been closed. Verified via the following interactive code: server code:

from socket import socket, AF_INET, SOCK_STREAM

ss = socket(AF_INET, SOCK_STREAM)
ss.bind(("localhost", 10000))
ss.listen()
cs, addr = ss.accept()
# run the client code in a second interactive session
cs.getpeername()

client code:

from socket import socket, AF_INET, SOCK_STREAM
cs = socket(AF_INET, SOCK_STREAM)
cs.connect(("localhost", 10000))
cs.close()

d-maurer avatar Sep 13 '23 12:09 d-maurer

"EINVAL The socket has been shut down." <- so looks like shutdown for read very quickly seems possible to create this tight loop.

Note that a socket has 2 ends. "The socket has been shut down" might refer to the local (not the remote) end.

d-maurer avatar Sep 13 '23 12:09 d-maurer

@d-maurer that would be a different bug in waitress.

My problem is I run out of CPU on my servers if I don't restart them often, due to these weird requests we are receiving. That no one else in the world seems to get :(

I would highly recommend that you don't run waitress bare on the internet. Best practice is to place it behind a load balancer of some sort.

There are other scenarios in which Waitress does not deal well with certain request patterns, depending on what type of content you are serving with waitress (what your app generates, how it generates it, how large those responses are). Waitress, for example, does not deal well with clients that read very slowly if the response is larger than the various buffers it tries to use internally, thereby allowing a client to hold up an app thread directly (it can't pop from the WSGI app when the buffer is full/the high water mark is reached).

digitalresistor avatar Sep 14 '23 02:09 digitalresistor

@d-maurer that would be a different bug in waitress. My problem is I run out of CPU on my servers if I don't restart them often, due to these weird requests we are receiving. That no one else in the world seems to get :(

I would highly recommend that you don't run waitress bare on the internet. Best practice is to place it behind a load balancer of some sort.

@digitalresistor Of course it's not bare on the internet. There are 4 reverse proxies in front of it. Yet this bug still happens.

djay avatar Sep 14 '23 04:09 djay

@d-maurer You are right: shutdown doesn't seem to make getpeername fail. I've created a test and I can't get getpeername to fail yet.

https://github.com/djay/waitress/blob/5130bec0501567c809397cc3d8003c0af11fb46f/tests/test_server.py#L338

djay avatar Sep 14 '23 04:09 djay