websockets icon indicating copy to clipboard operation
websockets copied to clipboard

Race condition in sync implemention delays closing connections

Open aaugustin opened this issue 10 months ago • 4 comments

When this line:

https://github.com/python-websockets/websockets/blob/8c9f6fc27bb54610f1c389d7769cddc1504ca57e/src/websockets/sync/connection.py#L814

runs before this line:

https://github.com/python-websockets/websockets/blob/8c9f6fc27bb54610f1c389d7769cddc1504ca57e/src/websockets/sync/client.py#L384

closing the socket doesn't interrupt recv() because recv() isn't running in the main thread — and interruption is based on signals which are only received by the main thread.

First discovered in #1592.

aaugustin avatar Feb 16 '25 10:02 aaugustin

Hi,

I ran some tests, and I think that by running the WS client itself in a secondary thread (not the one handling interrupts), this race condition can be avoided. Here's an example client code:

import threading
import time
from websockets import ConnectionClosedOK, State, WebSocketException
from websockets.sync.client import connect, ClientConnection

class MyGlobals:
    ws_client: ClientConnection | None = None

def client() -> None:
    print("client: started")
    ws_client = connect("ws://localhost:8765", close_timeout=1)

    # make the ClientConnection instance accessible from outside
    MyGlobals.ws_client = ws_client

    with ws_client:
        ws_client.send("hello")
        while True:
            try:
                msg = ws_client.recv()
                print(f"client: received '{msg}'")
            except ConnectionClosedOK as exc:
                print("client: connection closed\n", exc)
            except WebSocketException as exc:
                print("client: websocket error:\n", exc)

            if ws_client.state != State.OPEN:
                break

    print("client: done")

def main() -> None:
    print("main: started")
    client_thread = threading.Thread(target=client)
    client_thread.start()

    try:
        while True:
            time.sleep(3600)
    except KeyboardInterrupt:
        print("\nmain: received KeyboardInterrupt")

    if MyGlobals.ws_client is not None:
        print("main: closing connection")
        # MyGlobals.ws_client.socket.close()  # case 1
        # MyGlobals.ws_client.close_socket()  # case 2
        MyGlobals.ws_client.close(reason="leaving")  # case 3 (exit cleanly)

    client_thread.join()
    print("main: done")

if __name__ == "__main__":
    main()

The delays and errors were different depending on which way I chose to close the connection from the client side.

Case 1: closing with ClientConnection.socket.close()

With an echo server on the other end, when I try to interrupt the script (Ctrl-C), it hangs for several seconds, then throws EOFError (maybe from the keepalive thread? I'm not sure). Below is the example when closing with .socket.close():

main: started
client: started
client: received 'hello'
^C
main: received KeyboardInterrupt
main: closing connection
unexpected internal error
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/websockets/sync/connection.py", line 835, in recv_events
    self.protocol.receive_data(data)
  File "/usr/local/lib/python3.12/site-packages/websockets/protocol.py", line 267, in receive_data
    self.reader.feed_data(data)
  File "/usr/local/lib/python3.12/site-packages/websockets/streams.py", line 129, in feed_data
    raise EOFError("stream ended")
EOFError: stream ended
client: websocket error:
 no close frame received or sent
client: done
main: done

and the server side thread throws ConnectionClosedError:

connection handler failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/websockets/sync/server.py", line 609, in conn_handler
    handler(connection)
  File "/testws/example_server.py", line 9, in echo
    for message in websocket:
                   ^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/websockets/sync/connection.py", line 242, in __iter__
    yield self.recv()
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/websockets/sync/connection.py", line 336, in recv
    raise self.protocol.close_exc from self.recv_exc
websockets.exceptions.ConnectionClosedError: no close frame received or sent

Case 2: closing with ClientConnection.close_socket()

In the above script, if I use ws_client.close_socket() instead, it exits immediately. So, this is a better option. However, connection loops exit with exceptions.

Client output:

main: started
client: started
client: received 'hello'
^C
main: received KeyboardInterrupt
main: closing connection
client: websocket error:
 no close frame received or sent
client: done
main: done

The server-side thread exits with the same error as in Case 1.

Case 3: closing with ClientConnection.close()

This is the cleanest one: the client closes the connection cleanly and exits without delays. Client output:

main: started
client: started
client: received 'hello'
^C
main: received KeyboardInterrupt
main: closing connection
client: connection closed
 sent 1000 (OK) leaving; then received 1000 (OK) leaving
client: done
main: done

And no errors on the server side.

darkhaniop avatar Apr 04 '25 12:04 darkhaniop

This race condition affects the opening handshake only. Your tests are after the opening handshake so they don't reproduce it.

Also, it's a race condition, meaning that it doesn't happen every time. It depends which one of the two threads (the user thread that called connect and the background thread that manages the TCP connection) resumes running first.

aaugustin avatar Apr 05 '25 06:04 aaugustin

Thanks for clarifying the issue. Indeed, I was testing ways of gracefully closing the WebSockets sync client running in a secondary thread. I noticed the delay (and exceptions) when closing the connection and came here assuming this issue discusses the same situation. Turns out that's not the case.

In my tests, I ran the client in a secondary thread, and did not find an example of terminating such a connection after calling ClientConnection.recv() on the quickstart and FAQ pages.

In case someone stumbles into my example above. I looked further into the docs and saw that ClientConnection.recv() accepts an optional timeout parameter, which gives a way of properly exiting the loop:

# ...
def client(exit_event: threading.Event) -> None:  # pass exit_event from the calling (main) thread
    # ...
    with ws_client:
        ws_client.send("hello")
        while not exit_event.is_set():  # the calling thread can end this loop with exit_event
            try:
                try:
                    a = ws_client.recv(timeout=0.5)  # use timeout
                except TimeoutError:
                    continue
                print(f"client: received '{a}'")
            except ...

I am still curious about this race condition during the handshake, so I will look into the source code to try to understand it.

darkhaniop avatar Apr 05 '25 11:04 darkhaniop

Unfortunately, I couldn't reproduce this "delay." Before I explain what I did, let me confirm if I understood the bug correctly. When we call connect() (from websockets.sync.client):

Is this correct? In any case, are there additional details about this bug? E.g., under what kind of conditions it was observed, for how long the script gets stuck (or does it get blocked indefinitely), and does it happen on specific types of connections (TCP, UDS).

My results

Somehow, in my tests blocking/delay does not happen. To trigger the described conditions, in connection.py I put a block right before the self.socket.recv() call:

# inside def recv_events()
my_event.wait()  # threading.Event, blocks the thread to ensure socket.recv() is called at the desired moment
print(f"{time.monotonic()} socket.recv() ...")
data = self.socket.recv(self.recv_bufsize)
print(f"{time.monotonic()} socket.recv() done")
# ...

And in client.py, I forced the handshake failure and the socket.recv() call (in another thread) by inserting these lines:

    try:
        raise Exception  # go straight to the exception
        connection.handshake(
            additional_headers,
            user_agent_header,
            deadline.timeout(),
        )
    except Exception:
        my_event.set()  # unblock the socket thread
        time.sleep(1)  # just to be sure the other thread calls socket.recv() before the next line
        print(f"{time.monotonic()} connection.close_socket() ...")
        connection.close_socket()
        print(f"{time.monotonic()} connection.close_socket() done")
        connection.recv_events_thread.join()
        print(f"{time.monotonic()} recv_events_thread joined")
        raise

And this is the execution order I get:

424723.400321678 socket.recv() ...
424724.401619214 connection.close_socket() ...
424724.402313408 connection.close_socket() done
424724.402342283 socket.recv() done
424724.402656934 recv_events_thread joined

There is a one-second delay here, because of the inserted time.sleep(1), but I did not see additional delays.

By placing my_event.set() inside close_socket() to see if the race condition can be triggered if some of the lines in close_socket() get executed before the socket.recv() call in another thread, but could trigger a delayed shutdown. In my tests, the blocked socket.recv() always returned right after self.socket.shutdown(socket.SHUT_RDWR) in the client thread.

My env:

  • Debian 12;
  • Python 3.11 and 3.12;
  • I tried with a normal WS server, and a server process that only accepts the connections but does not send any data.

darkhaniop avatar Apr 05 '25 17:04 darkhaniop