Race condition in sync implemention delays closing connections
When this line:
https://github.com/python-websockets/websockets/blob/8c9f6fc27bb54610f1c389d7769cddc1504ca57e/src/websockets/sync/connection.py#L814
runs before this line:
https://github.com/python-websockets/websockets/blob/8c9f6fc27bb54610f1c389d7769cddc1504ca57e/src/websockets/sync/client.py#L384
closing the socket doesn't interrupt recv() because recv() isn't running in the main thread — and interruption is based on signals which are only received by the main thread.
First discovered in #1592.
Hi,
I ran some tests, and I think that by running the WS client itself in a secondary thread (not the one handling interrupts), this race condition can be avoided. Here's an example client code:
import threading
import time
from websockets import ConnectionClosedOK, State, WebSocketException
from websockets.sync.client import connect, ClientConnection
class MyGlobals:
ws_client: ClientConnection | None = None
def client() -> None:
print("client: started")
ws_client = connect("ws://localhost:8765", close_timeout=1)
# make the ClientConnection instance accessible from outside
MyGlobals.ws_client = ws_client
with ws_client:
ws_client.send("hello")
while True:
try:
msg = ws_client.recv()
print(f"client: received '{msg}'")
except ConnectionClosedOK as exc:
print("client: connection closed\n", exc)
except WebSocketException as exc:
print("client: websocket error:\n", exc)
if ws_client.state != State.OPEN:
break
print("client: done")
def main() -> None:
print("main: started")
client_thread = threading.Thread(target=client)
client_thread.start()
try:
while True:
time.sleep(3600)
except KeyboardInterrupt:
print("\nmain: received KeyboardInterrupt")
if MyGlobals.ws_client is not None:
print("main: closing connection")
# MyGlobals.ws_client.socket.close() # case 1
# MyGlobals.ws_client.close_socket() # case 2
MyGlobals.ws_client.close(reason="leaving") # case 3 (exit cleanly)
client_thread.join()
print("main: done")
if __name__ == "__main__":
main()
The delays and errors were different depending on which way I chose to close the connection from the client side.
Case 1: closing with ClientConnection.socket.close()
With an echo server on the other end, when I try to interrupt the script (Ctrl-C), it hangs for several seconds, then throws EOFError (maybe from the keepalive thread? I'm not sure). Below is the example when closing with .socket.close():
main: started
client: started
client: received 'hello'
^C
main: received KeyboardInterrupt
main: closing connection
unexpected internal error
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/websockets/sync/connection.py", line 835, in recv_events
self.protocol.receive_data(data)
File "/usr/local/lib/python3.12/site-packages/websockets/protocol.py", line 267, in receive_data
self.reader.feed_data(data)
File "/usr/local/lib/python3.12/site-packages/websockets/streams.py", line 129, in feed_data
raise EOFError("stream ended")
EOFError: stream ended
client: websocket error:
no close frame received or sent
client: done
main: done
and the server side thread throws ConnectionClosedError:
connection handler failed
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/websockets/sync/server.py", line 609, in conn_handler
handler(connection)
File "/testws/example_server.py", line 9, in echo
for message in websocket:
^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/websockets/sync/connection.py", line 242, in __iter__
yield self.recv()
^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/websockets/sync/connection.py", line 336, in recv
raise self.protocol.close_exc from self.recv_exc
websockets.exceptions.ConnectionClosedError: no close frame received or sent
Case 2: closing with ClientConnection.close_socket()
In the above script, if I use ws_client.close_socket() instead, it exits immediately. So, this is a better option. However, connection loops exit with exceptions.
Client output:
main: started
client: started
client: received 'hello'
^C
main: received KeyboardInterrupt
main: closing connection
client: websocket error:
no close frame received or sent
client: done
main: done
The server-side thread exits with the same error as in Case 1.
Case 3: closing with ClientConnection.close()
This is the cleanest one: the client closes the connection cleanly and exits without delays. Client output:
main: started
client: started
client: received 'hello'
^C
main: received KeyboardInterrupt
main: closing connection
client: connection closed
sent 1000 (OK) leaving; then received 1000 (OK) leaving
client: done
main: done
And no errors on the server side.
This race condition affects the opening handshake only. Your tests are after the opening handshake so they don't reproduce it.
Also, it's a race condition, meaning that it doesn't happen every time. It depends which one of the two threads (the user thread that called connect and the background thread that manages the TCP connection) resumes running first.
Thanks for clarifying the issue. Indeed, I was testing ways of gracefully closing the WebSockets sync client running in a secondary thread. I noticed the delay (and exceptions) when closing the connection and came here assuming this issue discusses the same situation. Turns out that's not the case.
In my tests, I ran the client in a secondary thread, and did not find an example of terminating such a connection after calling ClientConnection.recv() on the quickstart and FAQ pages.
In case someone stumbles into my example above. I looked further into the docs and saw that ClientConnection.recv() accepts an optional timeout parameter, which gives a way of properly exiting the loop:
# ...
def client(exit_event: threading.Event) -> None: # pass exit_event from the calling (main) thread
# ...
with ws_client:
ws_client.send("hello")
while not exit_event.is_set(): # the calling thread can end this loop with exit_event
try:
try:
a = ws_client.recv(timeout=0.5) # use timeout
except TimeoutError:
continue
print(f"client: received '{a}'")
except ...
I am still curious about this race condition during the handshake, so I will look into the source code to try to understand it.
Unfortunately, I couldn't reproduce this "delay." Before I explain what I did, let me confirm if I understood the bug correctly. When we call connect() (from websockets.sync.client):
- a TCP connection with the server gets established;
- the second thread is initiated when
create_connection()is called at websockets/sync/client.py#L364; - and it reaches the try-except block in websockets/sync/client.py#L377-L386;
- when the handshake fails,
connection.close_socket()is called on line 384; - and this call could get stuck if at the same time the other thread is already awaiting for data at websockets/sync/connection.py#L814
Is this correct? In any case, are there additional details about this bug? E.g., under what kind of conditions it was observed, for how long the script gets stuck (or does it get blocked indefinitely), and does it happen on specific types of connections (TCP, UDS).
My results
Somehow, in my tests blocking/delay does not happen. To trigger the described conditions, in connection.py I put a block right before the self.socket.recv() call:
# inside def recv_events()
my_event.wait() # threading.Event, blocks the thread to ensure socket.recv() is called at the desired moment
print(f"{time.monotonic()} socket.recv() ...")
data = self.socket.recv(self.recv_bufsize)
print(f"{time.monotonic()} socket.recv() done")
# ...
And in client.py, I forced the handshake failure and the socket.recv() call (in another thread) by inserting these lines:
try:
raise Exception # go straight to the exception
connection.handshake(
additional_headers,
user_agent_header,
deadline.timeout(),
)
except Exception:
my_event.set() # unblock the socket thread
time.sleep(1) # just to be sure the other thread calls socket.recv() before the next line
print(f"{time.monotonic()} connection.close_socket() ...")
connection.close_socket()
print(f"{time.monotonic()} connection.close_socket() done")
connection.recv_events_thread.join()
print(f"{time.monotonic()} recv_events_thread joined")
raise
And this is the execution order I get:
424723.400321678 socket.recv() ...
424724.401619214 connection.close_socket() ...
424724.402313408 connection.close_socket() done
424724.402342283 socket.recv() done
424724.402656934 recv_events_thread joined
There is a one-second delay here, because of the inserted time.sleep(1), but I did not see additional delays.
By placing my_event.set() inside close_socket() to see if the race condition can be triggered if some of the lines in close_socket() get executed before the socket.recv() call in another thread, but could trigger a delayed shutdown. In my tests, the blocked socket.recv() always returned right after self.socket.shutdown(socket.SHUT_RDWR) in the client thread.
My env:
- Debian 12;
- Python 3.11 and 3.12;
- I tried with a normal WS server, and a server process that only accepts the connections but does not send any data.