websockets icon indicating copy to clipboard operation
websockets copied to clipboard

Untracable print "SSL connection is closed" and other attempts at decluttering stdout

Open aliqandil opened this issue 2 years ago • 7 comments

First of all, thanks for this package, It very fast and robust, I switched from a c++ code nchan.io to this, and had no regrets or memory leaks! I feel it's even faster!

OK so about the issue, this is an odd one: I just get "SSL connection is closed" printed and I can't find where it's coming from! I searched the phrase in this repository, and also set the server's log level to critical, no use!

So the best thing I can provide to help find this, is the environment I'm using, I'm running a 16 core production server with 8 workers that inherit the socket and ssl objects from their parent. With sometimes 100k simultaneous connections, I get all kinds of weird error messages. My intent is to just minimize the clutter I get so I can see if anything actually actionable went wrong with the server. Clients sending random junk packets that makes no sense, is not something I can do anything about but to ignore! I once even got a Invalid OPCODE 7 Error, which I didn't save it time to find a way to catch it next time. So if you could also catch meaningless opcodes and turn them to protocol errors might fix this!

Anyway, this is part of the protocol class I'm using to reduce the clutter I get:

import zlib, ssl, websockets
class Connection(websockets.server.WebSocketServerProtocol):

	### Patch Start!
	async def close( self, *args, **kwargs ):
		try:
			await super().close( *args, **kwargs )
		except AttributeError as e:
			# Catches about 20% of the errors, It can't find the "transfer_data_task" AttributeError on close for some reason!
			print( f"[{os.getpid()}] WARNING: failed closing connection : {e}" )
			await self.close_connection()

	async def read_message( self ):
		try:
			return await super().read_message()
		except zlib.error as e:
			# This is the more common one, Catches about 70% of the errors
			# Related Issue: https://github.com/aaugustin/websockets/issues/665
			# Example: Error -3 while decompressing data: invalid distance too far back
			raise websockets.exceptions.ProtocolError(f"uncaught zlib.error!")
			
		except ssl.SSLError as e:
			# Catches about 10% of the errors, also catches different kinds:
			# Example: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2548)
			# Example: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2548)
			# Example: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:2548)
			# Example: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2548)
			raise websockets.exceptions.ProtocolError(f"uncaught ssl.SSLError!") 
	### Patch Over.

So In conclusion: I have found 4 possible scenarios that probably only happen when lots of connections from random parts of the world are connecting to the server, 2 of which I managed to catch and rebrand as a ProcotolError, and the transfer_data_task one could probably be ignored? The other one (Invalid OPCODE) was so rare that I can't really test how I should catch it correctly! Maybe it should work the same and can be rebranded as a ProtocolError? and most importantly, where is the "SSL connection is closed" coming from!? can I even catch it? Is using a contextlib.redirect_stdout my only option?

So I'd be happy to hear your thoughts on this, Could these be considered improvement that can be added to the main project, or am I doing something very wrong here? :)

aliqandil avatar Apr 14 '22 02:04 aliqandil

It is possible that f761e23 will reduce AttributeError related to transfer_data_task, if it happened when shutting down the server. If you have a stack trace for that error around, I could confirm.

aaugustin avatar Apr 17 '22 08:04 aaugustin

One possibility for zlib.error would be: a buggy client negotiates client_max_window_bits but then doesn't honor it i.e. uses a larger compression window than what it declared. This would also be possible, albeit less likely, with client_no_context_takeover.

Are you able to modify the except zlib.error as e: clause to print self.request_headers and self.response_headers when the error happens? Then we can tell:

  • how compression was negotiated
  • if a specific user agent is affected

Unfortunately I don't have a high-traffic production server where this happens so I need your help to reproduce and diagnose.

aaugustin avatar Apr 17 '22 08:04 aaugustin

The ssl.SSLError look like I/O errors; I think I could add them to the clause handling ConnectionError, TimeoutError, EOFError.

aaugustin avatar Apr 17 '22 08:04 aaugustin

I just pushed some changes to ignore errors that we can't do much about anyway.

So, on the four cases you reported, the situation is the following:

  1. AttributeError on transfer_data_task ==> might be fixed if it was happening when stopping / restarting server processes; almost certainly not fixed if it was happening in another scenario; need a stack track to understand in which circumstances it happens 🤔
  2. zlib.error: Error -3 while decompressing data: invalid distance too far back ==> converted to a ProtocolError; no longer logged ☑️
  3. ssl.SSLError with a variety of messages ==> treated like I/O errors; no longer logged ☑️
  4. invalid opcode ==> this one should result in a ProtocolError which isn't logged so I'm not sure how you saw it; I would need more information.

aaugustin avatar Apr 17 '22 10:04 aaugustin

So since I've patched a bunch of these, I'll have to wait for a good chance to reset the server and comment the patches, but I did get the first one for now:

Task exception was never retrieved
future: <Task finished name='Task-737694' coro=<Connection.close() done, defined at /srv/Backend/Engine/protocol.py:29> exception=AttributeError("'Connection' object has no attribute 'transfer_data_task'")>
Traceback (most recent call last):
  File "/srv/Backend/Engine/protocol.py", line 31, in close
    await super().close( *args, **kwargs )
  File "/srv/Backend/.env/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 771, in close
    self.transfer_data_task,
AttributeError: 'Connection' object has no attribute 'transfer_data_task'

aliqandil avatar Apr 17 '22 11:04 aliqandil

OK, I suspect I haven't fixed this case. Can you provide the context of /srv/Backend/Engine/protocol.py around line 29?

aaugustin avatar Apr 17 '22 14:04 aaugustin

The code I've provided, is the relative part in "protocol.py" file, and line 29 would be this line async def close( self, *args, **kwargs ):

Also I let the server run with the zlib debuggin on for 24 hours, this is what I managed to fish out:

[1103346] WARNING: failed reading message: Error -3 while decompressing data: invalid distance too far back UNIQUEIP_1 Host: example.com
Connection: Upgrade
Pragma: no-cache
Cache-Control: no-cache
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36
Upgrade: websocket
Origin: https://example.com
Sec-WebSocket-Version: 13
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Sec-WebSocket-Key: CnMP4d2UlB41PCGY6y9J1w==
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits

 Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: hzyvvOOmxQaiEbstRPSkk+Kat54=
Sec-WebSocket-Extensions: permessage-deflate; server_max_window_bits=12; client_max_window_bits=12
Date: Mon, 18 Apr 2022 02:28:34 GMT
Server: Python/3.10 websockets/10.2


[1103346] WARNING: failed reading message: Error -3 while decompressing data: invalid distance too far back UNIQUEIP_2 Host: example.com
Connection: Upgrade
Pragma: no-cache
Cache-Control: no-cache
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36
Upgrade: websocket
Origin: https://example.com
Sec-WebSocket-Version: 13
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Sec-WebSocket-Key: jO01HSKuaYAOFtQAQubyUg==
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits

 Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: gb2hyaH6VKbNpHgf08PZueWLifY=
Sec-WebSocket-Extensions: permessage-deflate; server_max_window_bits=12; client_max_window_bits=12
Date: Mon, 18 Apr 2022 02:29:05 GMT
Server: Python/3.10 websockets/10.2

[1103229] WARNING: failed reading message: Error -3 while decompressing data: invalid distance too far back UNIQUEIP_3 Host: example.com
Connection: Upgrade
Pragma: no-cache
Cache-Control: no-cache
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36
Upgrade: websocket
Origin: https://example.com
Sec-WebSocket-Version: 13
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Sec-WebSocket-Key: vyPFE+623gHEB29SWzsrmg==
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits

 Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: 9ZYXCdAoZksCXCaXbJRIRUs5iE0=
Sec-WebSocket-Extensions: permessage-deflate; server_max_window_bits=12; client_max_window_bits=12
Date: Mon, 18 Apr 2022 03:15:12 GMT
Server: Python/3.10 websockets/10.2


[1103244] WARNING: failed reading message: Error -3 while decompressing data: invalid distance too far back UNIQUEIP_4 Host: example.com
Connection: Upgrade
Pragma: no-cache
Cache-Control: no-cache
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36
Upgrade: websocket
Origin: https://example.com
Sec-WebSocket-Version: 13
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Sec-WebSocket-Key: toUOLsrGgyoPNj5ahw67gA==
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits

 Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: leRomCUMc2rRYy+4Dyzh3mpCX2Y=
Sec-WebSocket-Extensions: permessage-deflate; server_max_window_bits=12; client_max_window_bits=12
Date: Mon, 18 Apr 2022 06:44:06 GMT
Server: Python/3.10 websockets/10.2

This message appeared a bunch of time for each UNIQUEIP_4, So I just included only one output for each IP. It looks like the huge clutter is caused by just a handful of connections!

BTW, I've actually reported 5 cases, there is also the random output that I sometimes get saying: "SSL connection is closed"

aliqandil avatar Apr 18 '22 14:04 aliqandil

Based on the user agent, compression errors happen on connections coming from Google Chrome. (Unless this is something else masquerading as Google Chrome?) I don't see how we could debug that further :-( So let's stick to ignoring these errors.

aaugustin avatar Oct 09 '22 14:10 aaugustin

Regarding AttributeError: 'Connection' object has no attribute 'transfer_data_task', this happens only if WebSocketServerProtocol.close (where the error happens in you stack trace) is called before WebSocketServerProtocol.connection_open (where the attribute is set and it is never deleted).


I can find only one place where websockets calls WebSocketServerProtocol.close, around line 245 in websockets/legacy/server.py:

            try:
                await self.close()
            except ConnectionError:
                raise
            except Exception:
                self.logger.error("closing handshake failed", exc_info=True)
                raise

I don't see how you can get there without going through WebSocketServerProtocol.handshake, which calls WebSocketServerProtocol.connection_open.

If this is where the error comes from, then you should have "closing handshake failed" and the full stack trace (which I need here) in logs. You can bump that particular logger call to "critical" to minimize log noise.


By chance, does your code do something like asyncio.create_task(connection.close()) on a connection that isn't open or won't open because the handshake fails? That would explain why your stack trace stops at connection.close() and the results you're seeing here.

aaugustin avatar Oct 09 '22 14:10 aaugustin

Regarding SSL connection is closed I think it comes from this line in asyncio:

https://github.com/python/cpython/blob/6927632492cbad86a250aa006c1847e03b03e70b/Lib/asyncio/sslproto.py#L682

This should remove it:

logging.getLogger("asyncio").setLevel(logging.ERROR)

aaugustin avatar Oct 09 '22 16:10 aaugustin