remi
remi copied to clipboard
Websocket communication error
Hi Davide,
I finally upgraded Remi from 2019.11 to 2020.11.20 some days ago, so that I can benefit from the bugfix of the connection drop we worked on. It seems that many things changed in Remi since one year and it took me less than 10 minutes to adapt my server code (which is kind of huge now), so congratulations for your efforts to make it easy to maintain.
I saw that you worked on the websocket transmissions on both server and client side, and unfortunately I'm facing a new websocket bug that wasn't on 2019.11. It took me hours to reproduce, isolate and finally create a minimal (not-)working example.
Conditions to reproduce:
- The client/server must be linked by a "not too fast" connection. I mean this issue doesn't appear when they are on the same machine, wired network, or even very fast wifi. The server on which I'm running Remi usually for production is 500km from me, it is behind some proxies, and I'm connected through wifi to my router at home : this is enough to reproduce the issue. The best way I found to reproduce finally 100% of the time is to run remi at home, open the remi listening port from the Internet, and connect though 4G.
- The server must send multiple successive websocket messages on a single UI action. The file attached is a basic python script that reproduces well the issue : there is a column (
VBox
) of 100Labels
, and at the top aButton
. The button triggers a function that successively updates the Labels text, hence sending 100 websocket messages in a row. If you don't succeed to reproduce it, increasing from 100 to 1000 the number of Labels should increase the probability of occurrence. - Browser: I tried firefox, chrome, and brave (on my android device), all of them fail.
- I tried ubuntu 18.04 and 19.10, both with remi 2020.11.20.
Symptoms:
- When you initially load the page everything works well, as the whole page is sent in a single message. Then when you press the button, the first messages are ok, then randomly at a certain point the websocket connexion fails. As far as I understood the client browser triggers
_ws.onerror
, making the server to close the connection. On the client side, we see for a short time Remi's "loading folding square". After that, a new websocket is automatically established and the page is fully re-sent to the client. - Log on the server side
remi.request DEBUG App.onpageshow event occurred
remi.server.ws DEBUG send_message: 3... -> ('37.170.119.51', 2518)
remi.server.ws DEBUG on_message: callback
remi.request DEBUG App.onresize event occurred. Width:400 Height:584
remi.server.ws DEBUG send_message: 3... -> ('37.170.119.51', 2518)
remi.server.ws DEBUG on_message: callback
remi.server.ws DEBUG send_message: 1139974414727696,%3Cp%20id%3D%22139974414727696%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws DEBUG send_message: 1139974414727632,%3Cp%20id%3D%22139974414727632%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws DEBUG send_message: 1139974415166160,%3Cp%20id%3D%22139974415166160%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws DEBUG send_message: 1139974415164176,%3Cp%20id%3D%22139974415164176%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws DEBUG send_message: 1139974415164944,%3Cp%20id%3D%22139974415164944%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws DEBUG send_message: 1139974414919440,%3Cp%20id%3D%22139974414919440%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws DEBUG send_message: 1139974414921104,%3Cp%20id%3D%22139974414921104%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
[...]
remi.server.ws DEBUG send_message: 1139974396841040,%3Cp%20id%3D%22139974396841040%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws DEBUG send_message: 1139974396363472,%3Cp%20id%3D%22139974396363472%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws DEBUG send_message: 1139974396365200,%3Cp%20id%3D%22139974396365200%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws DEBUG send_message: 1139974396366800,%3Cp%20id%3D%22139974396366800%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws DEBUG send_message: 1139974396397264,%3Cp%20id%3D%22139974396397264%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.request DEBUG removing websocket instance, communication error with client
remi.server.ws ERROR Error managing incoming websocket message
Traceback (most recent call last):
File "/home/francois/.local/lib/python3.7/site-packages/remi/server.py", line 165, in read_next_message
length = self.bytetonum(length[1]) & 127
IndexError: index out of range
remi.server.ws DEBUG ws ending websocket service
remi.server.ws INFO connection established: ('37.170.119.51', 2524)
remi.server.ws DEBUG handle
remi.server.ws DEBUG handshake
remi.server.ws INFO handshake complete
remi.server.ws DEBUG send_message: 0139974414943632,%3Cdiv%20id%3D%22loading%22%20class%3D%22Container%22%20data-parent-widget%3D%22139... -> ('37.170.119.51', 2524)
- Log on the client side:
opening websocket
Connection is closed... event code: 1006, reason:
failed connections=1 queued messages=0
opening websocket
Misc. notes This issue affects my code as, when I do a refresh, I also sometimes send some "user-defined" javascript code to the browser. If the issue happens before my manual javascript sending, so it's not executed and the page doesn't display fully. Apart from that, the client only sees the "waiting rectangle" then a new connection initiate. Also, I'm a bit confused as this bug was not present in Remi 2019.11.
NB: in the log above I increased the number of chars displayed by the sever in server.py send_message
to see a little bit more.
Could you please have a look and try to reproduce this ? Thanks, François
Hello @batzkass, wonderful exhaustive bug report, thank you.
There is a server parameter to correct the websocket communication timeout. In your case a longer timeout should fix the problem. The parameter is websocket_timeout_timer_ms. It have to be passed to the start
function call. Default value is 1000 (1 second).
Please let me know if this solves your problem.
Have a nice day.
Unfortunately this doesn't change anything, I went up to 10000. In addition, when the issue occurs it's almost instantaneous (subsequent to my click on the button), way less than 1 second. It looks like a sending buffer that is filled faster than it is emptied (by actual sending). Is there somewhere this kind of buffer size in the websocket lib ?
I think about something else : to solve the last issue we had, you added a test, whether the socket is writable or not. If the previous message is still not fully received or acknowledged, the socket may not be writable for the current one. Perhaps we should wait a bit an retry in this case before closing the connexion ?
Well, the following code seems to fix the issue, at least for the minimal example I gave on 4G connection and 5000 (!) labels.
def send_message(self, message):
if not self.handshake_done:
self._log.warning("ignoring message %s (handshake not done)" % message[:10])
return False
self._log.debug('send_message: %s... -> %s' % (message[:100], self.client_address))
out = bytearray()
out.append(129)
length = len(message)
if length <= 125:
out.append(length)
elif 126 <= length <= 65535:
out.append(126)
out += struct.pack('>H', length)
else:
out.append(127)
out += struct.pack('>Q', length)
if not pyLessThan3:
message = message.encode('utf-8')
out = out + message
for i in range(10):
readable, writable, errors = select.select([], [self.request,], [], 0) #last parameter is timeout, when 0 is non blocking
#self._log.debug('socket status readable=%s writable=%s errors=%s'%((self.request in readable), (self.request in writable), (self.request in error$
writable = self.request in writable
if writable:
self.request.sendall(out)
return True
time.sleep(0.1)
return False
EDIT: ... but I get again bad results on connection drop (server not reacting to clicks for example...). I hope all these information will help you.
@batzkass thank you very much for all this debug and support. Instead of creating a for loop it could be possible that a timeout other than 0 would solve the problem in this function call:
readable, writable, errors = select.select([], [self.request,], [], 0)
Or maybe we have to check which kind of error is returned. I will be able to test this on Wednesday ;-)
Well, a timeout of 2s seems to fix the issue (without the loop). But let me a few days to confirm this on real conditions. EDIT: it works 100% even with brutal connection drops. Thanks for the fix, this will drastically improve the user experience of our app. I'm just waiting for the pypi release to update the server instances I have.
@batzkass This should be fixed now (with a very long delay, excuse me). If you can give it a try it would be great, if you cannot than no worries. ;-)