labgrid
labgrid copied to clipboard
Add retry loop to coordinator connection
Hi Labgrid,
When connecting a coordinator from a far network location (i.e cloud instance), some times, a connection timeout happens producing test failures:
WARNING: dropping connection to peer tcp4:192.168.1.1:20408 with abort=True: WebSocket opening handshake timeout (peer did not finish the opening handshake in time)
This error comes from how Autobahn library handles the Websocket connection. The 'openHandshakeTimeout' websocket parameter is hardcoded to 2.5 seconds in the Autobahn library and there is no means to configure it. This issue has been already reported to Autobahn mantainers:
https://github.com/crossbario/autobahn-python/issues/838
This pull request mitigates this problem (while the Autobahn is fixed/updated). Solution: to implement a retry loop when connections are open in remote.py.
- [x] PR has been tested
BR,
Santi
Codecov Report
Merging #671 (f663ee8) into master (a1b97da) will increase coverage by
0.0%. The diff coverage is63.6%.
@@ Coverage Diff @@
## master #671 +/- ##
======================================
Coverage 58.7% 58.7%
======================================
Files 127 127
Lines 9063 9069 +6
======================================
+ Hits 5324 5328 +4
- Misses 3739 3741 +2
| Impacted Files | Coverage Δ | |
|---|---|---|
| labgrid/resource/remote.py | 82.0% <63.6%> (-0.5%) |
:arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update a1b97da...f663ee8. Read the comment docs.
The linked issue has been resolved in https://github.com/crossbario/autobahn-python/pull/912, I'd rather fix this at the source and make this fully configurable by overriding the method.
When connecting a coordinator from a far network location (i.e cloud instance)
I hope the connection to your cloud instance is done via a VPN, since labgrid does not provide any kind of authentication layer, practically exposing you information and devices to the internet.
After looking in detail at crossbario/autobahn-python#912, that merge has been reverted. Nobody is working on the proposed solution.
I have a look at it and it seems more difficult than appear at first glance.
After looking in detail at crossbario/autobahn-python#912, that merge has been reverted. Nobody is working on the proposed solution.
I have a look at it and it seems more difficult than appear at first glance.
How unfortunate. My problem with this retry loop is that we are masking a problem which will make later failures hard to debug, i.e. having the connection not working from the start due to network problems is easier to understand than having the connection randomly be lost during a test run or remote usage. The whole WAMP library is rather sensitive to components not answering to websocket pings and will result in errors in the test runs or remote usage. I'm not confident its advisable to use labgrid via a cloud instance.
This retry loop only applies to the initial handshake. I have not detected any error while executing test remotely. The 'openHandshakeTimeout' default value is set to 2.5 seconds, and most of the times this limit is meet, only sometimes fails (I guess due to networking variable conditions).
I have created a solution to be able to setup the 'openHandshake' parameter (or others) in Autobahn: https://github.com/crossbario/autobahn-python/pull/1429
Hopefully they like the solution and it will be merged soon
Thanks for the upstream PR. I'd be in favor of merging this fix temporarily, until we can revert it to implement a fully configurable solution, if your upstream PR is accepted.
Closing this, longterm plan is to move away from crossbar.