labgrid icon indicating copy to clipboard operation
labgrid copied to clipboard

Add retry loop to coordinator connection

Open sestebanz opened this issue 4 years ago • 6 comments

Hi Labgrid,

When connecting a coordinator from a far network location (i.e cloud instance), some times, a connection timeout happens producing test failures:

WARNING: dropping connection to peer tcp4:192.168.1.1:20408 with abort=True: WebSocket opening handshake timeout (peer did not finish the opening handshake in time)

This error comes from how Autobahn library handles the Websocket connection. The 'openHandshakeTimeout' websocket parameter is hardcoded to 2.5 seconds in the Autobahn library and there is no means to configure it. This issue has been already reported to Autobahn mantainers:

https://github.com/crossbario/autobahn-python/issues/838

This pull request mitigates this problem (while the Autobahn is fixed/updated). Solution: to implement a retry loop when connections are open in remote.py.

  • [x] PR has been tested

BR,

Santi

sestebanz avatar Nov 23 '20 12:11 sestebanz

Codecov Report

Merging #671 (f663ee8) into master (a1b97da) will increase coverage by 0.0%. The diff coverage is 63.6%.

Impacted file tree graph

@@          Coverage Diff           @@
##           master    #671   +/-   ##
======================================
  Coverage    58.7%   58.7%           
======================================
  Files         127     127           
  Lines        9063    9069    +6     
======================================
+ Hits         5324    5328    +4     
- Misses       3739    3741    +2     
Impacted Files Coverage Δ
labgrid/resource/remote.py 82.0% <63.6%> (-0.5%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update a1b97da...f663ee8. Read the comment docs.

codecov[bot] avatar Nov 23 '20 12:11 codecov[bot]

The linked issue has been resolved in https://github.com/crossbario/autobahn-python/pull/912, I'd rather fix this at the source and make this fully configurable by overriding the method.

When connecting a coordinator from a far network location (i.e cloud instance)

I hope the connection to your cloud instance is done via a VPN, since labgrid does not provide any kind of authentication layer, practically exposing you information and devices to the internet.

Emantor avatar Nov 23 '20 13:11 Emantor

After looking in detail at crossbario/autobahn-python#912, that merge has been reverted. Nobody is working on the proposed solution.

I have a look at it and it seems more difficult than appear at first glance.

sestebanz avatar Nov 23 '20 17:11 sestebanz

After looking in detail at crossbario/autobahn-python#912, that merge has been reverted. Nobody is working on the proposed solution.

I have a look at it and it seems more difficult than appear at first glance.

How unfortunate. My problem with this retry loop is that we are masking a problem which will make later failures hard to debug, i.e. having the connection not working from the start due to network problems is easier to understand than having the connection randomly be lost during a test run or remote usage. The whole WAMP library is rather sensitive to components not answering to websocket pings and will result in errors in the test runs or remote usage. I'm not confident its advisable to use labgrid via a cloud instance.

Emantor avatar Nov 25 '20 06:11 Emantor

This retry loop only applies to the initial handshake. I have not detected any error while executing test remotely. The 'openHandshakeTimeout' default value is set to 2.5 seconds, and most of the times this limit is meet, only sometimes fails (I guess due to networking variable conditions).

I have created a solution to be able to setup the 'openHandshake' parameter (or others) in Autobahn: https://github.com/crossbario/autobahn-python/pull/1429

Hopefully they like the solution and it will be merged soon

sestebanz avatar Nov 25 '20 13:11 sestebanz

Thanks for the upstream PR. I'd be in favor of merging this fix temporarily, until we can revert it to implement a fully configurable solution, if your upstream PR is accepted.

Emantor avatar Nov 25 '20 13:11 Emantor

Closing this, longterm plan is to move away from crossbar.

Emantor avatar Mar 25 '24 14:03 Emantor