cylc-ui icon indicating copy to clipboard operation
cylc-ui copied to clipboard

slow initial connection

Open oliver-sanders opened this issue 2 years ago • 22 comments

Updates

  • Issue where the websocket connection is closed before it is succseefully opened.
  • Observed in Firefox, Edge and Chrome so far with the following messages:
    • WebSocket connection to '<URL>' failed: WebSocket is closed before the connection is established.
    • Firefox can’t establish a connection to the server at ws://<URL>.
  • Observed on Linux and Windows.
  • The websocket gets closed with error code 1006 and a blank reason. This is the same result as killing the UIS.
  • There is a related timeout issue which can present similarly, add debugging into the subscriptions-transport-ws code to tell the difference.
  • These failed requests do not reach the Tornado websocket implementation.

OP

Sometimes it takes a long time for the UI to connect with the backend.

It seems to get there in the end but it often takes in excess of 45 seconds to establish the subscription connection. The issue occurs randomly when the page is loaded and causes the no-connection error box hangs around.

Whilst we are waiting for the connection this message appears multiple times (Firefox):

 Firefox can’t establish a connection to the server at ws://localhost:8888/cylc/subscriptions.

Which suggests multiple failed attempts at connection, however, after jamming some logging into the UIS (backend server) I can see no evidence of connection attempts (at least none which actually get into the websocket code). The UIS responds quickly once it receives the request.

It's not clear what's causing this, however, it is worth noting that we are using a deprecated framework for subscriptions, we should try moving to something more modern and see if the problem goes away by itself - #1028.

oliver-sanders avatar Jan 10 '23 18:01 oliver-sanders

I don't think I have ever encountered this on Chrome

MetRonnie avatar Apr 28 '23 15:04 MetRonnie

Try FF/VDI, the lag can be up to one minute.

oliver-sanders avatar May 02 '23 08:05 oliver-sanders

I've replicated this with offline mode on the Vue 3 branch :(

oliver-sanders avatar May 11 '23 14:05 oliver-sanders

Assuming this is just an issue on VDI, what could be the cause?

MetRonnie avatar May 11 '23 15:05 MetRonnie

I'm assuming this is a general issue exacerbated by a laggy platform, you can see multiple connection attempts in the console.

oliver-sanders avatar May 11 '23 15:05 oliver-sanders

(What's VDI - virtual desktop?)

hjoliver avatar May 16 '23 23:05 hjoliver

I've done some investigation into this but have come out empty handed.

I've also tried migrating from graphql-transport-ws to graphql-ws, however, this did not fix the issue.

So far the error has only been reported in RedHat builds of Firefox where it presents like so:

 Firefox can’t establish a connection to the server at ws://localhost:8888/cylc/subscriptions.

This error will persist across multiple retries, it may last for over one minute before clearing.

This happens randomly on page load, sometimes it connects straight away.

oliver-sanders avatar Aug 09 '23 15:08 oliver-sanders

P.s. I used to see this when developing the UI. Never investigated the cause though. Sometimes ffox would keep showing the red icon at the top, showing that I was offline and there would be those js messages.

Other times it was after I opened the browser the first time. It happened with just that five workflow, or with multiple workflows (i.e probably not the amount of data).

However, when I used that fake uiserver (offline mode or random tests in python + pre made request/response) I would not get that.

So my guess would be something between browser, uis, and first xhr and/or wss request. Maybe setting a debugger in some parts of uis graphene or web request handling code that could be busy (handling an old request, waiting for some background task?) would replicate the issue...

kinow avatar Oct 16 '23 22:10 kinow

This report on Bugzilla looks very similar:

Websocket ssl connection can be very slow to connect in Firefox

https://bugzilla.mozilla.org/show_bug.cgi?id=1478314

As does this issue:

Firefox takes multiple attempts to connect to websocket resulting in slow UI loads

https://github.com/fluidd-core/fluidd/issues/931

There a bunch of other similar sounding reports out there with varying levels of details. Here are some summarised statements pulled out of these reports:

  • WSS/HTTPS works, WS/HTTP doesn't (x2).
  • Regular cert works, self-signed cert doesn't.
  • Local host IP works, local host domain doesn't.
  • IPv4 works, IPV6 doesn't (x3).
  • Earlier versions of push.js works, later versions don't, (we aren't using push.js).

I've replicated the issue with:

  • HTTP/WS and HTTPS/WSS (self-signed).
  • localhost and 127.0.0.1.

oliver-sanders avatar Nov 13 '23 16:11 oliver-sanders

Issue now replicated on:

  • Firefox (Linux)
  • Firefox (Windows)
  • Chrome (Windows)
  • Edge (Windows)

Chromium error:

WebSocket connection to '<URL>' failed: WebSocket is closed before the connection is established.

Gecko error:

 Firefox can’t establish a connection to the server at ws://localhost:8888/cylc/subscriptions.

oliver-sanders avatar Nov 17 '23 13:11 oliver-sanders

Connection Timeout

After some experimenting, I've seen a couple cases where connections failed on connection timeout. The timeout starts at 1s and increases with each failure. This diff bumps the default 1s connection timeout up to 10s:

diff --git a/src/graphql/index.js b/src/graphql/index.js
index cc252c96..141ee0ef 100644
--- a/src/graphql/index.js
+++ b/src/graphql/index.js
@@ -74,7 +74,9 @@ export function getCylcHeaders () {
 export function createSubscriptionClient (wsUrl, options = {}, wsImpl = null) {
   const opts = Object.assign({
     reconnect: true,
-    lazy: false
+    lazy: false,
+    minTimeout: 10000, // default 1000
+    // timeout: 300000 // default 30000
   }, options)
   const subscriptionClient = new SubscriptionClient(wsUrl, opts, wsImpl)
   // these are the available hooks in the subscription client lifecycle

Which appears to clear connection timeout issues, however, this would appear to be a distraction which does not solve the main issue.

Connection closed before it's established

  • Cylc-ui uses Apollo Client for GraphQL.
  • We use it with subscriptions-transport-ws to provide websocket support.
  • subscriptions-transport-ws package creates and manages the websocket connections.
  • The issue occurs when the websocket's onclose handler (defined by subscriptions-transport-ws) gets called.
  • The onclose method does not appear to be called by our code or subscriptions-transport-ws suggesting it's invoked by the websocket implementation (debugger call stack not especially helpful here)?

Call stack (note line numbers shifted slightly by diff):

Screenshot from 2023-11-20 11-04-27

onclose first argument
close { target: WebSocket, isTrusted: true, wasClean: false, code: 1006, reason: "", srcElement: WebSocket, currentTarget: WebSocket, eventPhase: 2, bubbles: false, cancelable: false, … }
​
bubbles: false
​
cancelBubble: false
​
cancelable: false
​
code: 1006
​
composed: false
​
currentTarget: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … }
​​
binaryType: "blob"
​​
bufferedAmount: 0
​​
extensions: ""
​​
onclose: function onclose(r, i, o)​​
onerror: function onerror(r)​​
onmessage: function onmessage(r)​​
onopen: function onopen()
​​
protocol: ""
​​
readyState: 3
​​
url: "wss://localhost:8888/cylc/subscriptions"
​​
<prototype>: WebSocketPrototype { close: close(), send: send(), url: Getter, … }
​
defaultPrevented: false
​
eventPhase: 2
​
explicitOriginalTarget: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … }
​
isTrusted: true
​
originalTarget: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … }
​
reason: ""
​
returnValue: true
​
srcElement: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … }
​
target: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … }
​
timeStamp: 6989
​
type: "close"
​
wasClean: false
​
<get isTrusted()>: function isTrusted()
​
<prototype>: CloseEventPrototype { wasClean: Getter, code: Getter, reason: Getter, … }

oliver-sanders avatar Nov 20 '23 11:11 oliver-sanders

Looks like I can reproduce the problem by using a custom network profile in Chrome with latency of ≥ 3 secs! This might explain why you in particular keep seeing this, as you are plagued by bad ping times?

MetRonnie avatar Nov 20 '23 11:11 MetRonnie

I think that would cause the connection timeout issue I mentioned above which I think presents in the same way, though should go away after enough repeats (timeouts willing). (note ping time issues shouldn't affect connections to local servers).

To tell the difference, I've been jamming debugging into subscriptions-transport-ws/dist/client.js.

This bit handles connection timeout issues:

438     SubscriptionClient.prototype.checkMaxConnectTimeout = function () {
439         var _this = this;
440         this.clearMaxConnectTimeout();
441         var duration = this.maxConnectTimeGenerator.duration()
442         this.maxConnectTimeoutId = setTimeout(function () {
443             if (_this.status !== _this.wsImpl.OPEN) {
444                 _this.reconnecting = true;
445                 console.warn(`Closed by max connect timeout: ${duration}`)                                                     
446                 _this.close(false, true);
447             }    
448         }, duration);
449     };         

But this one is the real issue:

483         this.client.onclose = function (code, reason, wasClean) {
484             console.log('# onclose', code, reason, wasClean)
485             if (!_this.closedByUser) {
486                 // TODO: this would appear to be the source of the problem!
487                 console.warn('Closed by onclose')
488                 debugger
489                 _this.close(false, false);
490             }    
491         };    

oliver-sanders avatar Nov 20 '23 11:11 oliver-sanders

From testing with Dave on Friday (who seldom notices the issue), we found evidence of this issue in the console more times than not, i.e. the issue hist most of the time, but isn't noticed unless it persists for long enough. Worth checking the console for these messages.

oliver-sanders avatar Nov 20 '23 11:11 oliver-sanders

I pretty much always have the devtools open when I access the UI. If the "you are offline" snackbar goes away within ~2 secs I never see the

WebSocket connection to '<URL>' failed: WebSocket is closed before the connection is established.

warning

MetRonnie avatar Nov 20 '23 14:11 MetRonnie

And when it doesn't you do?

oliver-sanders avatar Nov 20 '23 14:11 oliver-sanders

With my 3s latency profile enabled, yes I get those warnings. I actually get a few before the main content is displayed on the page

MetRonnie avatar Nov 20 '23 15:11 MetRonnie

That approach is forcing a timeout error which is a different problem (the socket gets closed in a different part of the code, see above).

Note, this issue is extremely intermittent, I've just had 30 reloads in a row with no failed connections whatsoever, it's not necessarily easy for me to reproduce, I just have to brute force it.

oliver-sanders avatar Nov 20 '23 15:11 oliver-sanders

From manual inspection, the websocket error code is 1006 which is:

Reserved for situations where the connection was closed abnormally without sending a close frame when a status code was expected.

This appears to mean that the connection was closed unexpectedly (no close message), but I'm not sure how it's determining that, is ping-pong established this early on.

Note, the reason field is blank. This exactly matches the response you get when you stop a running server.

oliver-sanders avatar Nov 20 '23 15:11 oliver-sanders

@oliver-sanders ' in this issue show what I can now easily reproduce locally :tada:

I was reading an article on JS and performance for Autosubmit UI, when I noticed the author commented something interesting about browser settings that control the maximum number of connections to a server. The author also mentioned about this being a problem to WebSockets, which is why some people used a service workers to have a single client connecting to the server.

Anyhoo, I opened Firefox, then about:config, and searched for websocket. It returned only 12, with a couple of entries for devtools (these can be discarded).

Everything was working fine in the UI, and network.websocket.max-connections was set to 200 by default. Then I opened a few tabs, reduced it to 30, opened a few more tabs with the Cylc GUI, and voilà

image

I started closing my tabs and... eventually it connected.

image

I can't spend much more time on this issue, but I think

  • this is a plausible explanation, as that reflects the scenario I had when the issue happened.... I was working on several tabs in a browser,
  • the bug happened with complex and simple experiments
  • the bug would go away in some seconds, or if I closed other tabs or the browser and opened it again

Open questions (that I wish I had time to keep digging :nerd_face:)

  • I tried monitoring the connections with lsof and netstat but I got different numbers than 30. I was expecting to find a way to filter by process + server (socket, domain name, etc.).
  • When a WebSocket is closed, does it mean the browser connection (or its object in some pool?) is discarded? When the maxCount variable is decreased?
  • Any way to monitor that number and find a better error message, maybe somewhere in the stack trace it says "Reached Maximum Connections" or something?
  • There are other settings like delays for reconnect, timeouts, would any of that help?
  • Is there anything else on the server side (tornado/jupyter/graphql-ws that could be throttling the max connections as well? i.e. firefox/browser causes the issue, but maybe other parts of the system do that too?, or even linux :man_shrugging: )
  • I only know how to manage websockets connections in Ffox, but there should be a similar setting in Chrome/Chromium/Edge/Safari/Opera/etc.

So if anyone is experiencing this a lot, and if you have a lot of tabs, go to about:config and increase that number (assuming you consider it safe as the server & client memory would be affected).

Possible solution:

  • Use a SharedWorker (and this SO) in the UI to reuse the WebSockets connection
    • Does it work with current Cylc arch? (I'm using Cylc for small experiments but not following the dev closely)
    • Any security issues when two tabs in the same browser have different users logged in (e.g. Firefox container tabs)?
  • Use some other type of Service Worker?
  • Ask users to manage the number of max connections, especially if they monitor multi experiments in multi tabs, maybe update the UI error message too...
  • Replace WebSocket by something else? (LongPolling, WebRTC, something from HTTP/3? :man_shrugging: )

I don't have this issue as we are not using WebSockets in the Autosubmit UI, so this is as far as my curiosity took me :slightly_smiling_face: , hope that helps.

Bruno

kinow avatar Mar 30 '24 14:03 kinow

Just commenting (and following) that I see this a lot too. But, I haven't seen this error message mentioned, so I'll add it (it always happens after the "can't establish a connection to the server" line.

The connection to wss://..../cylc/subscriptions was interrupted while the page was loading.

ColemanTom avatar Aug 05 '24 00:08 ColemanTom

So if anyone is experiencing this a lot, and if you have a lot of tabs, go to about:config and increase that number (assuming you consider it safe as the server & client memory would be affected).

It isn't just if you have many tabs. I have a single tab open and this happens. The page eventually loads, but it is slow.

ColemanTom avatar Aug 05 '24 00:08 ColemanTom