cylc-ui
cylc-ui copied to clipboard
slow initial connection
Updates
- Issue where the websocket connection is closed before it is succseefully opened.
- Observed in Firefox, Edge and Chrome so far with the following messages:
- WebSocket connection to '<URL>' failed: WebSocket is closed before the connection is established.
- Firefox can’t establish a connection to the server at ws://<URL>.
- Observed on Linux and Windows.
- The websocket gets closed with error code 1006 and a blank reason. This is the same result as killing the UIS.
- There is a related timeout issue which can present similarly, add debugging into the subscriptions-transport-ws code to tell the difference.
- These failed requests do not reach the Tornado websocket implementation.
OP
Sometimes it takes a long time for the UI to connect with the backend.
It seems to get there in the end but it often takes in excess of 45 seconds to establish the subscription connection. The issue occurs randomly when the page is loaded and causes the no-connection error box hangs around.
Whilst we are waiting for the connection this message appears multiple times (Firefox):
Firefox can’t establish a connection to the server at ws://localhost:8888/cylc/subscriptions.
Which suggests multiple failed attempts at connection, however, after jamming some logging into the UIS (backend server) I can see no evidence of connection attempts (at least none which actually get into the websocket code). The UIS responds quickly once it receives the request.
It's not clear what's causing this, however, it is worth noting that we are using a deprecated framework for subscriptions, we should try moving to something more modern and see if the problem goes away by itself - #1028.
I don't think I have ever encountered this on Chrome
Try FF/VDI, the lag can be up to one minute.
I've replicated this with offline mode on the Vue 3 branch :(
Assuming this is just an issue on VDI, what could be the cause?
I'm assuming this is a general issue exacerbated by a laggy platform, you can see multiple connection attempts in the console.
(What's VDI - virtual desktop?)
I've done some investigation into this but have come out empty handed.
I've also tried migrating from graphql-transport-ws
to graphql-ws
, however, this did not fix the issue.
So far the error has only been reported in RedHat builds of Firefox where it presents like so:
Firefox can’t establish a connection to the server at ws://localhost:8888/cylc/subscriptions.
This error will persist across multiple retries, it may last for over one minute before clearing.
This happens randomly on page load, sometimes it connects straight away.
P.s. I used to see this when developing the UI. Never investigated the cause though. Sometimes ffox would keep showing the red icon at the top, showing that I was offline and there would be those js messages.
Other times it was after I opened the browser the first time. It happened with just that five workflow, or with multiple workflows (i.e probably not the amount of data).
However, when I used that fake uiserver (offline mode or random tests in python + pre made request/response) I would not get that.
So my guess would be something between browser, uis, and first xhr and/or wss request. Maybe setting a debugger in some parts of uis graphene or web request handling code that could be busy (handling an old request, waiting for some background task?) would replicate the issue...
This report on Bugzilla looks very similar:
Websocket ssl connection can be very slow to connect in Firefox
https://bugzilla.mozilla.org/show_bug.cgi?id=1478314
As does this issue:
Firefox takes multiple attempts to connect to websocket resulting in slow UI loads
https://github.com/fluidd-core/fluidd/issues/931
There a bunch of other similar sounding reports out there with varying levels of details. Here are some summarised statements pulled out of these reports:
- WSS/HTTPS works, WS/HTTP doesn't (x2).
- Regular cert works, self-signed cert doesn't.
- Local host IP works, local host domain doesn't.
- IPv4 works, IPV6 doesn't (x3).
- Earlier versions of
push.js
works, later versions don't, (we aren't usingpush.js
).
I've replicated the issue with:
- HTTP/WS and HTTPS/WSS (self-signed).
-
localhost
and127.0.0.1
.
Issue now replicated on:
- Firefox (Linux)
- Firefox (Windows)
- Chrome (Windows)
- Edge (Windows)
Chromium error:
WebSocket connection to '<URL>' failed: WebSocket is closed before the connection is established.
Gecko error:
Firefox can’t establish a connection to the server at ws://localhost:8888/cylc/subscriptions.
Connection Timeout
After some experimenting, I've seen a couple cases where connections failed on connection timeout. The timeout starts at 1s and increases with each failure. This diff bumps the default 1s connection timeout up to 10s:
diff --git a/src/graphql/index.js b/src/graphql/index.js
index cc252c96..141ee0ef 100644
--- a/src/graphql/index.js
+++ b/src/graphql/index.js
@@ -74,7 +74,9 @@ export function getCylcHeaders () {
export function createSubscriptionClient (wsUrl, options = {}, wsImpl = null) {
const opts = Object.assign({
reconnect: true,
- lazy: false
+ lazy: false,
+ minTimeout: 10000, // default 1000
+ // timeout: 300000 // default 30000
}, options)
const subscriptionClient = new SubscriptionClient(wsUrl, opts, wsImpl)
// these are the available hooks in the subscription client lifecycle
Which appears to clear connection timeout issues, however, this would appear to be a distraction which does not solve the main issue.
Connection closed before it's established
- Cylc-ui uses Apollo Client for GraphQL.
- We use it with subscriptions-transport-ws to provide websocket support.
- subscriptions-transport-ws package creates and manages the websocket connections.
- The issue occurs when the websocket's onclose handler (defined by subscriptions-transport-ws) gets called.
- The onclose method does not appear to be called by our code or subscriptions-transport-ws suggesting it's invoked by the websocket implementation (debugger call stack not especially helpful here)?
Call stack (note line numbers shifted slightly by diff):
onclose first argument
close { target: WebSocket, isTrusted: true, wasClean: false, code: 1006, reason: "", srcElement: WebSocket, currentTarget: WebSocket, eventPhase: 2, bubbles: false, cancelable: false, … }
bubbles: false
cancelBubble: false
cancelable: false
code: 1006
composed: false
currentTarget: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … }
binaryType: "blob"
bufferedAmount: 0
extensions: ""
onclose: function onclose(r, i, o)
onerror: function onerror(r)
onmessage: function onmessage(r)
onopen: function onopen()
protocol: ""
readyState: 3
url: "wss://localhost:8888/cylc/subscriptions"
<prototype>: WebSocketPrototype { close: close(), send: send(), url: Getter, … }
defaultPrevented: false
eventPhase: 2
explicitOriginalTarget: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … }
isTrusted: true
originalTarget: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … }
reason: ""
returnValue: true
srcElement: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … }
target: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … }
timeStamp: 6989
type: "close"
wasClean: false
<get isTrusted()>: function isTrusted()
<prototype>: CloseEventPrototype { wasClean: Getter, code: Getter, reason: Getter, … }
Looks like I can reproduce the problem by using a custom network profile in Chrome with latency of ≥ 3 secs! This might explain why you in particular keep seeing this, as you are plagued by bad ping times?
I think that would cause the connection timeout issue I mentioned above which I think presents in the same way, though should go away after enough repeats (timeouts willing). (note ping time issues shouldn't affect connections to local servers).
To tell the difference, I've been jamming debugging into subscriptions-transport-ws/dist/client.js
.
This bit handles connection timeout issues:
438 SubscriptionClient.prototype.checkMaxConnectTimeout = function () {
439 var _this = this;
440 this.clearMaxConnectTimeout();
441 var duration = this.maxConnectTimeGenerator.duration()
442 this.maxConnectTimeoutId = setTimeout(function () {
443 if (_this.status !== _this.wsImpl.OPEN) {
444 _this.reconnecting = true;
445 console.warn(`Closed by max connect timeout: ${duration}`)
446 _this.close(false, true);
447 }
448 }, duration);
449 };
But this one is the real issue:
483 this.client.onclose = function (code, reason, wasClean) {
484 console.log('# onclose', code, reason, wasClean)
485 if (!_this.closedByUser) {
486 // TODO: this would appear to be the source of the problem!
487 console.warn('Closed by onclose')
488 debugger
489 _this.close(false, false);
490 }
491 };
From testing with Dave on Friday (who seldom notices the issue), we found evidence of this issue in the console more times than not, i.e. the issue hist most of the time, but isn't noticed unless it persists for long enough. Worth checking the console for these messages.
I pretty much always have the devtools open when I access the UI. If the "you are offline" snackbar goes away within ~2 secs I never see the
WebSocket connection to '<URL>' failed: WebSocket is closed before the connection is established.
warning
And when it doesn't you do?
With my 3s latency profile enabled, yes I get those warnings. I actually get a few before the main content is displayed on the page
That approach is forcing a timeout error which is a different problem (the socket gets closed in a different part of the code, see above).
Note, this issue is extremely intermittent, I've just had 30 reloads in a row with no failed connections whatsoever, it's not necessarily easy for me to reproduce, I just have to brute force it.
From manual inspection, the websocket error code is 1006 which is:
Reserved for situations where the connection was closed abnormally without sending a close frame when a status code was expected.
This appears to mean that the connection was closed unexpectedly (no close message), but I'm not sure how it's determining that, is ping-pong established this early on.
Note, the reason field is blank. This exactly matches the response you get when you stop a running server.
@oliver-sanders ' in this issue show what I can now easily reproduce locally :tada:
I was reading an article on JS and performance for Autosubmit UI, when I noticed the author commented something interesting about browser settings that control the maximum number of connections to a server. The author also mentioned about this being a problem to WebSockets, which is why some people used a service workers to have a single client connecting to the server.
Anyhoo, I opened Firefox, then about:config
, and searched for websocket
. It returned only 12, with a couple of entries for devtools (these can be discarded).
Everything was working fine in the UI, and network.websocket.max-connections
was set to 200
by default. Then I opened a few tabs, reduced it to 30
, opened a few more tabs with the Cylc GUI, and voilà
I started closing my tabs and... eventually it connected.
I can't spend much more time on this issue, but I think
- this is a plausible explanation, as that reflects the scenario I had when the issue happened.... I was working on several tabs in a browser,
- the bug happened with complex and simple experiments
- the bug would go away in some seconds, or if I closed other tabs or the browser and opened it again
Open questions (that I wish I had time to keep digging :nerd_face:)
- I tried monitoring the connections with
lsof
andnetstat
but I got different numbers than30
. I was expecting to find a way to filter by process + server (socket, domain name, etc.). - When a WebSocket is closed, does it mean the browser connection (or its object in some pool?) is discarded? When the maxCount variable is decreased?
- Any way to monitor that number and find a better error message, maybe somewhere in the stack trace it says "Reached Maximum Connections" or something?
- There are other settings like delays for reconnect, timeouts, would any of that help?
- Is there anything else on the server side (tornado/jupyter/graphql-ws that could be throttling the max connections as well? i.e. firefox/browser causes the issue, but maybe other parts of the system do that too?, or even linux :man_shrugging: )
- I only know how to manage websockets connections in Ffox, but there should be a similar setting in Chrome/Chromium/Edge/Safari/Opera/etc.
So if anyone is experiencing this a lot, and if you have a lot of tabs, go to about:config and increase that number (assuming you consider it safe as the server & client memory would be affected).
Possible solution:
- Use a
SharedWorker
(and this SO) in the UI to reuse the WebSockets connection- Does it work with current Cylc arch? (I'm using Cylc for small experiments but not following the dev closely)
- Any security issues when two tabs in the same browser have different users logged in (e.g. Firefox container tabs)?
- Use some other type of Service Worker?
- Ask users to manage the number of max connections, especially if they monitor multi experiments in multi tabs, maybe update the UI error message too...
- Replace WebSocket by something else? (LongPolling, WebRTC, something from HTTP/3? :man_shrugging: )
I don't have this issue as we are not using WebSockets in the Autosubmit UI, so this is as far as my curiosity took me :slightly_smiling_face: , hope that helps.
Bruno
Just commenting (and following) that I see this a lot too. But, I haven't seen this error message mentioned, so I'll add it (it always happens after the "can't establish a connection to the server" line.
The connection to wss://..../cylc/subscriptions was interrupted while the page was loading.
So if anyone is experiencing this a lot, and if you have a lot of tabs, go to about:config and increase that number (assuming you consider it safe as the server & client memory would be affected).
It isn't just if you have many tabs. I have a single tab open and this happens. The page eventually loads, but it is slow.