allow re-connect when connection was lost due to network issues

Open s-u opened this issue 9 years ago • 27 comments

When the websocket connection is disconnected intermittently (due to network issues), allow immediate re-connect. This is a very special case of #777 - the idea is that there is no state to re-create, but rather just an existing connection/pipe to restore.

One way to implement this would be to leverage the proxified setup - delay the shutdown due to websocket close by a short time - let's say a minute or two - and allow the user to re-connect directly in that timeframe by supplying the session id and some key. Since the connection to the actual session is not interrupted, only the proxy needs to support this. It could be implemented low enough on the JS side that the bookkeeping on both sides (such as registered OCAPs) is retained until re-connect, and thus there is no need to change anything at the high level.

(FWIW: what prompted me to file this were unexpected lost connections, even to tupile)

s-u avatar Feb 18 '16 16:02 s-u

+1

Shouldn't the client also attempt to reconnect automatically? Assuming we can detect this situation.

I guess we can start by making it manual and then see if automatic reconnect makes sense.
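
For detection, something like this rough sketch is what I have in mind - showReconnectUI and rclient.reconnect are placeholders, not existing APIs:

```js
// Rough sketch: detect an unexpected websocket drop and kick off a reconnect.
// showReconnectUI() and rclient.reconnect() are placeholders, not existing APIs.
function watchConnection(ws, rclient) {
    ws.addEventListener('close', function(event) {
        // wasClean is false when the socket died without a proper close
        // handshake, e.g. a network or VPN hiccup
        if (!event.wasClean) {
            showReconnectUI();   // gray out the buttons, show a spinner
            rclient.reconnect(); // try to re-attach to the same session
        }
    });
}
```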

gordonwoodhull avatar Feb 18 '16 17:02 gordonwoodhull

Sure, automatic would be best, because it would limit the window in which something can go wrong (e.g., what if the user tries to send something while the connection is down?).

s-u avatar Feb 18 '16 20:02 s-u

This doesn't sound all that hard. We should do this. It's the thing that everyone asks me about when we chat about RCloud.

gordonwoodhull avatar Feb 18 '20 20:02 gordonwoodhull

On the JS side I am sure that all ocaps will be preserved.

Basically the RClient object will attempt to restart the Rserve object.

And the server will provide a secret key when the session is first started, and the proxy will accept the session id and secret key to reconnect.

For development, I have to figure out why I never got the proxy to run correctly on macOS. Something about rserve.conf, IIRC. The proxy appears to work as of 3/11/20.
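
Roughly what I have in mind on the JS side - a sketch only, none of these names are the real rserve.js/RClient API:

```js
// Sketch: RClient keeps its bookkeeping (ocaps, handlers) and only the Rserve
// connection underneath is restarted. Names are illustrative placeholders.
function makeReconnectableClient(wsUrl) {
    var ocaps = {};    // registered ocaps survive a reconnect untouched
    var socket = new WebSocket(wsUrl);

    return {
        registerOcap: function(name, fn) { ocaps[name] = fn; },
        reconnect: function() {
            socket = new WebSocket(wsUrl);   // new socket, same bookkeeping
        }
    };
}
```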

gordonwoodhull avatar Mar 04 '20 16:03 gordonwoodhull

As for the UI, I think every action has to be disabled while the client is reconnecting, but typing code should still be okay.

Ideally, as soon as the client detects that the websocket has died, all buttons go gray, a spinner shows somewhere in the navbar, and you can't do anything but move between cells/assets and type. The client keeps trying to reconnect without any interaction with the user.

The spinner should also indicate, maybe with color, whether it is currently retrying or waiting. We could try exponential backoff for the case where the network or VPN is disconnected.

Finally, if you click on the spinner, it should stop trying to reconnect. Otherwise you wouldn't be able to stop it without leaving the page, and that seems unsafe.
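
Something like this, as a rough sketch - all the UI helpers are placeholders and I'm assuming tryReconnect returns a promise:

```js
// Rough sketch of the retry loop: exponential backoff, spinner state changes,
// and clicking the spinner cancels. disableUI, enableUI, setSpinnerState and
// tryReconnect are placeholders, not existing RCloud functions.
function startReconnectLoop(tryReconnect) {
    var delay = 1000, maxDelay = 60000, cancelled = false, timer = null;

    document.getElementById('reconnect-spinner').onclick = function() {
        cancelled = true;          // user gave up; stop retrying
        clearTimeout(timer);
    };

    function attempt() {
        if (cancelled) return;
        setSpinnerState('retrying');
        tryReconnect().then(enableUI, function() {
            setSpinnerState('waiting');
            delay = Math.min(delay * 2, maxDelay);   // exponential backoff
            timer = setTimeout(attempt, delay);
        });
    }

    disableUI();
    attempt();
}
```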

For some reason I imagine the reconnecting spinner replacing the Save (disk) icon. The one we have on page load is at the right of the navbar icons, but I think it could be moved. Alternatively, we could keep it in the same place and hide the notebook title when reconnecting.

gordonwoodhull avatar Mar 30 '20 19:03 gordonwoodhull

Some of the Aw Shucks fatal errors are truly fatal. Auto-reconnect should be disabled in those cases because the session is in a bad state and needs a page reload.

gordonwoodhull avatar Mar 30 '20 19:03 gordonwoodhull

The part I don't understand is that forward forks on connection - so that process can stay alive and connected to the session, but how would the new connection from the browser find its way to the right process?

gordonwoodhull avatar Jul 01 '20 00:07 gordonwoodhull

Exactly, there is no way. We would have to create an entirely new mechanism - that's why it is so hard. A couple of years ago I was experimenting with the Unix feature for passing FDs via special IPC - it essentially allows one process to hand its FD to another process to use - but obviously it only works on the same host. That way the new process would find the old session process and ask for its socket descriptor so it could route traffic to it; as far as the session process is concerned, it wouldn't be able to tell that a different process is now feeding it the packets.

But even with that, there are still two issues:

  • the session process still has to be aware of the "limbo" state, because if it sends something to the FD while it is not connected, it would get SIGPIPE, which is not good. What do we do with messages in that state? If they are just OOBs then it's probably ok to simply drop them, but what if it is something else?

  • the pesky detail that the client still has to connect to the right host. We could add another proxy that routes it from one host to the session host, but that sounds a bit convoluted and potentially fragile. It would be nice if we had a better-defined mechanism by which the client can request communication with the appropriate host. With the likes of #2714 we are getting closer to that - but there is still some complexity involved, especially if there is only one gateway host, which is a common setup. We could replace nginx with forward - perhaps some testing would be in order to see whether it is actually better.

s-u avatar Jul 01 '20 02:07 s-u

I'd like to rely more on nginx, rather than less, because nginx offers load balancing for users who don't opt into #2714.

Here's the idea I keep turning over in my head. I don't know if it's possible, so feel free to shoot it down.

  1. proxy grabs a unique port so that it is findable in the limbo state, registers this somehow
  2. nginx maintains a mapping from session ids to host:port
  3. client connects to ws://gateway/session-id and nginx reverse-proxies to the correct forward process
  4. if the connection goes down, the client disables all UI, and any calls that still happen can be dropped, as in the send-wrapper sketch after this list (are there non-OOB calls other than UI actions to be concerned about?)
  5. client tries to connect automatically using the same ws:// address - nginx knows where to reconnect
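
Here is the send-wrapper sketch for point 4 - just to show what "dropped" means; the function name is made up:

```js
// Point 4, concretely: guard outgoing calls while the socket is down and just
// drop them (or queue them, if we decide some are worth replaying).
function guardedSend(ws, message) {
    if (ws.readyState !== WebSocket.OPEN) {
        console.warn('connection down, dropping message');
        return false;
    }
    ws.send(message);
    return true;
}
```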

This is no less efficient than what we have on nginx load-balanced setups right now (except for the extra RESTful call needed for #2714). It requires dynamic mapping of URLs, and I am assuming that nginx has some way of doing that.

The unique port in point 1 is the key - arguably the rest is just sugar if the compute nodes are visible to the client.

Fire away!

gordonwoodhull avatar Jul 06 '20 16:07 gordonwoodhull

There is no opt-in to #2714 - we either implement it or not. I'm not sure I understand why we should need another proxy on another port - a proxy is a proxy, so one per host should do, no? If I read your proposal correctly, you are proposing to have nginx do a REST call to find the target host of the session so it can connect to the correct host. That makes sense - the only drawback is that it doesn't work without nginx and requires yet another process, but this time just on the metadata server. Another drawback is possibly performance, as this means yet another HTTP call in the middle. The upshot is that this could also solve #2714 if we add the correct logic to the REST server. This is all assuming that nginx can change the upstream based on a REST call.

s-u avatar Jul 06 '20 21:07 s-u

I meant that the user (notebook) may not specify a particular host or criteria for #2714, and in that case we may want to use nginx load-balancing.

I haven't thought through all the details, but I am imagining that the client will do a REST call to our own service to request a new session and retrieve the websocket URL.

If the system uses nginx, the returned URL will be a special session-id URL on the gateway; if it does not, the URL will be the compute host and the port it has registered for listening (a direct connection from client to forward as now, but with a unique URL due to the allocated port).

gordonwoodhull avatar Jul 06 '20 21:07 gordonwoodhull

First, my quick search didn't reveal any feature of nginx to select the target proxy based on a REST call - it can only do subrequests for authentication. So that sort of kills the idea.

However, nginx supports a "sticky" cookie: http://nginx.org/en/docs/http/ngx_http_upstream_module.html#sticky

I wonder if we could leverage that. It is a bit tricky, because we don't want it to be truly sticky for all connections, so somehow we'd have to set it only for the tab, but the cookie is actually well-defined (MD5 hash of the upstream entry) so we could even generate it at the back-end.
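
Just to illustrate the computation - a node-flavored sketch only (our back-end is not JS, and the exact string nginx hashes would need to be verified against the docs before relying on this):

```js
// If the sticky cookie really is just the hex MD5 of the upstream entry, the
// back-end could precompute it for the host it picked and hand it to the tab.
const crypto = require('crypto');

function stickyCookieFor(upstreamEntry) {      // e.g. "10.0.0.3:8080"
    return crypto.createHash('md5').update(upstreamEntry).digest('hex');
}
```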

I still don't get what you mean by "allocated port" - the client will always connect to forward, which is always on the same port, and which then connects to the QAP (Rserve) process that runs the session.

s-u avatar Jul 06 '20 22:07 s-u

I meant that on requesting a session, a forward/proxy process would be spawned that listens on a unique port, instead of listening on whatever default websocket port it uses now.

Now it has a unique URL and we can find it again, rather than worrying about trying to steal an FD from a running process or whatever.

For load balancing in #2714 I was thinking we could dynamically configure nginx somehow but now I agree that's probably impossible. But that's only relevant to #2714, as long as the compute nodes are accessible directly from the client, and load balancing is an optional feature anyway, so let's put that aside.

So the steps would be

  1. client requests a new session via a REST call to, say, session.R, supplying the notebook id
  2. session.R decides which compute host to put the new session on, based on notebook metadata
  3. session.R asks that host for a new forward process
  4. forward forks, looks for an available, unique port and reports it back to session.R somehow (I am fuzzy on how session.R is "asking" the host)
  5. session.R returns ws://compute-node:unique-port/ back to the client
  6. client can now connect and reconnect websocket connections to the right forward process because it has a unique URL (rough sketch of the client side below)
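
Rough sketch of the client side of steps 1, 5 and 6 - the endpoint name and response shape are made up for illustration:

```js
// Client side of steps 1, 5 and 6 (hypothetical endpoint and response shape).
async function openSession(notebookId) {
    // step 1: request a new session, supplying the notebook id
    const resp = await fetch('/session.R?notebook=' + encodeURIComponent(notebookId));
    // step 5: e.g. { url: "ws://compute-node:12345/", session: "...", key: "..." }
    const info = await resp.json();
    // step 6: the same unique URL serves for the initial connect and any reconnect
    return { info: info, socket: new WebSocket(info.url) };
}
```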

gordonwoodhull avatar Jul 06 '20 22:07 gordonwoodhull

I also looked at stickiness but it doesn't help with finding the exact forked process, which IIUC is the crux of the current issue.

gordonwoodhull avatar Jul 06 '20 22:07 gordonwoodhull

Starting another proxy doesn't help at all - it is irrelevant, as that is just a proxy process, a pass-through, and we can add any logic we need to it. Whatever we use still has to connect to the original process via QAP. There is no need for any additional process.

The main issue is for nginx to connect to the right host; it still connects to the forward process on that host, which runs on a single port - that's perfectly fine, we have things under control from there.

The whole thing for #2714 is much easier - if it's not behind nginx then it becomes trivial: the client just queries a REST API with the notebook id to get the ws://... URL, which is the URL for the RCloud service on the desired host - that's it. You never ever need another forward process - there is just one per host, there is no point in having more, it's just a QAP proxy.

s-u avatar Jul 06 '20 23:07 s-u

> I also looked at stickiness but it doesn't help with finding the exact forked process, which IIUC is the crux of the current issue.

No, the issue is not the process, it is the host. The proxy on a host can handle the local connection to anything - we have full control over that. The problem is how the client gets to the proxy on the correct host in the first place.

s-u avatar Jul 06 '20 23:07 s-u

Maybe to clarify the design here: Rserve processes in the proxified setup listen for local unix connections and use the QAP (Rserve) protocol. The forward proxy is a TCP server which, once connected, connects to a local socket and translates WebSocket frames to QAP messages on that local connection. That's why you only need one proxy per host: it is stateless, just a conduit. That's also why locally there is no issue, since there is nothing preventing a "standby" Rserve from becoming available on, say, a local socket named with the session ID, so the proxy can resume the connection at any time if it gets a request for a particular session. The one pesky detail here is authentication, but fortunately that only has to be handled between the client and the server, i.e. the client would have to send a key/token or something like that to the RCloud process as its first message so the server can verify that it is authorized (that makes the server part a bit annoying, since it has to enter this special state after a disconnect, but it's doable).

So everything is fine, as long as the client has a way to find the correct host. That is the main problem I'm worrying about - the proxy can only connect to local sockets (for security reasons we don't bind QAP on TCP, as that opens a Pandora's box), so you have to connect to the same host that the session is on. If we can solve that, then the rest is doable.
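
To make the first-message dance concrete, something along these lines on the client side - the message format is invented here, it's exactly the part we still have to design:

```js
// Illustrative client side of the re-connect handshake: the first frame on the
// new websocket identifies the session and proves ownership.
function resumeSession(wsUrl, sessionId, token, onResumed, onRejected) {
    var ws = new WebSocket(wsUrl);
    ws.onopen = function() {
        ws.send(JSON.stringify({ cmd: 'resume', session: sessionId, token: token }));
    };
    ws.addEventListener('message', function(ev) {
        var reply = JSON.parse(ev.data);
        if (reply.ok)
            onResumed(ws);   // proxy re-attached us to the standby Rserve
        else {
            ws.close();      // wrong token, or the session is already gone
            onRejected(reply);
        }
    }, { once: true });
    return ws;
}
```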

s-u avatar Jul 06 '20 23:07 s-u

Mostly got that, I think.

If finding the compute host is the problem, why not solve #2714 at the same time and have a RESTful service that returns a URL directly to the compute host, rather than transforming the https:// URL into ws:// on the client as we do now? (Is there anything magic about "upgrades"?)

What is nginx buying us as a websocket reverse proxy? Just load balancing?

gordonwoodhull avatar Jul 06 '20 23:07 gordonwoodhull

I don't understand "one proxy per host" or "stateless". I thought forward forked for each connection (I see lots of these processes), and that the state was approximately one word: the file descriptor of the QAP local socket / named pipe.

gordonwoodhull avatar Jul 07 '20 00:07 gordonwoodhull

Alternatively, if the user notebook doesn't specify any interest, then maybe nginx sticky load balancing would work? I thought we had trouble finding the process holding that QAP file descriptor.

Sorry I keep editing my comments - trying to be absolutely clear here.

gordonwoodhull avatar Jul 07 '20 00:07 gordonwoodhull

> If finding the compute host is the problem, why not solve #2714 at the same time and have a RESTful service that returns a URL directly to the compute host, rather than transforming the https:// URL into ws:// on the client as we do now? (Is there anything magic about "upgrades"?)

I'm not sure I parse that correctly - returning the URL to the host is exactly what I had in mind. However, if there is no proxy you still have the issue of cookie scope (which hosts share the authentication tokens, etc.).

> What is nginx buying us as a websocket reverse proxy? Just load balancing?

That, plus isolation and caching (if we want it), and a single point of entry, which is important for deployment (otherwise you need SSL certs for each host, firewall openings to multiple hosts, etc.).

s-u avatar Jul 07 '20 01:07 s-u

> I don't understand "one proxy per host" or "stateless". I thought forward forked for each connection (I see lots of these processes), and that the state was approximately one word: the file descriptor of the QAP local socket / named pipe.

Yes. I meant that the proxy itself has no state; you can replace it with another process without any loss - that's what I had in mind for re-connect. And by "one proxy" I mean there is only one port bound - as opposed to the additional ports you were mentioning earlier. Since there is no state, you need no additional ports.

s-u avatar Jul 07 '20 01:07 s-u

Got it.

Okay, so if it's just the list of hosts that will change, and we don't have to worry about session ids at the nginx level, let's dynamically configure nginx as I previously linked - with a restart, but no loss to existing connections - whenever the list of compute nodes changes.

Then each compute node could be available as ws://gateway/node-3/ or whatever. That's what nginx excels at.

Aren't we already dynamically configuring rcloud.social? We just need to add some rewrite rules at the same time.

When RCloud is running without nginx it could return direct websocket URLs to the compute hosts, but then it would have all the SSL and cookie troubles that you mention.

I still don't understand how one forward process gobbles up and replaces an old one, but it sounds like you have that figured out.

gordonwoodhull avatar Jul 07 '20 01:07 gordonwoodhull

The changes on the client side are pretty huge but I think I can handle them.

It sounds like we will lose some output if the session is printing using OOBs. For in-band messages, I wish the session process would just block and wait for the client to return. Is this naive?

Or ideally, OOBs would be queued and saved, while in-band blocks. Because what else can the R process do? It's logically part of a distributed machine.

gordonwoodhull avatar Jul 07 '20 01:07 gordonwoodhull

Yes, right, we can just use re-writes. It's probably something we should test since I don't know if we have places that make silent assumptions about living in the root.

As for the session (R side), I think it's perfectly fine to just block - R is synchronous anyway, so in theory no one should be disturbing the peace. The only thing I may have to check is the stdout/stderr forwarding thread. Also, being able to do a fully blocking accept would make the re-connect implementation much easier, since we can just loop there until someone valid shows up (or we time out and decide to give up), and that simply replaces the socket with the new one.

As said earlier, the re-connect is not entirely trivial and will require some negotiation, because you should only be able to re-connect to something you own, but Rserve itself has no notion of authentication, so it's unclear how you prove you're the valid owner of the session. Some client code will need to do that dance - whatever we decide it needs to be.

s-u avatar Jul 07 '20 03:07 s-u

Randomly generate a secret key as part of session start, and demand it on reconnect? I guess if you could read that key out of someone else's browser, then you could steal their session, but that already implies they are running your code somehow.
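
Something like this, just to show the shape of the scheme - a node-flavored sketch; the real generation and check would live in the R/C server at session start:

```js
// Hypothetical helpers: mint a per-session secret and verify it on reconnect.
const crypto = require('crypto');

function newSessionKey() {
    return crypto.randomBytes(32).toString('hex');   // handed to the client once
}

function keyMatches(storedHex, presentedHex) {
    const a = Buffer.from(storedHex, 'hex');
    const b = Buffer.from(presentedHex, 'hex');
    // constant-time comparison so the key can't be probed byte by byte
    return a.length === b.length && crypto.timingSafeEqual(a, b);
}
```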

gordonwoodhull avatar Jul 07 '20 03:07 gordonwoodhull

Will we be able to detect from the client side whether a disconnect was due to a dead websocket or a crash? Or do we need to try to reconnect and say "oops, nothing there, do you want a new session?"

gordonwoodhull avatar Nov 28 '20 17:11 gordonwoodhull