enhancement-proposals icon indicating copy to clipboard operation
enhancement-proposals copied to clipboard

Kernel handshaking pattern proposal

Open JohanMabille opened this issue 3 years ago • 25 comments

Kernel Handshaking pattern

The current implementation of Jupyter client makes it responsible for finding available ports and pass them to a new starting kernel. The issue is that a new process can start using one of these ports before the kernel has started, resulting in a ZMQError when the kernel starts. This is even more problematic when spawning a lot of kernels in a short laps of time, because the client may find available ports that have already been assigned to another kernel.

A workaround has been implemented for the latter case, but it does not solve the former one.

This proposal aims to fix this issue for good, with a hanshaking pattern.

JohanMabille avatar Jan 05 '21 15:01 JohanMabille

FWIW - this is what Enterprise Gateway does. However, I'm in the process of updating it to use a single response address to which the kernel (launchers) respond, rather than having a different response address for each kernel invocation. The reason for the update is that by having a single response address across all invocations, it introduces the possibility of having the server in one cluster, and the kernel in another (e.g., server running in Kubernetes, while the kernel is running in a Hadoop YARN cluster) because the (single) response address can be published.

BTW, although "local" kernels in EG do not use this approach, we have recommended the configuration of "distributed" kernels in loopback just to work around the timing issues you are trying to address since distributed kernels will go through this machinery despite the fact that they are indeed local to the server.

The other major difference is that EG utilizes kernel launchers which (preferably) embed the target kernel so as to avoid any kernel modifications to the dozens of kernel implementations.

Since this proposal requires updates to the kernels themselves, one suggestion I would make is to not overload the use of the connection file parameter and introduce a separate, and explicit, parameter that is mutually exclusive. I think that would help in troubleshooting and maintenance.

I would also suggest that the server (client in the above phrasing) be the thing that produces the connection-file (following receipt of the port information via the socket) not the kernel since the latter (incorrectly) assumes that kernels are always local.

kevin-bates avatar Jan 05 '21 16:01 kevin-bates

However, I'm in the process of updating it to use a single response address to which the kernel (launchers) respond, rather than having a different response address for each kernel invocation. T

It's the same in this proposal, the client opens a single socket that will be used by all launched kernels to communicate their port information.

one suggestion I would make is to not overload the use of the connection file parameter and introduce a separate, and explicit, parameter that is mutually exclusive.

I'm not sure that we can specify mutually exclusive arguments in the kernelspec.

I would also suggest that the server (client in the above phrasing) be the thing that produces the connection-file (following receipt of the port information via the socket) not the kernel since the latter (incorrectly) assumes that kernels are always local.

Indeed that would be definitely better, thanks for the suggestion!

JohanMabille avatar Jan 05 '21 16:01 JohanMabille

Thanks for the response Johan - glad to see this is a single response address proposal as well.

one suggestion I would make is to not overload the use of the connection file parameter and introduce a separate, and explicit, parameter that is mutually exclusive.

I'm not sure that we can specify mutually exclusive arguments in the kernelspec.

Ok. This overload approach probably fits better with backward-compatibility support.

If the new field "kernel_handshake" is missing from the kernelspec, the clients should emit a warning, and assume that the new mechanisms is not supported, and recommend that the key is added to the kernelspec.

Should we just let the kernel vendors add kernel_handshake when installing the kernelspec and leave it at that. Seems like if folks are using kernels that have no plans on being updated, this will cause some unnecessary noise.

Is kernel_handshake a boolean? If so, perhaps users could manually add it - with a value of False - to indicate that no warning should be produced. This approach would also allow supporting implementations to easily fallback in case issues are encountered with the new approach.

kevin-bates avatar Jan 05 '21 17:01 kevin-bates

Are domain sockets a viable approach for you to avoid port conflicts?

blois avatar Jan 05 '21 17:01 blois

Are domain sockets a viable approach for you to avoid port conflicts?

I was going to ask about Windows, but it seems that Windows now has domain sockets?

  • https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/

According to https://github.com/tokio-rs/tokio/issues/2201:

Windows 10 introduced support for unix sockets in late 2017 to Insiders builds, and the spring 2018 general Windows 10 release.

jasongrout avatar Jan 05 '21 17:01 jasongrout

Some context and thread to implement a version of this proposal: https://github.com/jupyter/jupyter_client/issues/487

I shamefully did not find time to work on the implementation before going on paternity leave, but I think it has many of the details about why and how different scenarios could be handled.

MSeal avatar Jan 05 '21 19:01 MSeal

Are domain sockets a viable approach for you to avoid port conflicts?

I don't think we want to drop support for TCP protocol. Besides, you may want to connect to a kernel from a distant client.

JohanMabille avatar Jan 05 '21 20:01 JohanMabille

Should we just let the kernel vendors add kernel_handshake when installing the kernelspec and leave it at that. Seems like if folks are using kernels that have no plans on being updated, this will cause some unnecessary noise.

We want kernels to update to the new mechanism so that we can drop the old one in the long run.

JohanMabille avatar Jan 05 '21 20:01 JohanMabille

I don't think we want to drop support for TCP protocol. Besides, you may want to connect to a kernel from a distant client.

The random port choice right now does not really work for distant clients- how common are those issues? I'm not saying that TCP should be dropped, just wondering if IPC or domain sockets can be used for your scenario to avoid the need for larger changes. We've been using IPC for this for quite a while now.

blois avatar Jan 06 '21 01:01 blois

The random port choice right now does not really work for distant clients

This is orthogonal to distant client. I mean, you could start a kernel and a client on a machine A, and want to connect to that kernel from a client running on a machine B. In that case, you cannot use IPC sockets.

how common are those issues

In the context of Voilà, we noticed up to 30% failure (i.e. notebooks not displayed).

JohanMabille avatar Jan 06 '21 09:01 JohanMabille

isn't there a missing fields that should be added to the connection file ?

and send the connection information to the client.

It's the same in this proposal, the client opens a single socket that will be used by all launched kernels to communicate their port information.

How will the client know which kernel is communicating its ports ? do you need like a kernel_handshake and a kernel_token?

Carreau avatar Jan 06 '21 17:01 Carreau

How will the client know which kernel is communicating its ports ? do you need like a kernel_handshake and a kernel_token?

EG sends the kernel_id to the kernel and it is returned to handle interleaved launches/responses.

I guess I still have a concern about using the connection file to convey the response address as it immediately eliminates remote kernels. I think the response address (and whatever else is required) needs to be an argument to the kernel - just as the connection filename is.

Also, are there plans for encrypting the response?

kevin-bates avatar Jan 06 '21 18:01 kevin-bates

EG sends the kernel_id to the kernel and it is returned to handle interleaved launches/responses.

How does it sends it ?

I guess I still have a concern about using the connection file to convey the response address as it immediately eliminates remote kernels.

I'm not sure I understand why this eliminates remote kernel. If the response adresse is 192.131.21.12:7737 remote kernel can still ping back to the server; or is there something I misunderstood ?

I'm also not sure what the difference is with the response adresse being part of the connection file – or the kernel argument changes.

Also, are there plans for encrypting the response?

At that point I'm not sure we can encrypt but at least we may want to authenticate maybe.

Carreau avatar Jan 06 '21 18:01 Carreau

EG sends the kernel_id to the kernel and it is returned to handle interleaved launches/responses.

How does it sends it ?

As an explicit argument to the launcher: https://github.com/jupyter/enterprise_gateway/blob/master/etc/kernelspecs/spark_python_kubernetes/kernel.json#L21-L22

I'm not sure I understand why this eliminates remote kernel. If the response adresse is 192.131.21.12:7737 remote kernel can still ping back to the server; or is there something I misunderstood ?

Currently, it sounds like the connection file will contain the response address, but the file is local - so the file's contents are not available to the remote kernel and, as a result, it won't know to respond to 192.131.21.12:7737 - all it knows is the name of the connection file (which isn't very helpful and won't exist).

Also, are there plans for encrypting the response?

At that point I'm not sure we can encrypt but at least we may want to authenticate maybe.

EG also sends a public key (via another parameter) that is used by the kernel (launcher) to encrypt an AES key (to encrypt the response). On return, the AES key is decrypted using the private key and then the payload is decrypted using the AES key.

I know this is all probably more than you want but I'm just sharing experiences (and trying to ensure our functionality can somehow continue to exist in some manner).

kevin-bates avatar Jan 06 '21 19:01 kevin-bates

How will the client know which kernel is communicating its ports ? do you need like a kernel_handshake and a kernel_token?

Indeed, the client should add a kernel_token to the connection file so that it can identifies which kernel is communicating its ports.

I guess I still have a concern about using the connection file to convey the response address as it immediately eliminates remote kernels. I think the response address (and whatever else is required) needs to be an argument to the kernel - just as the connection filename is.

If you pass the reponse address as an argument to the kernel, and if that kernel is remote, you need to send the response address from the client to the remote machine that will launch the kernel. It's the same with the connection file, you need to send the connection file to the remote machine where it can be dumped. In both cases, you need an intermediate actor on the remote machine.

Since the kernel command line alreay accepts a conneciton file, and since we cannot specify mutually exclusive arguments in the kernelspec, I think we should stick keep using the conneciton file.

At that point I'm not sure we can encrypt but at least we may want to authenticate maybe.

Indeed, since the connection file contains the signature shceme and the key.

I know this is all probably more than you want but I'm just sharing experiences (and trying to ensure our functionality can somehow continue to exist in some manner).

Thanks for sharing, this is much apreciated! I don't think this change will break your functionality (although there might be subtlety that I missed): the kernel is still started with a connection file, the only difference is that the kernel will communicate back its port to the client. The connection file still has to be transferred from the client machine to the kernel machine if the kernel is remote (via a gateway or any other intermediate agent).

JohanMabille avatar Jan 06 '21 20:01 JohanMabille

The connection file still has to be transferred from the client machine to the kernel machine if the kernel is remote (via a gateway or any other intermediate agent).

Not true. With EG we communicate with a resource manager (kubernetes, docker swarm, hadoop YARN, etc.) to start a launcher that embeds the kernel. The launcher will ultimately produce a connection file to be consumed by its associated kernel (and local to the kernel) but all interaction between the server (client) and the launcher (kernel) takes place via arguments and the launcher code is responsible for encrypting and sending the response (thereby preventing any updates to the associated kernel). For the container-based resource managers, the arguments are passed via environment variables. No file transfers are necessary - particularly because it's not even known at the time of launch on which node the resource manager will decide to run the launcher/kernel.

At any rate, thanks for working on this and listening.

kevin-bates avatar Jan 06 '21 22:01 kevin-bates

The launcher will ultimately produce a connection file to be consumed by its associated kernel

What does prevent you from using the mechanism you already have to pass the additional arguments (i.e. client address and dedicated socket port + kernel token) to the launcher and having it dumping the connection file with these new parameters?

JohanMabille avatar Jan 06 '21 23:01 JohanMabille

That's not that big of a deal - although I was hoping (assuming I could move to this proposal) not having to open the "connection file" to extract the pieces as arguments (and believe it's better for the sake of maintainability to have them separate anyway).

If I'm understanding things, here's what the launcher would then need to do.

  1. Since the connection file argument is overloaded, the launcher will first need to determine if it's talking to a kernel that understands the handshake because what the kernelspec on the server states may not match the kernel installed on the remote node. Not sure how to best do this. For the sake of argument, let's "trust" that the configuration is proper and that if the kernelspec contains the "handshake" entry, then the remote kernel is "handshake-aware".
  2. If the launcher is talking to a "handshake-aware" kernel, yes, it would produce a "connection file" which the kernel then responds against. However, that response will be in a completely different format (non-encrypted for one) and sent to a different listener - and EG will need to figure out how to interact with that.
  3. If the launcher is talking to a "legacy" kernel (that doesn't know about handshakes - although I'm not sure how that's determined), it would do as it does today: produce the ports, encrypt the response and send it to a different listener - one that knows how to decrypt the response, and start the kernel.

I think it's imperative really, that a) the response parameters be conveyed via a different argument than the connection file and b) that kernels updated to use this approach still retain today's behavior - where the connection file contains the actual connection information. That way, applications that have already addressed port conflict issues can still operate and items listed above will no longer apply.

I think at this point, it would be best to meet online where we can probably communicate more interactively. I am willing to adjust for timezone differences - preferring mornings (Pacific TZ). I suspect it's very late for you right now and I appreciate you taking this extra time. If you want to discuss things tomorrow, just ping me in your afternoon. I'm usually at my desk by 7 am PST.

FWIW, I have a proposal (in my head for too long) similar to kernel providers but built into jupyter_client where the Popen process interactions are abstracted. I think if that layer were responsible for handling the connection - making KernelManager really just an application-level entity, then we could easily support IPC, TCP (with handshake), TCP (remote), Kubernetes, etc. as part of the base jupyter framework. This is essentially what EG does with process-proxies but there's no reason this framework can't be made available to any jupyter_client application.

kevin-bates avatar Jan 07 '21 00:01 kevin-bates

I think we should decompose the what to pass to the kernel and the how, I think we agree on what.

A warning against CLI parameter; you do not want to pass security sensitive items via command line arguments as those might be leaked in process list to other users; hence why usually you want an env variable or a file.

Many CVEs, on notebook server were due to open chrome with a URL having a token and this being leaked to other process/users even non-privileged;

Side note; non-handshake aware kernel could be wrapped in a mini handshake aware proxy on remote host to hide details, might have to jump through hoops, but doable.

Carreau avatar Jan 07 '21 01:01 Carreau

In my opinion, passing the handshake port in the connection file is the appropriate route.

  • other information than port numbers must be passed (tcp/ipc, the message signature scheme and key, and the client address). All of it except for the channel/handshake port is common with the current connection file.
  • the kernelspec can only have one command to start the kernel, so it should be the same regardless of the handshake mechanism if we need to support both.

In my opinion, @JohanMabille's proposal is the appropriate course of action to fix this issue. Note that it has to be fixed as the current kernel startup is inherently broken.

We would be happy to connect to discuss this in greater detail.

SylvainCorlay avatar Jan 07 '21 09:01 SylvainCorlay

because what the kernelspec on the server states may not match the kernel installed on the remote node. Not sure how to best do this.

This (the kernelspec not matching the installed kernel) looks like a deployment problem independent from this JEP. This mismatch could already exist regarding the arguments that a kernel accepts for instance.

If the launcher is talking to a "handshake-aware" kernel, yes, it would produce a "connection file" which the kernel then responds against. However, that response will be in a completely different format (non-encrypted for one) and sent to a different listener - and EG will need to figure out how to interact with that.

What does prevent you from having a proxy between the client and the kernel and passing the address of this proxy in the connection file?

I think it's imperative really, that a) the response parameters be conveyed via a different argument than the connection file

I still don't understand why this would be problematic since you already need to "transfer" the information somehow (transfering the connection file or passing data to a resource manager and then generating the connection file from this data on the remote machine is an implementation detail).

FWIW, I have a proposal (in my head for too long) similar to kernel providers but built into jupyter_client where the Popen process interactions are abstracted. I think if that layer were responsible for handling the connection - making KernelManager really just an application-level entity, then we could easily support IPC, TCP (with handshake), TCP (remote), Kubernetes, etc. as part of the base jupyter framework

This is specific to jupyter_client, which is a Python implementation of a client in the Jupyter ecosystem. I think the specification should remain agnostic to the languages.

I think at this point, it would be best to meet online where we can probably communicate more interactively

I agree. What's the best medium to ping you?

JohanMabille avatar Jan 07 '21 11:01 JohanMabille

Matthias is correct, we should focus on the what instead of the how. I'm sorry this has diverged but I'm just trying to make these two implementations (EG's and this proposal) as compatible as possible.

What's the best medium to ping you?

Thanks. Please use my email address associated with my GH profile.

kevin-bates avatar Jan 07 '21 14:01 kevin-bates

@kevin-bates I've just updated the JEP based on the discussion we had today and the feedback from @Carreau . Please let me know if I missed something.

JohanMabille avatar Jan 09 '21 00:01 JohanMabille

Hi Johan, I feel I owe a response here and I apologize for the delay.

I think the proposal accomplishes its purpose - to address race conditions relative to local kernels in multi-kernel applications or port-intensive environments in general. Requiring all kernels to be updated to accomplish this seems a little much, but that's my opinion. I'd like to see us (Jupyter) always allow for the application to optionally specify a kernel's ports and was glad to see the retraction of the third value from kernel_startup_protocol_version.

That said, how a kernel determines its ports and how those ports are communicated should, in my opinion, be a point of integration. This is the basis for the Kernel Environment Provisioning proposal. I think this proposal would be an ideal candidate to include in the default provisioner. In the KEP proposal, the default provisioner is used in the absence of any provisioner (or when it is explicitly specified). As a result, the default provisioner would be used for all existing kernel specifications. We could then add kernel_startup_protocol_version as an attribute of the default provisioner - currently named ClientProvisioner - for those kernelspecs/applications that want to use the handshake. Other environment provisioners can then co-exist, side-by-side, in which these provisioners implement their own means of determining and communicating a kernel's connection information - independent of kernel_startup_protocol_version. Provisioner authors that would like to utilize the new startup protocol would simply derive from ClientProvisioner.

I would be happy to help in this context.

kevin-bates avatar Feb 04 '21 01:02 kevin-bates

Are there enough IPv6 [localhost] addresses to avoid port conflicts?

On Wed, Feb 3, 2021, 20:47 Kevin Bates [email protected] wrote:

Hi Johan, I feel I owe a response here and I apologize for the delay.

I think the proposal accomplishes its purpose - to address race conditions relative to local kernels in multi-kernel applications or port-intensive environments in general. Requiring all kernels to be updated to accomplish this seems a little much, but that's my opinion. I'd like to see us (Jupyter) always allow for the application to optionally specify a kernel's ports and was glad to see the retraction of the third value from kernel_startup_protocol_version.

That said, how a kernel determines its ports and how those ports are communicated should, in my opinion, be a point of integration. This is the basis for the Kernel Environment Provisioning https://github.com/jupyter/jupyter_client/issues/608 proposal. I think this proposal would be an ideal candidate to include in the default provisioner. In the KEP proposal, the default provisioner is used in the absence of any provisioner (or when it is explicitly specified). As a result, the default provisioner would be used for all existing kernel specifications. We could then add kernel_startup_protocol_version as an attribute of the default provisioner - currently named ClientProvisioner

  • for those kernelspecs/applications that want to use the handshake. Other environment provisioners can then co-exist, side-by-side, in which these provisioners implement their own means of determining and communicating a kernel's connection information - independent of kernel_startup_protocol_version. Provisioner authors that would like to utilize the new startup protocol would simply derive from ClientProvisioner.

I would be happy to help in this context.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/jupyter/enhancement-proposals/pull/66#issuecomment-772962143, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAMNSZ6UQ476JZ5VYW3LZTS5H4E3ANCNFSM4VVK7NVA .

westurner avatar Feb 04 '21 03:02 westurner