enterprise_gateway
Add the option to terminate pending Kubernetes kernels if they have events preventing them from starting
Problem
I am facing a problem when using JEG on Kubernetes. I have set the kernel launch timeout to 5 minutes (because I am using large images) and set MAX_KERNELS_PER_USER to 2 to prevent spamming of kernels. When a user submits a request to launch a kernel, it gets started on a remote pod. Sometimes the pod remains stuck in Pending, e.g. due to a lack of resources that is currently in effect. In this case, the user can't submit a new kernel (with a lower resource demand) and has to wait the full 5 minutes for the timeout to take effect before using another kernel.

I even thought about setting up a service that watches pending kernel pods and, if they have events preventing them from starting, sends a DELETE request to the gateway to kill the kernel. The problem is that while kernels are pending, the gateway can't receive DELETE requests for them. In addition, the gateway is not aware of actions performed on the Kubernetes cluster, so I can't delete the pods using the Kubernetes API either, because JEG would still wait out the timeout for such a kernel.
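The watcher idea above can be sketched as a small decision helper: given a pod's phase and its recent event reasons, decide whether the kernel is stuck. In a real watcher these values would come from the Kubernetes API (e.g. via the official `kubernetes` Python client); the event reason names below are common scheduler/kubelet reasons and are assumptions for illustration, not anything defined by EG.

```python
# Sketch of the external-watcher idea: decide whether a Pending kernel pod
# is stuck, based on its phase and the reasons of its recent events.
# In a real deployment, phase and event reasons would be fetched from the
# Kubernetes API; here only the decision logic is shown.

# Event reasons that typically mean the pod will not start without intervention
# (illustrative assumptions, not an exhaustive or EG-defined list).
FATAL_EVENT_REASONS = {"FailedScheduling", "ErrImagePull", "ImagePullBackOff"}

def is_stuck_pending(phase: str, event_reasons: list) -> bool:
    """Return True if a Pending pod has events suggesting it won't start."""
    if phase != "Pending":
        return False
    return any(reason in FATAL_EVENT_REASONS for reason in event_reasons)

# Example: a pod that cannot be scheduled due to insufficient resources.
print(is_stuck_pending("Pending", ["Scheduled", "FailedScheduling"]))  # True
print(is_stuck_pending("Running", ["Scheduled"]))                      # False
```

A watcher built on this could then call the gateway's DELETE endpoint for the matching kernel, which is exactly the part that currently fails while the kernel is still starting.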
Proposed Solution
For starters, I would expect JEG to be aware of the Kubernetes cluster it is running on, so that when kernel pods are deleted it stops polling them. For the other issue I've described, I can see two possible solutions. The first (and, in my opinion, easier) one is to allow DELETE requests to kernels that are still pending. The second is to allow configuring JEG to kill pending kernels on its own when they have events (or certain events), but that one seems a bit trickier to get right.
If I can get an update on this request, that would be great. I would be happy to contribute and add this option, so if you can point me to the relevant files, I can try to implement it and contribute :)
Hi @OrenZ1 - I apologize for the delay. Unfortunately, I'm unable to spend much time on EG (and Jupyter in general) these days.
I think this would be a great addition. Ideally, if we can determine that a Pending state is going to remain pending until the prescribed (and long) timeout, it would be better to abort. The location where we can detect this during the startup sequence is in the KubernetesProcessProxy, and the status loop where we could add more intelligence is here.
I hope you find that helpful but imagine you've probably poked around a bit already so let me know if this isn't what you were looking for.
Thank you for your interest and helping out!
There are multiple ways you can go about this:
- Configure kernel image pullers to avoid delays in downloading images and reduce the startup timeouts
- Configure kernel culling to avoid kernels wasting resources
- If this is related to Spark, enable dynamic allocation to help reduce idle resource usage
Also, having what @kevin-bates proposes above would not only help your use case but also fix a file-handle leak that I have seen in the past.
Hi! Sorry for the delay, but I managed to make a PR for the first thing we discussed here! For now, the PR covers the case where the kernel pod dies while still starting up: EG will raise a matching exception to the user, avoiding the need to wait for the timeout.
I am still trying to think of a way to handle kernels that are stuck in the Pending state. I hope to make a separate PR for that soon :) #1370
Just created a new PR that adds the option to configure different timeouts for different events that occur during startup, including a "0 seconds" timeout, which means the startup will terminate immediately after such an event occurs. #1383
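The per-event timeout scheme described above can be sketched as a simple mapping from event reason to how long the gateway should keep waiting once that event is seen, with 0 meaning "give up immediately". The reason names and values below are illustrative assumptions, not EG's actual configuration keys; see the PR itself for the real mechanism.

```python
# Hedged sketch of per-event startup timeouts: each observed event reason maps
# to the number of seconds to keep waiting, where 0 means abort immediately.
# Reason names and values are illustrative assumptions.
EVENT_TIMEOUTS = {
    "FailedScheduling": 0,  # abort immediately: the pod cannot be placed
    "ErrImagePull": 120,    # allow time for a pull retry before giving up
}

def remaining_wait(event_reason: str, default_timeout: int = 300) -> int:
    """Seconds to keep waiting after observing the given startup event."""
    return EVENT_TIMEOUTS.get(event_reason, default_timeout)

print(remaining_wait("FailedScheduling"))  # 0 -> terminate startup now
print(remaining_wait("Pulling"))           # 300 -> fall back to the default
```

This keeps the original launch timeout as the default while letting operators short-circuit startup for events they know are fatal.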