enterprise_gateway
Add the option to terminate pending Kubernetes kernels if they have events preventing them from starting
Problem
I am facing a problem when using JEG on Kubernetes. I have set the kernel launch timeout to 5 minutes (because I am using large images) and set MAX_KERNELS_PER_USER to 2 to prevent spamming of kernels. When a user submits a request to launch a kernel, it gets started on a remote pod. Sometimes the pod remains stuck in Pending, e.g. due to a lack of resources that is currently in effect. In this case, the user can't submit a new kernel (with a lower resource demand) and has to wait the full 5 minutes for the timeout to take effect before using another kernel.

I even thought about setting up a service that watches pending kernel pods and, if they have events preventing them from starting, sends a DELETE request to the gateway to kill the kernel. The problem is that while kernels are pending, the gateway can't receive DELETE requests for them. In addition, the gateway is not aware of actions performed on the Kubernetes cluster, so I can't delete the pods using the Kubernetes API either, because JEG would still wait out the timeout for such a kernel.
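The watcher idea above can be sketched as a small decision helper: given a pod's phase and its recent event reasons, decide whether the kernel is stuck. In a real watcher these values would come from the Kubernetes API (e.g. via the official `kubernetes` Python client); the event reason names below are common scheduler/kubelet reasons and are assumptions for illustration, not anything defined by EG.

```python
# Sketch of the external-watcher idea: decide whether a Pending kernel pod
# is stuck, based on its phase and the reasons of its recent events.
# In a real deployment, phase and event reasons would be fetched from the
# Kubernetes API; here only the decision logic is shown.

# Event reasons that typically mean the pod will not start without intervention
# (illustrative assumptions, not an exhaustive or EG-defined list).
FATAL_EVENT_REASONS = {"FailedScheduling", "ErrImagePull", "ImagePullBackOff"}

def is_stuck_pending(phase: str, event_reasons: list) -> bool:
    """Return True if a Pending pod has events suggesting it won't start."""
    if phase != "Pending":
        return False
    return any(reason in FATAL_EVENT_REASONS for reason in event_reasons)

# Example: a pod that cannot be scheduled due to insufficient resources.
print(is_stuck_pending("Pending", ["Scheduled", "FailedScheduling"]))  # True
print(is_stuck_pending("Running", ["Scheduled"]))                      # False
```

A watcher built on this could then call the gateway's DELETE endpoint for the matching kernel, which is exactly the part that currently fails while the kernel is still starting.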
Proposed Solution
For starters, I would expect JEG to be aware of the Kubernetes cluster it is running on, so that when kernel pods are deleted it stops polling them. For the other issue I've described, I can see two possible solutions. The first (and, in my opinion, easier) one is to allow DELETE requests to kernels that are still pending. The second is to allow configuring JEG to kill pending kernels on its own when they have events (or certain events), but that one seems a bit trickier to get right.
If I can get an update on this request, that would be great. I would be happy to contribute and add this option, so if you can point me to the relevant files, I can try to implement it and contribute :)
Hi @OrenZ1 - I apologize for the delay. Unfortunately, I'm unable to spend much time on EG (and Jupyter in general) these days.
I think this would be a great addition. Ideally, if we can determine that a Pending state is going to remain pending until the prescribed (and long) timeout, it would be better to abort. The location where we can detect this during the startup sequence is in the KubernetesProcessProxy, and the status loop where we could add more intelligence is here.
I hope you find that helpful but imagine you've probably poked around a bit already so let me know if this isn't what you were looking for.
Thank you for your interest and helping out!
There are multiple ways you can go about this:
- Configure kernel image pullers to avoid delays in downloading images and reduce the startup timeouts
- Configure kernel culling to avoid kernels wasting resources
- If this is related to Spark, enable dynamic allocation to help reduce idle resource usage
Also, having what @kevin-bates proposes above would not only help your use case but also fix a file-handle leak that I have seen in the past.
Hi! Sorry for the delay, but I managed to make a PR for the first thing we discussed here! For now, the PR covers the case where the kernel pod dies while still starting up: EG will raise a matching exception to the user, avoiding the need to wait for the timeout.
I am still trying to think of a way to handle kernels that are stuck in the Pending state. I hope to make a separate PR for that soon :) #1370
Just created a new PR that adds the option to configure different timeouts for different events that occur during startup, including a "0 seconds" timeout, which means the startup will terminate immediately after such an event occurs. #1383
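The per-event timeout scheme described above can be sketched as a simple mapping from event reason to how long the gateway should keep waiting once that event is seen, with 0 meaning "give up immediately". The reason names and values below are illustrative assumptions, not EG's actual configuration keys; see the PR itself for the real mechanism.

```python
# Hedged sketch of per-event startup timeouts: each observed event reason maps
# to the number of seconds to keep waiting, where 0 means abort immediately.
# Reason names and values are illustrative assumptions.
EVENT_TIMEOUTS = {
    "FailedScheduling": 0,  # abort immediately: the pod cannot be placed
    "ErrImagePull": 120,    # allow time for a pull retry before giving up
}

def remaining_wait(event_reason: str, default_timeout: int = 300) -> int:
    """Seconds to keep waiting after observing the given startup event."""
    return EVENT_TIMEOUTS.get(event_reason, default_timeout)

print(remaining_wait("FailedScheduling"))  # 0 -> terminate startup now
print(remaining_wait("Pulling"))           # 300 -> fall back to the default
```

This keeps the original launch timeout as the default while letting operators short-circuit startup for events they know are fatal.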