Notebook Culling for VSCode/CodeServer style notebooks

Open tu-infra-tests opened this issue 2 years ago • 10 comments

/kind feature

Why you need this feature: We use VSCode/CodeServer style notebooks (https://github.com/coder/code-server), where it is critical to have a mechanism to terminate underutilized resources. Without such a mechanism, notebooks can easily stay bound to expensive GPU resources even while no one is actively using them.

Describe the solution you'd like: For Jupyter style notebooks there is a notebook culling feature where underutilized resources can be automatically terminated by the notebook controller if unused for a certain amount of time. It does not appear this feature works when using VSCode style notebooks, but the same type of feature would solve this problem as well.

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

tu-infra-tests avatar Jan 30 '23 18:01 tu-infra-tests

I've sent a PR to extend the existing implementation of the culling controller (https://github.com/kubeflow/kubeflow/pull/6807 and proposals/20220121-jupyter-notebook-idleness.md) to extract Istio HTTP metrics from VSCode and RStudio Notebook Pods and use them to decide on their idleness.

apo-ger avatar Jan 31 '23 19:01 apo-ger

@apo-ger will this also work for vscode and rstudio within jupyterlab via jupyter-server-proxy?

juliusvonkohout avatar Feb 01 '23 10:02 juliusvonkohout

https://github.com/kubeflow/kubeflow/issues/7186 is related

juliusvonkohout avatar Jul 03 '23 13:07 juliusvonkohout

I believe this is because the culling controller only considers the notebook kernels' last_activity, not the terminals'.

Here's my proposal for fixing this issue:

Users should be able to configure the culling controller to also consider terminal workloads by adding an option to the notebook-controller-config, for example:

CULL_OPTION: KERNEL | TERMINAL | BOTH | NETWORK
--------
# KERNEL corresponds to: `api/kernels`
# TERMINAL corresponds to: `api/terminals`
# BOTH takes the two above into consideration
# NETWORK corresponds to: `api/status`
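
As a rough illustration of the proposal above, the option values could be resolved to the Jupyter Server endpoints they name. This is only a sketch: `CULL_OPTION` and its values come from the proposal, while `ENDPOINTS_BY_OPTION` and `endpoints_for` are hypothetical names and not part of the notebook controller.

```python
# Hypothetical mapping from the proposed CULL_OPTION values to the Jupyter
# Server REST endpoints they refer to. Nothing here exists in the
# notebook controller today; it only mirrors the proposal text.
ENDPOINTS_BY_OPTION = {
    "KERNEL": ["api/kernels"],
    "TERMINAL": ["api/terminals"],
    "BOTH": ["api/kernels", "api/terminals"],
    "NETWORK": ["api/status"],
}


def endpoints_for(cull_option: str) -> list[str]:
    """Return the endpoints a culler would poll for the given option."""
    try:
        return ENDPOINTS_BY_OPTION[cull_option.upper()]
    except KeyError:
        # Reject unknown values instead of silently disabling culling.
        raise ValueError(f"unknown CULL_OPTION: {cull_option!r}") from None
```

Failing loudly on an unknown value is a design choice in this sketch; a real controller might prefer to fall back to the current kernel-only behaviour.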

wjhhuizi avatar Jul 03 '23 15:07 wjhhuizi

@wjhhuizi api/status doesn't work due to components/proposals/20220121-jupyter-notebook-idleness.md

check the ## Alternative Considered Approaches section

wadhah101 avatar Jul 03 '23 18:07 wadhah101

> @wjhhuizi api/status doesn't work due to components/proposals/20220121-jupyter-notebook-idleness.md
>
> check the ## Alternative Considered Approaches section

It seems that proposal was created 2 years ago. I tested running a long-running code block in the notebook and then closing the browser tab, and the last_activity from api/status still appears to update properly.

Code Tested:

import time

# Print once per second for a minute so the kernel keeps producing output.
for i in range(60):
    print(i)
    time.sleep(1)

wjhhuizi avatar Jul 03 '23 20:07 wjhhuizi

Okay, it seems that if I remove the print call, then last_activity is not updated for either api/kernels or api/status; however, the execution_state in api/kernels remains busy. This might be the point, I think...
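
To make that observation concrete, here is a minimal sketch of an idleness check that treats a busy `execution_state` as activity even when `last_activity` is stale. It assumes the input is the JSON list returned by `api/kernels`, with its `last_activity` and `execution_state` fields; the names `parse_ts` and `kernels_idle` and the threshold parameter are hypothetical, and this is not how the culling controller is actually implemented.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts: str) -> datetime:
    # Jupyter reports UTC timestamps with a trailing "Z"; normalise the
    # suffix so fromisoformat accepts it on older Python versions too.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def kernels_idle(kernels: list[dict], now: datetime, idle_after: timedelta) -> bool:
    """Return True only if no kernel is busy and none was active recently."""
    for kernel in kernels:
        # A silent long-running cell keeps the kernel busy without
        # refreshing last_activity, so check the state first.
        if kernel.get("execution_state") == "busy":
            return False
        if now - parse_ts(kernel["last_activity"]) < idle_after:
            return False
    return True
```

With this check, the loop without the print call would still count as active for as long as the kernel reports busy.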

wjhhuizi avatar Jul 03 '23 20:07 wjhhuizi

Updating the proposal to:

CULL_OPTION: KERNEL | TERMINAL | BOTH | NETWORK | NETWORK+
--------
# KERNEL corresponds to: `api/kernels`
# TERMINAL corresponds to: `api/terminals`
# BOTH takes the two above into consideration
# NETWORK corresponds to: `api/status`
# NETWORK+ takes both `api/status` and KERNEL idleness into consideration

wjhhuizi avatar Jul 03 '23 20:07 wjhhuizi

We will probably just replace this one here with https://github.com/kubeflow/kubeflow/issues/7156

juliusvonkohout avatar Sep 27 '23 10:09 juliusvonkohout

+1

em-le-ts avatar Nov 03 '23 02:11 em-le-ts