
Persistent Jupyter Kernels - Restore/re-connect to an existing local/remote kernel (do not shut down the kernel upon closing/reloading vscode/vscode.dev)

Open • DonJayamanne opened this issue 4 years ago • 59 comments

Problem

Local

  • The user starts a notebook; the kernel is now running on the local machine
  • Assume the computer goes to sleep
  • After a while, if we go back into the notebook, the notebook is unable to re-connect to the same kernel (kernel state is lost)
  • Similarly, if the user re-loads VS Code, the notebook is unable to re-connect to the same kernel (kernel state is lost)
  • Similarly, if a user is using Remote SSH and the connection is reset, then after re-connecting and opening the same notebook, the user is unable to re-connect to the same kernel (kernel state is lost)

Remote

  • User opens a notebook and runs a cell against a remote kernel
  • If the user re-loads VS Code (or vscode.dev), the notebook is unable to re-connect to the same kernel (kernel state is lost)

Investigation: Running Server & JupyterLab API for extensibility

Goals:

  • Long-running kernels
    • Users can open a notebook with a cell that's still running & see the output being generated
    • Similarly, users can open notebooks related to kernels that are still running
  • Extensibility for extension authors. This is a by-product of long-running kernels (i.e. you get this for free - almost)

Planned (related) Prototypes

  • Long-running kernels: solve problems related to the kernel/session being lost due to:
    • VS Code shutdown
    • VS Code restart
    • The computer sleeping
    • SSH connection issues
    (AML Compute will also benefit)
  • By-products of the extensibility work:
    • Julia widgets (I might end up doing this first; it might be easier)
    • IPyWidgets outside notebooks
    • Variable viewer using the new API
    • Data frame viewer using the new API

Technical details

  • Server: a background process that
    manages kernels & sessions, and
    exposes the kernel socket connection over this connection (we already have the code/technology for this) via a proxy socket (a dummy kernel in the UI layer, created with a dummy socket connection).
    Security: how do we secure this web server? (This will need to be addressed, but I'm leaving that for later.)
  • Expose Jupyter extension extensibility over the JupyterLab API: I won't be exposing a connection; instead I will just expose the SessionManager, KernelManager & other class instances from the extension API

Also related: https://github.com/microsoft/vscode-jupyter/issues/300

DonJayamanne avatar Nov 20 '20 22:11 DonJayamanne

Is there a milestone issue to see the progress of the update?

matifali avatar Jan 17 '23 09:01 matifali

Unfortunately this issue has not yet been prioritized at our end; please do vote on this issue, though.

DonJayamanne avatar Jan 17 '23 20:01 DonJayamanne

What do you suggest as a workaround if one wants to run long (10+ hour) sessions using Jupyter notebooks in vscode when connected to a remote kernel over SSH (using the vscode Remote extension)? After some hours the connection gets disconnected and there is no way to see the progress or output of running cells.

matifali avatar Jan 18 '23 05:01 matifali

@matifali unfortunately at this stage we have no workaround for this; let me see if I can get an update within a week.

DonJayamanne avatar Jan 18 '23 19:01 DonJayamanne

@matifali I'm trying to understand your expectations, hence the following questions

  • Assume you have one cell; the code in this cell prints the numbers from 1 to 100, printing a number every hour
  • Assume you run this cell and saw the number 1 printed out
  • Now you run this cell, close vscode, come back tomorrow, open vscode, and open this same notebook
  • Would you expect to see the numbers 1, 2, 3, 4 and then slowly the numbers going up to 100 while vscode is open (as the execution is still in progress)?
  • Or would you expect to see 1, 80, 81, 82 and then the numbers keep going up while vscode is open (as the execution is still in progress)?
  • Assume you opened vscode after enough hours that you know all 100 numbers would have been printed while vscode was closed. Would you expect to see all of 1, 2, ... 100 in the output, or just expect to be able to connect to the kernel and see that execution has completed?

I ask this because the easiest thing to get working is:

  • If the cell is still running, then we display 1, 80, 81, 82 (where 1 was from the first instance of vscode, and 80, 81 and so on came after vscode was opened again; i.e. all of the output generated while vscode was closed will not be captured and will not be stored in the notebook)
  • I.e. we will only allow connecting to a kernel, and you can see whether execution has completed or not; if it is still going on, then the new data will be appended to what was stored previously

Thanks

DonJayamanne avatar Jan 20 '23 04:01 DonJayamanne

  • Would you expect to see the numbers 1, 2, 3, 4 and then slowly the numbers going up to 100 while vscode is open (as the execution is still in progress)?

I would prefer this output, as my use case is to train deep learning models and it's better if we can see the full history.

Assume you opened vscode after enough hours that you know all 100 numbers would have been printed while vscode was closed. Would you expect to see all of 1, 2, ... 100 in the output, or just expect to be able to connect to the kernel and see that execution has completed?

This is preferred.

Or would you expect to see 1, 80, 81, 82 and then the numbers keep going up while vscode is open (as the execution is still in progress)?

This is also OK, but the problem is that vscode is unable to connect to a running remote kernel and show any outputs. Yes, the process is running, but we do not see anything printed; there is no indication of whether the losses are actually decreasing.

matifali avatar Jan 20 '23 05:01 matifali

@matifali please could you provide a simple notebook that we can use for testing purposes, to ensure we have a simple sample close to a real-world scenario.

It could be a simple training model, to keep things simple. I'd like to see what kind of output you are using and the structure of the notebook.

If possible, I'd really appreciate a simple notebook without any external dependencies other than pip packages (i.e. without CSV or other files).

Once again, thanks for coming back with the details.

DonJayamanne avatar Jan 20 '23 11:01 DonJayamanne

@matifali please could you provide a simple notebook that we can use for testing purposes [...]

I have made this simple toy notebook that trains a DNN classifier with randomly generated data. I have tried to replicate the essence of a real ML scientist/engineer's workflow. There are no external dependencies other than the necessary packages, which can be installed with the following commands:

    pip install tensorflow
    pip install numpy
    pip install scikit-learn

The structure of the notebook follows a standard format for training ML models:

  • Importing necessary packages.
  • Loading and processing (generating, in this case) the data.
  • Defining the model architecture.
  • Training and validation of the model.

The last cell is the most important for testing the reconnection mechanisms, as this is the part where the training loop runs and the results are displayed. You will see the number of epochs, the loss, and the accuracy of the model printed as the training progresses. I have set a very high number of epochs so that you have plenty of time to test the reconnection mechanisms before the training completes. Ideally, we would like to see the complete training history (all the lines that are printed when the last cell is run).

For my use cases, model training can take days, even weeks, and what I have found is that I cannot leave this kind of notebook running and exit VS Code because otherwise the process dies immediately when I close the window. Allowing the process to keep running in the background is a necessary first step for the reconnect mechanism to make sense to ML scientists/engineers, especially laptop users like me.

You can find the notebook in the following repository: https://github.com/RYSKZ/Toy-DNN-Training

Please let me know if you have any issues or need further clarification.

FaintWhisper avatar Jan 25 '23 18:01 FaintWhisper

@DonJayamanne, the above notebook seems like a good fit for the test.

matifali avatar Jan 30 '23 08:01 matifali

Bumping. Any movement on this?

dmarx avatar Feb 16 '23 09:02 dmarx

I have made this simple toy notebook that trains a DNN classifier with randomly generated data [...]

@DonJayamanne You may use this notebook for testing.

matifali avatar Feb 16 '23 10:02 matifali

@matifali I'm trying to understand your expectations, hence the following questions [...]

The fundamental issue here is that the Jupyter server shows the available "running kernels" that can be reconnected to, and vscode doesn't. You could get around the complexities of expected behavior with respect to specific cell outputs if you just made the already-running kernels visible to the user somehow.

Concretely: I have a GPU-equipped workstation and use it to run image-generation notebooks, often from my laptop connected via vscode's "ssh remote" functionality. New images appear in the cell output as they are generated, but they are also written to disk (on the workstation). If the screen on my laptop goes to sleep, vscode prompts me to re-enter the password for my remote and responds by creating a new jupyter session. The old session is still running, as evidenced by outputs continuing to be written to disk and ps aux showing the old jupyter PID still there and consuming lots of resources. (To be clear: vscode sometimes kills the running session after I start a new one, but this behavior seems inconsistent, and I often either leave the background job to run to completion or SIGKILL it manually myself to regain visibility of outputs.)

As a user, I should be able to pick the existing, running kernel from the "select kernel" dropdown, but it is not available. This is a basic jupyter feature and it should not be difficult to expose it. It would be nice if vscode "intelligently" reconnected itself, but right now there's literally no option to reconnect to the old kernel at all, automagically or manually. vscode just needs to expose visibility of the already-running kernels it's managing, rather than only listing the kinds of kernels it's capable of initiating.
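For reference, a plain Jupyter server already exposes this information; below is a minimal shell sketch of inspecting it (the localhost:8888 URL and MYTOKEN are placeholders, not values from this thread):

    # List the Jupyter servers started by this user; each line includes
    # the server URL and its token
    jupyter server list

    # Ask one server's REST API for its running kernels
    # (returns JSON with kernel ids and execution states)
    curl -s -H "Authorization: token MYTOKEN" http://localhost:8888/api/kernels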

@DonJayamanne

dmarx avatar Feb 16 '23 17:02 dmarx

Any update on this?

As far as I understand, it is not possible to start a Jupyter notebook running on a remote machine via the VS Code SSH extension, disconnect from the SSH tunnel, and come back to find the notebook still running.

I have tried with tmux, but I don't see a way to have the Jupyter notebook show up in VS Code after reattaching to the running tmux session.

Could anyone give a hand?

marcoBmota8 avatar Feb 24 '23 20:02 marcoBmota8

+1

andreimargeloiu avatar Apr 02 '23 10:04 andreimargeloiu

I'd heavily rely on this feature. Any updates on this? Or viable workarounds?

bbantal avatar Apr 30 '23 19:04 bbantal

@bbantal

As a workaround, I have succeeded in running my own jupyter server process and connecting to it as a "remote" kernel (even though it runs on the same host). As long as the jupyter server process is running, the state of your kernel is persisted across VS Code restarts.

jrich100 avatar May 01 '23 17:05 jrich100

As a workaround, I have succeeded in running my own jupyter server process and connecting to it as a "remote" kernel [...]

By your own jupyter server, do you mean a second jupyter server that you run on your local machine? As in "remote jupyter server" -> "local jupyter server" -> "local VS Code session"?

@jrich100

bbantal avatar May 01 '23 21:05 bbantal

@bbantal

We run the jupyter server process on the same machine where VS Code is running. Then, when selecting a kernel in VS Code, you can choose to connect to a remote Jupyter server, and there you can specify the URL generated by the notebook process.
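A minimal sketch of this workaround (the port is arbitrary, and the exact VS Code menu wording is an assumption; it varies across versions):

    # Start a standalone Jupyter server that is not tied to any VS Code window;
    # kernels started against it stay alive for as long as this process runs.
    jupyter notebook --no-browser --port 8888

    # Copy the printed URL (http://localhost:8888/?token=...), then in VS Code
    # choose "Select Kernel" -> "Existing Jupyter Server..." and paste that URL.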

jrich100 avatar May 02 '23 13:05 jrich100

We run the jupyter server process on the same machine where VS Code is running [...]

@jrich100

It is unclear to me how my desired remote jupyter server is involved in your solution. What am I missing? I want to connect to a remote (not local!) jupyter server from my local VS Code, and I want to keep the kernel on that remote server alive so that I can reconnect to it whenever I like and access my previously created variables. The issue currently is that the kernel dies whenever I close VS Code.

bbantal avatar May 02 '23 20:05 bbantal

The issue currently is that the kernel dies whenever I close VS Code.

This should not happen; if it does, it's a bug. I think by "I want to connect to a remote" you mean that you are connecting to the remote server with VS Code over SSH or the like. Is that correct? If that's the case, then yes, the kernels will die when VS Code is closed.

DonJayamanne avatar May 02 '23 21:05 DonJayamanne

I think by "I want to connect to a remote" you mean that you are connecting to the remote server with VS Code over SSH or the like. Is that correct? If that's the case, then yes, the kernels will die when VS Code is closed.

@DonJayamanne

Yes, that's exactly what I was trying to articulate! Ideally, the kernel wouldn't die, and I could just reconnect to it whenever I like, as long as it's kept running on the remote server. This feature would be immensely useful to me and, from what I can tell, to many others as well. Hence I wondered if there were any updates, or alternatively a temporary workaround.

bbantal avatar May 02 '23 22:05 bbantal

This feature would be very useful for many users, because it is simply common sense: if the SSH connection is closed for some reason, we want to be able to reconnect and get back the same state of the kernel and cells. As it stands, after reloading VS Code or just reconnecting over SSH, I can lose all of the work and code I wrote in the cells, because the kernel went down and I forgot to press Ctrl+S every 5 minutes.

I think it is not so difficult: just create the kernel in a remote fashion, so that it does not rely on the current SSH connection, and after reloading SSH or the entire VS Code, offer to choose from the existing running kernels.

metya avatar Aug 18 '23 08:08 metya

Another requirement for this: https://github.com/microsoft/vscode-jupyter/issues/14446#issuecomment-1757045873

DonJayamanne avatar Oct 11 '23 07:10 DonJayamanne

I have to say this should be a crucial feature for Visual Studio Code by now. Currently, losing the connection to remote tunnels means losing all of your work/progress, which makes it hard to do almost any important work.

AnakinShieh avatar Nov 18 '23 08:11 AnakinShieh

I'd love this. This is my biggest pain point with vscode.

rsargent avatar Nov 21 '23 18:11 rsargent

One more thing to note:

In practice, many of us are running/testing/benchmarking research code whose varying levels of maintenance (I pulled a Python 2 repo the other day) mean that project-specific dev containers are pretty common.

The upshot is that the remote kernel for any given notebook is running inside the dev container for that project so that it can make use of the relevant environment.

This results in the following workflow:

  1. set up a project in a dev container on some workstation or possibly an HPC allocation
  2. open up laptop and remote-ssh to the workstation
  3. open project folder in container
  4. open notebook.ipynb in project
  5. start kernel in that project environment
  6. be able to connect and reconnect to that kernel as above

I don't know if that makes implementing this insanely important feature more or less complicated....

Last of all thanks @DonJayamanne (and everyone else) for your awesome work making vscode better every day for python!

mkarikom avatar Nov 21 '23 21:11 mkarikom

I have a similar work scenario to @mkarikom's. I have to deal with some nasty Python environments whose setup might only be possible via containers (which is quite common in academia), with the result that I can only use remote kernels. But for now the Pylance support for remote kernels is broken, so the dev experience is not optimal.

I used to mount the container image and point the Python extension's interpreter path setting to the interpreter inside the container mount, but now this is impossible: having the Python interpreter path setting influence the behaviour of the Jupyter extension was considered a bug and has been fixed.

TTTPOB avatar Dec 08 '23 08:12 TTTPOB

I'd like to bump this issue. For me the lack of this feature is a deal-breaker, and I use JupyterLab over vscode for this reason, despite vscode having a better linter, Copilot, and better Vim keybindings; I suspect many people who have any kind of remote data-science/machine-learning workflow feel similarly. I have had this issue for the past 2 years but only just found this thread.

For what it's worth, I am willing to volunteer to help address this. I am not sure what the policy is for accepting pull requests from those outside the core team, but I thought I'd put that out there.

ando600 avatar Dec 30 '23 04:12 ando600

+1. The lack of this feature is a deal-breaker for anyone doing research and quantitative work, where we need to rapidly experiment until we find what works well so that we can port it into a standalone script.

andreimargeloiu avatar Jan 11 '24 13:01 andreimargeloiu

Maybe most people are already aware of this workaround, but here's what I do (sketched as commands below):

  1. open a bash terminal session on the remote machine
  2. run tmux on that
  3. run ipython inside that tmux session
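In command form, that workflow looks roughly like this (the user, host, and session names are placeholders):

    # 1. open a shell on the remote machine
    ssh user@remote-host

    # 2. start a named tmux session; it survives SSH disconnects
    tmux new -s notebook

    # 3. run ipython inside the tmux session; it keeps running if the connection drops
    ipython

    # later, after reconnecting over SSH, reattach to the same session:
    tmux attach -t notebook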

man-shu avatar Jan 19 '24 14:01 man-shu