vscode-remote-release WSL 2 connection lost after too much time in Modern Standby

VS Code Version: 1.47.3
Local OS Version: Windows 10 19042.421
Remote OS Version: Ubuntu 18.04
Remote Extension/Connection Type: WSL

Steps to Reproduce:

Start WSL 2 on a device with Modern Standby, e.g. a Surface.
Put the device into Modern Standby for several hours, e.g. overnight.
Wake up the device after several hours, e.g. in the morning.

In the morning, VS Code says it's lost its connection with WSL and needs to reload the window. However, calling wsl -l -v shows that the WSL 2 VM is indeed still running.

I've written about Modern Standby specifically because I don't know if this behavior occurs with normal standby as well, but I imagine it's probably specific to Modern Standby because Windows is still active there, but not during normal standby afaik.

Jul 31 '20 10:07 adrianghc

@aeschli Is this just another variation of challenges with reconnecting after sleep?

Aug 01 '20 00:08 Chuxel

It seems the WSL remote extension is using TCP/IP for connectivity to the WSL 2 container. When the system enters modern standby, Hyper-V pauses running VMs and Hyper-V-VmSwitch "disconnects" itself from the host. On a system with plenty of virtual switches, Hyper-V will sometimes take a full minute to restore connectivity of all of those virtual switches, further exacerbating this problem.

Replacing the regular TCP/IP sockets with Hyper-V sockets (AF_VSOCK on the Linux side and AF_HYPERV on the Windows side) should solve this problem.

Sep 09 '20 07:09 jamestut

Let's see if increasing the timeout helps.

Sep 09 '20 14:09 aeschli

I wasn't able to reproduce the issue.

Sep 28 '20 20:09 aeschli

Lately I have found that this issue does not always occur, or at least it seems to happen less frequently. Perhaps something seemingly unrelated somehow mitigated this to a degree. I'll try to see if I find a pattern, and will try to get a better idea of how often it still occurs.

Sep 28 '20 20:09 adrianghc

The issue happens (much) less often if I can get the Hyper-V virtual switches to wake up faster, at least for short modern standby sessions (e.g. the typical duration of a lunch break, a commute, etc.). Ways I have found to speed this up:

Reducing the number of Windows Firewall rules.
The Windows Firewall rules on Windows 10 can easily reach thousands of entries, because they're automatically created every time an app tries to bind and listen to a TCP/UDP port.
Disabling the NetSetupSvc after WSL 2 is started.
This service is responsible for setting up the "vEthernet (WSL)" adapter, so if this service is disabled at first system boot, WSL 2 will fail to run. However, once WSL 2 has been started, keeping this service enabled somehow causes temporary packet loss between WSL 2 and the Windows host. The loss starts around 1 to 2 seconds after the system wakes up from modern standby and lasts for around 1 to 2 seconds as well. I can observe this behavior on both my Dell XPS 15 9560 and my Surface Go, both running 19041.508.

Keeping the system in modern standby for an extended amount of time (e.g. overnight) will almost certainly reproduce this behavior. Most of the time it will also result in VSCode asking to "reload the window" (like in #3126).

Therefore, I really hope that you migrate to AF_VSOCK/AF_HYPERV instead. Or perhaps, make this extension open source so I can modify it. 😁

Oct 04 '20 23:10 jamestut

Yeah this problem is horrible please escalate. I find myself having to restart vscode 2x to 3x daily because I do all my work in WSL remote and the reload ruins my terminal sessions with running programs. Terrible for programmer productivity.

Also for whatever stupid clearly related reason workspace settings on remote are not honored. So my windows always lose their color scheme (I use a different color scheme for each project). I'll open a second ticket for this issue but mention it because this chore of manually resetting color schemes in 6 windows wouldn't be relevant except for this super weird reload bug.

Please fix it ASAP thanks.

Oct 08 '20 05:10 ninjaa

I'd add that once WSL 2 is started, the hns service can be disabled to prevent the Hyper-V virtual switches from disconnecting when the computer enters modern standby. On my Dell XPS 15 9560, disabling this service doesn't seem to prevent the computer from entering DRIPS either.

The only downside is that if we stop WSL 2 (e.g. by logging off or running wsl --shutdown), it won't be able to start again. In that case, to restart WSL 2 without rebooting the computer, we can take the following steps in order:

Stop the LxssManager service.
Enable and start the hns service.
Start the WSL 2 using the regular means.
Stop and disable the hns service.
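
For anyone wondering how the four steps above might be scripted, here is a minimal sketch. It assumes the standard Windows sc.exe service control tool and an elevated prompt. The service names are the ones given above. The function names and the use of "wsl.exe -e true" as the "regular means" of starting WSL 2 are my own assumptions for illustration. It defaults to a dry run that only lists the commands.

```python
import subprocess


def restart_wsl2_commands(start_wsl=("wsl.exe", "-e", "true")):
    """Return the command sequence for the four steps, in order.

    start_wsl is an assumed stand-in for "the regular means" of starting
    WSL 2; any command that launches the distro should do.
    """
    return [
        ["sc.exe", "stop", "LxssManager"],                 # 1. stop LxssManager
        ["sc.exe", "config", "hns", "start=", "demand"],   # 2. enable hns (manual start)...
        ["sc.exe", "start", "hns"],                        #    ...and start it
        list(start_wsl),                                   # 3. start WSL 2
        ["sc.exe", "stop", "hns"],                         # 4. stop hns...
        ["sc.exe", "config", "hns", "start=", "disabled"], #    ...and disable it
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    printed = []
    for cmd in commands:
        line = " ".join(cmd)
        print(line)
        printed.append(line)
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(cmd, check=True)
    return printed
```

Calling run(restart_wsl2_commands()) only prints the sequence. Passing dry_run=False from an elevated prompt would actually execute it, so treat that as a hack in the same spirit as the workaround itself.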

I have done this for the last 4 days and have had no more reconnecting / "please reload window" after resuming from modern standby, even when the modern standby lasted overnight. It now works flawlessly as it should. But I have to stress again, this is just a hack, and there should be no reason for VSCode not to use AF_HYPERV/AF_VSOCK.

Also, to the Hyper-V team (where can I file this with them, by the way?): there should be no reason to disconnect the virtual switches when the computer enters modern standby. I haven't found a single issue so far with the hns service disabled, even with regular Hyper-V VMs. Both the default NAT network and bridged networks work flawlessly as well, as long as I disable hns only after starting at least one VM that uses the virtual switches.

Oct 10 '20 10:10 jamestut

For me the issue started when I installed 2 extensions on VSCode to the point that VSCode would just work for 2-3 seconds only. After deleting them, it is working so far. Extensions are:

https://marketplace.visualstudio.com/items?itemName=jebbs.plantuml
https://marketplace.visualstudio.com/items?itemName=jkeys089.plantuml

Nov 24 '20 17:11 minamaged113

I wonder if this is happening because of clock drift, especially after hibernation? I always get this "Cannot reconnect" from VS Code after waking my PC from hibernation, and there are known problems with the WSL clock not re-syncing with the system clock: https://github.com/microsoft/WSL/issues/5324

Jan 10 '21 00:01 drkvogel

I wonder if this is happening because of clock drift, especially after hibernation? I always get this "Cannot reconnect" from VS Code after waking my PC from hibernation, and there are known problems with the WSL clock not re-syncing with the system clock: microsoft/WSL#5324

I don't think so. This happens to my system with hibernation completely turned off (powercfg -h off) during modern standby. And clock drift didn't happen at all under modern standby AFAIK, at least on my systems. Further, disabling the hns service (in my previous post) pretty much solved the problem. After disabling the hns service, VSCode's remote extension does still sometimes time out, but reconnecting happens instantly (e.g. in the blink of an eye), and I have never been asked to reload the window.

Now I am running Insider build 20279, and this problem is mostly gone (much less frequent compared to in 19041). However, disabling the hns service still yields the best result.

Jan 11 '21 01:01 jamestut

Seems related to https://github.com/microsoft/vscode-remote-release/issues/3158 and https://github.com/microsoft/vscode-remote-release/issues/3126.

This is incredibly annoying, because the only way I've found to break the reload cycle is to reload the window, sit there waiting for it to pop up with the 5 second reconnect countdown, and then interrupt that 5 second countdown by clicking the "reload window" button on that message. If I miss that 5 second interval and let it try to reconnect on its own, it will fail again, and the cycle starts over.

Jan 21 '21 19:01 tristanbrown

@alexdima see https://github.com/microsoft/vscode-remote-release/issues/3454#issuecomment-764893709

Jan 22 '21 08:01 aeschli

This is incredibly annoying, because the only way I've found to break the reload cycle is to reload the window, sit there waiting for it to pop up with the 5 second reconnect countdown, and then interrupt that 5 second countdown by clicking the "reload window" button on that message. If I miss that 5 second interval and let it try to reconnect on its own, it will fail again, and the cycle starts over.

@tristanbrown As a workaround, you can always do F1 > Developer: Reload Window. You could even assign a keybinding to it so you don't need to wait.

The "Reload Window" button is shown all the time once the modal dialog comes in:

But @aeschli I'm missing the context here. Is it necessary to do two reloads for the window to recover in this case?

Jan 22 '21 09:01 alexdima

A single reload does the trick, right @dbaeumer ?

Jan 25 '21 09:01 aeschli

Right, a single window reload makes it work for me again.

Jan 29 '21 08:01 dbaeumer

This issue recently started happening for me too. I don't know whether it's due to a Windows update or an extension update. I put my PC in hibernation overnight.

It used to work flawlessly.

Feb 11 '21 09:02 ad-on-is

Stop the LxssManager service.

Enable and start the hns service.

Start the WSL 2 using the regular means.

Stop and disable the hns service.

@jamestut How do you stop LxssManager and enable/disable hns?

Feb 18 '21 01:02 drkvogel

Guys, I think my issue #4353 may be related to this. I hibernate my computer every night. I just restarted my computer and could not reproduce it, while it happened every time before the restart. I am pretty sure I restarted yesterday because of an update so I think 1 hibernation is enough to trigger the problem. I guess after hibernating tonight it will start happening again.

I wonder if something happens after hibernation which makes reconnection not work at all? The reproduction would be:

Open 2 projects on the same WSL 2 instance
Reload one of them
The other should lose connection and ask to reconnect/reload.

@aeschli, is there a way we can assist in fixing this?

Mar 03 '21 07:03 BladeMF

Update: After 6 hours (11AM to 5PM) of hibernation the problem does not occur. Will see tomorrow morning.

Mar 03 '21 15:03 BladeMF

https://github.com/microsoft/vscode-remote-release/issues/3126#issuecomment-836255588

May 10 '21 06:05 tebeco

I have recently pushed a change which might improve things in this area -- https://github.com/microsoft/vscode/commit/32d29d71262ce330097e1f9d826344912d6203ad

Basically, if the VS Code renderer disconnected from the server before the laptop went to sleep (let's say at 10pm), and the laptop was then opened in the morning (let's say at 8am), the reconnection loop would attempt to reconnect exactly once. If WSL was fast enough to restore the network port forwarding, the reconnection would succeed on the first try. But if WSL was a bit slow in restoring the network port forwarding, the reconnection loop would give up after that single attempt, because it would think that more than 3hrs had elapsed and that the server would not wait so long for a reconnection.

I have changed the logic to attempt to reconnect 360 times before giving up, instead of giving up after 3hrs. The original idea was to have some kind of stop mechanism eventually, and trying multiple times (with a counted limit instead of a time limit) has the same characteristic.
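
To make the difference concrete, here is a small sketch contrasting the two stop conditions. It is not the actual VS Code code. The function names, the injectable clock and sleep parameters, and the use of OSError as the failure signal are assumptions for illustration. Only the 360 attempts and the 3hr limit come from the description above, and the 5-second pause is taken from the "waiting for 5 seconds" lines in the logs in this thread.

```python
import time


def reconnect_with_time_limit(connect, disconnected_at, clock=time.time,
                              limit_seconds=3 * 3600, delay_seconds=5.0,
                              sleep=time.sleep):
    """Old behaviour (sketch): give up once the wall clock says the limit passed.

    Returns the attempt number that succeeded, or None after giving up.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            connect()
            return attempt
        except OSError:
            pass
        # After an overnight sleep this is already true on the first failure.
        if clock() - disconnected_at > limit_seconds:
            return None
        sleep(delay_seconds)


def reconnect_with_attempt_limit(connect, max_attempts=360, delay_seconds=5.0,
                                 sleep=time.sleep):
    """New behaviour (sketch): give up only after a counted number of attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            connect()
            return attempt
        except OSError:
            if attempt < max_attempts:
                sleep(delay_seconds)
    return None
```

With a connect function that fails a couple of times while port forwarding is being restored, the time-limited loop gives up after one attempt if the machine slept for more than 3hrs, while the counted loop keeps trying and succeeds.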

So I'd like to ask if anyone that is using VS Code Insiders is still seeing issues in this area. The change has not yet made it to stable, but will be a part of our upcoming 1.58 release

Jun 29 '21 12:06 alexdima

I have recently pushed a change which might improve things in this area -- microsoft/vscode@32d29d7 … So I'd like to ask if anyone that is using VS Code Insiders is still seeing issues in this area. The change has not yet made it to stable, but will be a part of our upcoming 1.58 release

@alexdima I think your fix targets only one error in this complex bug: the reconnection timeout. I was experiencing this issue (more precisely, #3126, since I experience this issue on both WSL 1 and WSL 2 and my device does not support Modern Standby). After the fix, I am still experiencing the issue (at least on WSL 1, not tested on WSL 2). But now, digging deeper into the logs, I notice that the logged error messages have changed.

My logs from before the fix 32d29d7 are posted in https://github.com/microsoft/vscode-remote-release/issues/3126#issuecomment-807042511. For convenience I will repost the download links to the logs before the fix here:

wsl1.log wsl2.log

My log after the fix is here: wsl1-32d29d7-fix.log

Notice the date changed from 08-08 to 08-09 as I let my laptop sleep overnight, and I was greeted with the error in the morning.

The Unknown reconnection token error in the log leads me to this issue comment: https://github.com/microsoft/vscode-remote-release/issues/1616#issuecomment-686788922

I would be happy to give out more diagnostic information if necessary.

Aug 10 '21 04:08 kevin-he-01

This is incredibly annoying. I tried the disable hns service workaround but it does nothing to stop the issue from happening.

I keep getting this annoying screen that I cannot bypass nondestructively:

There is no "Reconnect Now" button. I have to either do a reload (and risk losing my terminal contents), or wait a minute or so for it to connect automatically. It seems that it is trying to connect to the WSL 2 VM but the connection attempt hangs for about half a minute before it times out (~20 seconds in this log; you can get this by subtracting 32 from 55 in the timestamps). The extension should at least provide an option to change the timeout.

Here is my log:

[2021-09-06 17:45:32.534] [renderer3] [info] [remote-connection][ExtensionHost][840ad…][reconnect] received socket close event (wasClean: false, code: 1006, reason: ).
[2021-09-06 17:45:32.537] [renderer3] [error] {"isTrusted":true}
[2021-09-06 17:45:32.540] [renderer3] [info] [remote-connection][ExtensionHost][840ad…][reconnect] starting reconnecting loop. You can get more information with the trace log level.
[2021-09-06 17:45:32.563] [renderer3] [info] [remote-connection][ExtensionHost][840ad…][reconnect] resolving connection...
[2021-09-06 17:45:32.791] [renderer3] [info] [remote-connection][ExtensionHost][840ad…][reconnect] connecting to 172.17.148.156:35647...
[2021-09-06 17:45:32.793] [renderer3] [info] [remote-connection][Management   ][10822…][reconnect] received socket close event (wasClean: false, code: 1006, reason: ).
[2021-09-06 17:45:32.794] [renderer3] [error] {"isTrusted":true}
[2021-09-06 17:45:32.795] [renderer3] [info] [remote-connection][Management   ][10822…][reconnect] starting reconnecting loop. You can get more information with the trace log level.
[2021-09-06 17:45:32.797] [renderer3] [info] [remote-connection][Management   ][10822…][reconnect] resolving connection...
[2021-09-06 17:45:32.803] [renderer3] [info] [remote-connection][Management   ][10822…][reconnect] connecting to 172.17.148.156:35647...
[2021-09-06 17:45:55.469] [renderer3] [error] [remote-connection][Management   ][10822…][reconnect][172.17.148.156:35647] socketFactory.connect() failed or timed out. Error:
[2021-09-06 17:45:55.470] [renderer3] [error] Error: WebSocket close with status code 1006
    at WebSocket.<anonymous> (vscode-file://vscode-app/c:/Users/REDACTED/AppData/Local/Programs/Microsoft%20VS%20Code/resources/app/out/vs/workbench/workbench.desktop.main.js:609:139641)
[2021-09-06 17:45:55.470] [renderer3] [info] [remote-connection][Management   ][10822…][reconnect] A temporarily not available error occurred while trying to reconnect, will try again...
[2021-09-06 17:45:55.470] [renderer3] [info] [remote-connection][Management   ][10822…][reconnect] waiting for 5 seconds before reconnecting...
[2021-09-06 17:45:55.544] [renderer3] [info] [remote-connection][ExtensionHost][840ad…][reconnect] reconnected!
[2021-09-06 17:46:00.473] [renderer3] [info] [remote-connection][Management   ][10822…][reconnect] resolving connection...
[2021-09-06 17:46:00.479] [renderer3] [info] [remote-connection][Management   ][10822…][reconnect] connecting to 172.17.148.156:35647...
[2021-09-06 17:46:00.537] [renderer3] [info] [remote-connection][Management   ][10822…][reconnect] reconnected!

The remote WSL extension is an utterly pre-alpha-grade, inaccessible, and closed source mess right now. This issue is in the backlog so I don't expect it to be fixed any time soon. Go switch to Linux and don't use Windows!

Sep 07 '21 03:09 kevin-he-01

@kevin-he-01 I'm sorry the experience is so rough right now around sleeping and WSL. While things around sleeping and reconnection have marginally improved over time, I agree with you that the experience here is not good.

From looking at your logs, I think the first issue is that while the client-side continues to attempt to reconnect after waking up from sleep, the server side will kill the remote extension host process after 3hrs. We use setTimeout in the server implementation to drop disconnected clients after being disconnected for more than 3hrs. It looks like when putting the computer to sleep for more than 3hrs, when it wakes up, the event loop on the server process will execute the scheduled setTimeout and the server will think that it has a client disconnected for more than 3hrs, and it will kill the remote extension host process. This would explain the Unknown reconnection token error in the log. I will do two things in this area:

improve the message to better distinguish the cases when the server dropped the client vs. when the server never saw the client.
look into creating an alternative to setTimeout, which doesn't work on wall-clock time, but in program execution time, such that the server would drop a client only after really running for 3hrs, not after 3hrs elapsed in the real world.
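
One way to picture the second item is a timeout that is driven by short ticks and caps how much time any single tick may consume, so a long suspend counts as at most one tick. The sketch below is only an illustration of that idea under stated assumptions, not the implementation that shipped. The class name, the max_step_seconds parameter and the injectable clock are hypothetical.

```python
import time


class ExecutionTimeTimeout:
    """Timeout that counts (approximately) running time, not wall-clock time.

    poll() is expected to be called regularly, e.g. from a periodic timer.
    Each call subtracts the time since the previous call, capped at
    max_step_seconds, so a gap caused by the machine sleeping costs at
    most one step instead of the whole sleep duration.
    """

    def __init__(self, timeout_seconds, max_step_seconds=60.0, clock=time.time):
        self.remaining = float(timeout_seconds)
        self.max_step = float(max_step_seconds)
        self._clock = clock
        self._last = clock()

    def poll(self):
        """Account for elapsed time; return True once the timeout has expired."""
        now = self._clock()
        elapsed = max(now - self._last, 0.0)  # ignore backwards clock jumps
        self._last = now
        self.remaining -= min(elapsed, self.max_step)
        return self.remaining <= 0
```

With a 3hr timeout and one poll per minute, a 10-hour sleep between two polls only uses up one minute of the budget, so the client would still be kept until the server has really been running for about 3hrs.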

The last thing you are noticing where the first TCP/IP connection does not go through to WSL might stem from the WSL implementation at the OS level. Maybe immediately after waking up from sleep, their network adapters are not immediately set up correctly. I don't have an immediate idea on how to tackle this.

Sep 07 '21 07:09 alexdima

I wonder if it's worth mentioning that since upgrading my Windows laptop from an HP Elitebook 840 G3 to a Microsoft Surface Laptop 4, I have had very little trouble with this issue. Maybe because it's a much faster machine? i.e. are there timeouts on trying to reconnect that are too short for slower machines to complete on waking from hibernation?

Sep 07 '21 10:09 drkvogel

@kevin-he-01 I've pushed improvements to the error message, and a fix for our use of setTimeout, which could cause the server to drop a client when waking from sleep. These changes will be available in tomorrow's insiders (the build will run in about 16hrs), please give it a try and let us know if things are better after resuming from sleep.

Sep 07 '21 13:09 alexdima

@alexdima Thanks for your response. Just to clarify, my first post here is about my old laptop without modern standby (an Asus Vivobook) running WSL 1 (though the same issue occurred on WSL 2 on that machine as well). By running ps -ef before and after reloading, I determined that you are right: the remote server restarted itself (the PIDs changed), leading to the Unknown reconnection token error. Your setTimeout fix will probably fix that.

My second post is about another reconnection issue on my new Surface Laptop 4. As drkvogel said, the experience is better with the new laptop (assuming I use WSL 1 on the new laptop), presumably because modern standby does not yank local (happening on localhost) TCP connections:

I wonder if it's worth mentioning that since upgrading my Windows laptop from an HP Elitebook 840 G3 to a Microsoft Surface Laptop 4, I have had very little trouble with this issue. Maybe because it's a much faster machine? i.e. are there timeouts on trying to reconnect that are too short for slower machines to complete on waking from hibernation?

However, on my new laptop with WSL 2, the experience is not so good when I put my computer to sleep (modern standby). It is probably some Hyper-V/WSL2 network switching issue. It seems that the host-VM connection is tampered with when the computer enters some "deeper" sleep stage in modern standby (more than a few minutes). I get the "Disconnected. Attempting to reconnect" screen with no way to cancel and restart the connection attempt instantly (I can only reload or wait). I tried the hns workaround but nothing changes.

@alexdima, I wonder if you can provide some insights on how VSCode remote communicates with WSL? Is there some heartbeat that detects disconnection? It seems that it is based on WebSocket just by looking at the logs. I know the extension is closed source. I am asking this because I need a way to reproduce the WSL 2 connection hang (~20 seconds reconnection time) using visible and simpler code snippets so I can report this issue to WSL and see what workarounds are available. Currently, I am not able to reproduce any network delay using any code I write: it only happens when VSCode tries to reconnect.

Apologies for having 2 issues to track at a time.

Sep 07 '21 21:09 kevin-he-01

Today's insiders contains the setTimeout fix, so the server side should now wait 3hrs for the client side to reconnect after a wake from sleep.

We have our own heartbeat mechanism to detect disconnection, but from looking at the log from your second post, it doesn't look like that is to blame here. In this case, it looks like we receive two "socket close events" from the network stack (we use WebSockets in an Electron renderer process, which use the Chromium network stack) and that is what triggers the reconnection.

But what is curious to me is the order of events. Here is your log a bit trimmed down:

[17:45:32.534] [ExtensionHost] received socket close event (wasClean: false, code: 1006).
[17:45:32.791] [ExtensionHost] connecting to 172.17.148.156:35647...
[17:45:32.793] [Management   ] received socket close event (wasClean: false, code: 1006).
[17:45:32.803] [Management   ] connecting to 172.17.148.156:35647...
[17:45:55.469] [Management   ][172.17.148.156:35647] socketFactory.connect() failed or timed out. Error: WebSocket close with status code 1006
[17:45:55.470] [Management   ] A temporarily not available error occurred while trying to reconnect, will try again...
[17:45:55.470] [Management   ] waiting for 5 seconds before reconnecting...
[17:45:55.544] [ExtensionHost] reconnected!
[17:46:00.479] [Management   ] connecting to 172.17.148.156:35647...
[17:46:00.537] [Management   ] reconnected!

And the interpretation:

we create a new WebSocket for the ExtensionHost connection [A]
12ms later we create a new WebSocket for the Management connection [B]
23 seconds later we receive an error from the connection [B]
74ms later the connection [A] is up and running, having finished the reconnection flow
5 seconds later we create a new WebSocket for the Management connection [C]
58ms later the connection [C] is up and running, having finished the reconnection flow
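
The intervals in this interpretation can be re-derived from the trimmed log with a few lines of Python. This is just a convenience sketch. The helper name and the dictionary keys are made up for illustration, and the timestamps are copied from the log above.

```python
from datetime import datetime


def ms_between(start, end):
    """Milliseconds between two HH:MM:SS.mmm timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return round(delta.total_seconds() * 1000)


# Timestamps copied from the trimmed log.
intervals = {
    "A connecting -> B connecting": ms_between("17:45:32.791", "17:45:32.803"),
    "B connecting -> B error":      ms_between("17:45:32.803", "17:45:55.469"),
    "B waiting -> A reconnected":   ms_between("17:45:55.470", "17:45:55.544"),
    "B waiting -> C connecting":    ms_between("17:45:55.470", "17:46:00.479"),
    "C connecting -> C reconnected": ms_between("17:46:00.479", "17:46:00.537"),
}

for name, ms in intervals.items():
    print(f"{name}: {ms} ms")
```

The second entry works out to 22666 ms, which is where the "23 seconds" above and the "~20 seconds" mentioned earlier come from.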

To me, it is peculiar that connection [B], that is created immediately after connection [A], fails, while connection [A] succeeds. But in any case, it looks like it takes up to 20+ seconds for either WebSocket to succeed or fail. I don't really know how WSL is waking up from sleep and if this is to be expected or not.

To create a standalone repro, I guess you could try running a web server inside WSL, opening a web page from it in Edge/Chrome, and having the page create 2 WebSockets to the WSL server. I think those two WebSockets will get closed (like they do for us) when waking the computer from sleep. Then, the page could try to create new WebSockets and log whether that succeeds (or how long it takes to succeed). Maybe this also depends on whether the network conditions change for the laptop between going to sleep and waking up (e.g. closing the lid at work, opening it at home).
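
As a simpler starting point than the browser-based repro, one could time plain TCP connection attempts from Windows to a port inside WSL 2 right after waking the machine. Note that this swaps WebSockets for raw TCP, so it is not the same thing as the repro described above and may well not show the delay. The function name and the default timeout are assumptions for illustration, and the host and port would be whatever server you run inside WSL.

```python
import socket
import time


def time_tcp_connect(host, port, timeout_seconds=30.0):
    """Try one TCP connection; return (succeeded, seconds_taken)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            succeeded = True
    except OSError:
        succeeded = False
    return succeeded, time.monotonic() - start
```

Calling this in a loop immediately after resume and logging the results would show whether connection attempts hang for 20+ seconds at the TCP level, or whether the delay only appears higher up the stack.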

@aeschli Maybe we need to ask someone from WSL what is expected in this case.

Sep 08 '21 11:09 alexdima

This happens to me without the laptop going to sleep. Just pops up the dialog at random intervals. Stand-alone WSL sessions are unaffected. Doing the reload reconnects. It's starting to be unusable...

node sure is maxing out the CPU, but total memory usage is less than 50%.

Oct 29 '21 20:10 rednevals