vscode-remote-release

Remote VSCode over SSH crashes EC2 instance

Open Zenexer opened this issue 5 years ago • 141 comments

Issue Type: Bug

I've been attempting to use the new remote VSCode feature to work with a project stored on an AWS EC2 instance. Each time I use it, it works fine for a few hours. Eventually, the whole instance stops responding. AWS reports the instance as unresponsive in the control panel, and I have to force-stop it. The screenshot/log feature on AWS doesn't show anything. Once I boot the instance back up, there's nothing in the logs; they just cut off at the time the instance stopped responding. I wish I had more information to give you, but I'm at a loss as to how to troubleshoot this.

Other notes:

  • If I leave htop or top open, when the instance finally crashes, there's no indication of anything unusual. Plenty of memory, etc.
  • VSCode complained about fs.inotify.max_user_watches being too low when I first started using it remotely. I increased it per VSCode's instructions and confirmed that it took effect. The warning went away, but the crashes still happen.
  • Even if I disconnect from the remote session, the instance will still crash.
  • Instance type: t3a.micro
  • Region: us-east-1
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 19.10
Release:        19.10
Codename:       eoan
cat /proc/cpuinfo
% cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD EPYC 7571
stepping        : 2
microcode       : 0x8001250
cpu MHz         : 2199.958
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 4399.91
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:
...

VS Code version: Code 1.43.2 (0ba0ca52957102ca3527cf479571617f0de6ed50, 2020-03-24T07:38:38.248Z)
OS version: Windows_NT x64 10.0.18363

System Info
Item                 Value
CPUs                 Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz (12 x 4008)
GPU Status           2d_canvas: enabled
                     flash_3d: enabled
                     flash_stage3d: enabled
                     flash_stage3d_baseline: enabled
                     gpu_compositing: enabled
                     multiple_raster_threads: enabled_on
                     oop_rasterization: disabled_off
                     protected_video_decode: unavailable_off
                     rasterization: enabled
                     skia_renderer: disabled_off_ok
                     video_decode: enabled
                     viz_display_compositor: enabled_on
                     viz_hit_test_surface_layer: disabled_off_ok
                     webgl: enabled
                     webgl2: enabled
Load (avg)           undefined
Memory (System)      31.95GB (18.31GB free)
Process Argv
Screen Reader        no
VM                   0%

Extensions (4)
Extension            Author (truncated)   Version
remote-ssh           ms-                  0.51.0
remote-ssh-edit      ms-                  0.51.0
remote-wsl           ms-                  0.42.4
cpptools             ms-                  0.27.0

Zenexer • Apr 02 '20
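The original post mentions having to raise fs.inotify.max_user_watches. As a rough diagnostic, here is a minimal Python sketch (Linux only; written for this thread, not part of VS Code). It prints the current limit and counts how many inotify watches each readable process holds, using the `inotify wd:` lines the kernel prints in /proc/<pid>/fdinfo/<fd> for inotify descriptors. The function names are hypothetical. Without root, it can only inspect your own processes.

```python
from pathlib import Path

LIMIT_FILE = Path("/proc/sys/fs/inotify/max_user_watches")


def count_inotify_watches(fdinfo_text: str) -> int:
    """Count watches in the text of one /proc/<pid>/fdinfo/<fd> file.

    For an inotify descriptor the kernel prints one line per watch, each
    starting with "inotify wd:"; other descriptors have no such lines.
    """
    return sum(1 for line in fdinfo_text.splitlines() if line.startswith("inotify wd:"))


def watches_per_process(proc: Path = Path("/proc")) -> dict[int, int]:
    """Map PID -> inotify watch count, skipping processes we cannot read."""
    result: dict[int, int] = {}
    for pid_dir in proc.iterdir():
        if not pid_dir.name.isdigit():
            continue  # skip /proc/self, /proc/sys, and other non-PID entries
        total = 0
        try:
            for fdinfo in (pid_dir / "fdinfo").iterdir():
                try:
                    total += count_inotify_watches(fdinfo.read_text())
                except OSError:
                    continue  # descriptor closed between listing and reading
        except OSError:
            continue  # process exited, or permission denied (not root)
        if total:
            result[int(pid_dir.name)] = total
    return result


def main() -> None:
    print("max_user_watches =", LIMIT_FILE.read_text().strip())
    usage = watches_per_process()
    # The limit applies per user, so compare it with the watches held by
    # that user's processes (for example the vscode-server ones).
    for pid, count in sorted(usage.items(), key=lambda kv: kv[1], reverse=True)[:10]:
        try:
            raw = Path(f"/proc/{pid}/cmdline").read_bytes()
        except OSError:
            continue  # process exited since it was counted
        cmd = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        print(f"{pid:>7} {count:>6}  {cmd[:80]}")


if __name__ == "__main__" and LIMIT_FILE.exists():
    main()
```

Run it before connecting, then again after the remote server has been up for a while. A watch count that keeps climbing toward the limit would at least implicate the file watcher. On its own, it would not explain a full-instance hang.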

Probably related to https://github.com/microsoft/vscode-remote-release/issues/2349. Are you opening a large folder and what extensions are you using?

roblourens • Apr 03 '20

Yes, it's a relatively large folder. I'm not using any special extensions as far as I'm aware.

Zenexer • Apr 04 '20

I don't think it's related to that:

  • Memory usage is reasonable
  • I'm not seeing those reports.*.json files

Zenexer • Apr 04 '20

@roblourens According to https://github.com/microsoft/vscode-remote-release/issues/1110 1GB of RAM isn't enough.

I don't even try on t2.micro since it will without a doubt lock up the instance. This issue mentions t3a.micro which has the same size 1GB of RAM.

Has VSCode reduced the memory requirements, or is this still likely the same issue?

clshortfuse • Apr 23 '20

> I don't even try on t2.micro since it will without a doubt lock up the instance. This issue mentions t3a.micro which has the same size 1GB of RAM.

Memory was my first guess, but it's definitely not running out. There's still plenty available when it eventually crashes.

What's even more confusing is that the instance will still crash even after I've exited VSCode. Once it's been launched, the countdown starts--within a few hours, it will crash whether or not VSCode is still open.

Zenexer • Apr 23 '20

Hm, I would guess that memory is somehow the issue here even though you say it doesn't seem to be using a lot.

roblourens • Apr 25 '20

I am concerned that this issue is far bigger than has been assumed. Using vscode remote, I can crash any EC2 instance of any size (e.g., m5.xlarge) or distro (e.g., Ubuntu/CentOS) by using it for only a very short period of time; even idling will kill it off. I attempted to contact the maintainers of the vscode remote plugin directly, but their MS Teams meeting failed, and after that their return email address was rejected as non-existent. I am going to need to direct my whole company to avoid the vscode remote services until this is fixed. It was a very exciting idea that showed great promise, and I was going to endorse it as the staple SSH access method for all our devs and DevOps, but I cannot do that now. I cannot knowingly use vscode remote against my production EC2 instances just to watch it kill them. FYI, the effect is that no SSH is possible from anything following the crash. The EC2 reboot command does not correct it; only a complete stop and then start clears up the crash. If you do not use vscode remote, you will be fine. So, I am back to the old SSH terminal methods until this new service is production-grade.

steelkorbin • Jun 19 '20

I can confirm what @steelkorbin has observed; it doesn't appear to be related to resource usage. No matter the instance type, using VSCode Remote arms a kernel time bomb. Even if VSCode Remote is exited, eventually the instance will crash--hard. No logs, no way to debug the issue. It could happen hours after VSCode Remote has been closed.

Zenexer • Jun 19 '20

Has there been any development on this one? I open a very small directory, yet the VM instance crashes, pretty hard. I have no other option but to reload the VM.

ssprakhar • Sep 07 '20

I do not see the vs code team acknowledging or engaging this.

steelkorbin • Sep 07 '20

Your report is concerning but I can't reproduce it. I work with vscode remote frequently on an ubuntu cloud VM with an uptime of almost a year. I would need more info but I'm not even sure what to ask for. The easy possibility is that you have some remote extension installed which is causing issues.

roblourens • Sep 10 '20

I'd be happy to do some testing if there are steps for running some kind of diagnostic log. It's happening for me, and I've tried turning off all autosuggestions (which seemed to be a trigger) and disabling the TS/JS extensions, but it's still happening randomly. I've resorted to using liximomo's sftp extension for now.

For now, I've been watching htop and it jumps to 100% right before it locks up the VM, and there doesn't seem to be any rhyme or reason to it.

pyg • Sep 10 '20
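Several reports here describe CPU jumping to 100% just before the freeze, while the system logs simply stop at that moment. One way to capture what happens right up to the hang is a small logger. It samples the aggregate `cpu` line of /proc/stat and fsyncs every sample to disk. This is a minimal Python sketch that assumes the kernel stays alive long enough to finish each write. The function names and log path are hypothetical, and a hang at the hypervisor level may still lose the last lines.

```python
import os
import time


def read_cpu_times(stat_text: str) -> tuple[int, int]:
    """Return (idle, total) jiffies from the aggregate "cpu" line of /proc/stat.

    Only the first eight counters are summed (user nice system idle iowait
    irq softirq steal); guest time is already included in user/nice.
    """
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            fields = [int(x) for x in line.split()[1:9]]
            idle = fields[3] + fields[4]  # idle + iowait
            return idle, sum(fields)
    raise ValueError("no aggregate 'cpu' line in /proc/stat text")


def busy_percent(prev: tuple[int, int], cur: tuple[int, int]) -> float:
    """Share of non-idle CPU time between two samples, in percent."""
    d_idle = cur[0] - prev[0]
    d_total = cur[1] - prev[1]
    if d_total <= 0:
        return 0.0
    return 100.0 * (d_total - d_idle) / d_total


def sample_cpu() -> tuple[int, int]:
    with open("/proc/stat") as f:
        return read_cpu_times(f.read())


def log_forever(path: str, interval: float = 5.0) -> None:
    """Append one line per interval and fsync it, so the most recent
    readings have a better chance of being on disk if the machine hangs."""
    prev = sample_cpu()
    with open(path, "a") as log:
        while True:
            time.sleep(interval)
            cur = sample_cpu()
            load1, _, _ = os.getloadavg()
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
            log.write(f"{stamp} cpu={busy_percent(prev, cur):.1f}% load1={load1:.2f}\n")
            log.flush()
            os.fsync(log.fileno())  # force the line out of the page cache
            prev = cur
```

For example, call `log_forever("/var/tmp/cpu-load.log")` from a script started under nohup before connecting with Remote-SSH. After the instance has been stopped and started again, read the tail of the file. /var/tmp was chosen because it is usually preserved across reboots, unlike /tmp on many distros. Each fsync costs a disk write, which is why the interval defaults to a few seconds rather than a fraction of one.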

@roblourens, it's crashing on me with a fresh install of vscode server on EC2, so the only extensions on the server are the ones that are there by default; no other extension was added. Funny thing is, if I go to Extensions or try to play around, the whole EC2 instance crashes, and there is nothing I can do to restore or access it. I have to restart the server. One more piece of information: although just opening the Extensions page makes things slow and brings them to the verge of crashing, I managed to get the TS/JS plugin disabled, and that has improved things a bit.

Disabling the above was suggested in the following post https://medium.com/good-robot/use-visual-studio-code-remote-ssh-sftp-without-crashing-your-server-a1dc2ef0936d

ssprakhar • Sep 12 '20

Would be helpful if you can figure out which process is the one using lots of CPU/memory. It may be the generic extension host process or it may be another process associated with some extension.

roblourens • Sep 12 '20
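To find out which process is using the memory, here is a minimal Python sketch (Linux only, with hypothetical function names). It lists the processes with the largest resident set size, read from the VmRSS line of /proc/<pid>/status, together with their command lines. For the CPU side, sorting top or htop by CPU gives the equivalent view.

```python
from pathlib import Path


def parse_vmrss_kib(status_text: str) -> int:
    """Return VmRSS (resident memory, KiB) from /proc/<pid>/status text.

    Kernel threads have no VmRSS line; report 0 for them.
    """
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0


def top_by_rss(proc: Path = Path("/proc"), limit: int = 10) -> list[tuple[int, int, str]]:
    """Return up to `limit` (rss_kib, pid, cmdline) tuples, largest first."""
    rows = []
    for pid_dir in proc.iterdir():
        if not pid_dir.name.isdigit():
            continue
        try:
            rss = parse_vmrss_kib((pid_dir / "status").read_text())
            raw = (pid_dir / "cmdline").read_bytes()
        except OSError:
            continue  # the process exited while we were reading it
        cmd = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        rows.append((rss, int(pid_dir.name), cmd))
    rows.sort(reverse=True)
    return rows[:limit]


if __name__ == "__main__" and Path("/proc/self/status").exists():
    for rss, pid, cmd in top_by_rss():
        # With a default install the remote server's processes usually run
        # from ~/.vscode-server, so that path shows up in their command lines.
        print(f"{rss / 1024:8.1f} MiB  {pid:>7}  {cmd[:100]}")
```

Taking a listing every few minutes while the remote session is open would show which process is growing: the extension host, a language server such as tsserver, or some other helper.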

OK I did a small test by re-typing an existing function under three conditions for five minutes each:

  1. vscode.typescript-language-features disabled. No freezing within 5 minutes.

  2. vscode.typescript-language-features enabled. Freezes within 5 minutes.

  3. vscode.typescript-language-features enabled with typescript.disableAutomaticTypeAcquisition disabled (per https://stackoverflow.com/questions/52935211/disable-tsserver-for-visual-studio-code/52936301). Freezes within 5 minutes.

To be clear, this is the extension disabled or enabled:

Name: TypeScript and JavaScript Language Features
Id: vscode.typescript-language-features
Description: Provides rich language support for JavaScript and TypeScript.
Version: 1.0.0
Publisher: vscode

FWIW, I think there is another bug report similar to this one that is specifically related to this extension. Not sure if it's still open.

pyg • Sep 12 '20

Can you share the project/repo you are working on? Is it very large?

roblourens • Sep 13 '20

> Can you share the project/repo you are working on? Is it very large?

It's a boilerplate Gatsby project:

gatsby new MyProject

Not large. Super small. Tomorrow I will try on a repo with just one small text file, and update you.

ssprakhar • Sep 13 '20

Is anybody having these issues with Azure? I only use EC2, and it's pretty much a guaranteed way to crash the VM. I use Amazon Linux v1 and v2, and both have had issues.

I haven't tried the Ubuntu kernel, but if that doesn't have the issue, then we can possibly narrow it down to Amazon's kernel.

Edit: Looks pretty solid on the Ubuntu 20.04 kernel.

Edit2: Died after installing the eslint extension remotely.

clshortfuse • Sep 14 '20

[Screenshot from 2020-09-14, 10:54:29 AM]

Usage just balloons up. I'm wondering if it's Amazon that kills the server for using all the CPU credits. SSH will connect after this happens, but nothing else:

OpenSSH_8.1p1, LibreSSL 2.7.3
debug1: Reading configuration data /Users/carlos/.ssh/config
debug1: /Users/carlos/.ssh/config line 1: Applying options for 18.206.XXX.XXX
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 47: Applying options for *
debug1: Connecting to 18.206.XXX.XXX [18.206.XXX.XXX] port 22.
debug1: Connection established.
debug1: identity file /Users/carlos/.ssh/XXXX-key-pair.pem type -1
debug1: identity file /Users/carlos/.ssh/XXXX-key-pair.pem-cert type -1
debug1: identity file /Users/carlos/.ssh/XXXX-key-pair.pem type -1
debug1: identity file /Users/carlos/.ssh/XXXX-key-pair.pem-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_8.1

Edit: Even stopping and starting the instance won't allow it to keep working. It seems VSCode just uses up all the CPU credits and Amazon doesn't like that. The server won't even be allowed to start up since there's no credits left. I can't even open a regular SSH anymore, even after reboot.

clshortfuse • Sep 14 '20

My issues are on the EC2 Ubuntu image. I could start an instance and give Rob access to reproduce if he wants, some time later this week.

Keen

pyg • Sep 14 '20

I'm pretty sure it's RAM. When I open a project, usage balloons up to 750M/979M of RAM. Closing VSCode drops it down to 164M/979M.

I opened VSCode again and it ballooned to 754M again. The biggest culprit is extensions/node_modules/lib/tsserver.js. I don't think the issue is so much the TS server itself as it is the lack of any memory limit whatsoever. It'll consume memory as it sees fit until it halts the system. At this point, just opening a project leaves me ~170M to work with.

What's interesting is that the node process is run with --max-old-space-size=3072, which makes little sense on a 1GB machine. I'd wager that's actually worse than passing no flag at all, since I'd imagine V8 would otherwise detect the available system memory and impose a more rational limit. We're pretty much instructing it to use more RAM than is available, so I guess a crash shouldn't be unexpected.

I've modified ~/.vscode-server/data/Machine/settings.json to use:

{
  "typescript.tsserver.maxTsServerMemory": 256,
  "files.maxMemoryForLargeFilesMB": 384
}

The defaults are 3072 and 4096. Let's see if this helps now.

Edit: That still caused a crash, because usage ballooned to over 90%. I tried using 64M for TsServer, but it seems 128 is a forced minimum. It's running with --max-old-space-size=128, and it looks more stable now. Memory drops to 400M when approaching 900M.

Edit2: Yep, TsServer is crashing because it runs out of RAM. Now I get an error saying TsServer has crashed 5 times in a row. The output is:

Remote SSH: dev-test
OS Linux x64 5.4.0-1021-aws
CPUs Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz (1 x 2400)
Memory (System) 0.96GB (0.09GB free)
VM 0%

Better TsServer crashes because it's out of RAM than crashing the whole system though.

clshortfuse • Sep 14 '20
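The settings.json change described in the comment above can be scripted. This is a minimal Python sketch, not an official tool. It merges keys into ~/.vscode-server/data/Machine/settings.json, the path given in that comment. It refuses to overwrite the file if it cannot parse it. For example, VS Code allows comments in settings files, but Python's json module rejects them. The helper names are hypothetical. The values are the ones tried in the comment for a 1 GB instance, not recommended defaults.

```python
import json
from pathlib import Path

# Machine-level settings path mentioned in this thread; adjust if yours differs.
SETTINGS = Path.home() / ".vscode-server" / "data" / "Machine" / "settings.json"


def merge_settings(existing_text: str, overrides: dict) -> str:
    """Return settings JSON text with `overrides` merged in.

    Raises json.JSONDecodeError for text json cannot parse (e.g. JSONC
    with comments), so existing settings are never silently dropped.
    """
    current = json.loads(existing_text) if existing_text.strip() else {}
    if not isinstance(current, dict):
        raise ValueError("settings.json must contain a JSON object")
    current.update(overrides)
    return json.dumps(current, indent=2) + "\n"


def apply_overrides(overrides: dict, path: Path = SETTINGS) -> None:
    """Merge `overrides` into the machine-level settings file on disk."""
    text = path.read_text() if path.exists() else ""
    new_text = merge_settings(text, overrides)  # parse before touching the file
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(new_text)
```

For example, `apply_overrides({"typescript.tsserver.maxTsServerMemory": 128, "files.maxMemoryForLargeFilesMB": 384})`. Reloading the remote window is the simplest way to make sure tsserver picks up the new limit.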

The amount of RAM has nothing to do with it. This issue has taken down m5.large (8GB), m5.xlarge (16GB), and m5.2xlarge (32GB) instances just as fast as any of the T series. I wish it were just a RAM capacity issue.

steelkorbin • Sep 15 '20

@steelkorbin The amount of RAM is 100% the reason why mine crashes. I'm not sure why your servers are having issues though.

How much free memory do you have right before it locks up?

clshortfuse • Sep 15 '20

> Edit: Even stopping and starting the instance won't allow it to keep working. It seems VSCode just uses up all the CPU credits and Amazon doesn't like that. The server won't even be allowed to start up since there's no credits left. I can't even open a regular SSH anymore, even after reboot.

You should still be able to boot a t#-series instance after using up all the credits, and the instance should continue to run if it's already running; it'll just run slower. If it's not booting at all or has stopped responding, that's a sign that something else might be wrong.

> The amount of RAM is 100% the reason why mine crashes.

It seems to be pretty well established at this point that 1 GiB RAM is not enough and will cause a crash. That doesn't explain the "ticking time bomb" kernel panic that I was initially reporting, though--that happens regardless of available RAM, and it happens even if all VSCode processes have been killed. It could happen hours after the processes have exited, with no observed resource usage constraints.

@clshortfuse, note that this issue is very specifically for the whole machine crashing, without any logs. If it's just the VSCode processes that are crashing, especially due to resource constraints, that's probably a separate issue.

Zenexer • Sep 16 '20

I don't know what other resource would be consumed until the remote is unreachable. Does the number of open file handles increase constantly? lsof | wc -l (I'm just grasping at straws here)

roblourens • Nov 07 '20
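On the open-file-handles question: `lsof | wc -l` also counts entries that are not open file descriptors, such as memory-mapped libraries and working directories, so it tends to overstate the number. A cheaper system-wide signal is /proc/sys/fs/file-nr. It holds three numbers: allocated file handles, free allocated handles (always 0 on modern kernels), and the system-wide maximum. Below is a minimal Python sketch that samples it periodically. The function names and default interval are hypothetical.

```python
import time


def parse_file_nr(text: str) -> tuple[int, int]:
    """Return (allocated, maximum) from the contents of /proc/sys/fs/file-nr.

    The file holds three numbers: allocated file handles, allocated-but-free
    handles (reported as 0 on modern kernels), and the system-wide maximum.
    """
    allocated, _free, maximum = (int(x) for x in text.split())
    return allocated, maximum


def watch_file_handles(interval: float = 60.0, samples: int = 10) -> list[int]:
    """Print and return the allocated-handle count `samples` times."""
    counts = []
    for i in range(samples):
        with open("/proc/sys/fs/file-nr") as f:
            allocated, maximum = parse_file_nr(f.read())
        counts.append(allocated)
        print(f"{time.strftime('%H:%M:%S')} allocated={allocated} max={maximum}")
        if i + 1 < samples:
            time.sleep(interval)  # no need to sleep after the last sample
    return counts
```

A count that rises steadily while the remote server is connected, and keeps rising after disconnecting, would support the leak theory. A flat count would make handle exhaustion an unlikely cause.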

With a t2.medium instance, I quite often run into this issue: CPU usage goes to 99%.

This happens even with TypeScript plugin disabled.

hylowaker • Nov 23 '20

I am experiencing the same issue with SSH Remote plugin when connecting to my AWS Lightsail instances.

My main instance (3GB RAM) stalls and becomes unresponsive within a few minutes after connection has been established. Command “htop” shows about a dozen or so identical vs-code-server processes that gradually fill up the RAM usage bar until all 3GB are used up and the instance becomes unresponsive.

I have disabled all plugins. My feeling is that there is a compatibility issue between VS Code and AWS as a whole that causes the SSH plugin to repeatedly start more and more service threads. It’s a real pity, since I’d love to have SSH integrated; as it stands, I’ve had to instruct the team to ban VS Code for the time being.

This behaviour appears on Ubuntu 18.04 running on AWS Lightsail with 160GB HDD and 3GB RAM, as well as other instances with 2GB and 1GB RAM.

patrickmau • Nov 24 '20

I can confirm that this is still an existing issue. At first I suspected that my own product was crashing the server, so I traced all the way down, watching and logging every process and its resource usage. Even a clean EC2 instance with only my dev files (WITHOUT running them) and a freshly installed VSCode Remote SSH plugin gets killed. It locks down the whole system, and the only way to recover is to completely stop the instance and start it again. That kills all PM2 instances and lifetime configs, which is very disappointing for dev instances. From now on I'm not using this plugin, and if someone wants to use it: NEVER USE IT ON A PRODUCTION AWS INSTANCE.

Edit: It wasn't a RAM usage problem in my case either. There was always plenty of RAM left when the instance crashed, with CPU load reaching 100% right before the crash.

ScripterSugar • Dec 14 '20

This issue is far more insidious than your average issue, but I do have this to contribute as an observable fact. (For the conspiracy theorists out there: AWS might be happy about this. Meh.) Given that AWS was a "NO-GO" for VS Code Remote, I did what any common person would do and pivoted away from AWS. I selected DigitalOcean for a comparison test, spun up a droplet with nothing on it, and launched VS Code attached. Five minutes later, "POW", dead, just like at AWS. At this moment: tears, frustration, sigh, ugh, and I was looking for my hammer to fix this.

A few minutes later, while I was lamenting this horrible reproducible event, the remote disconnected my session, and VS Code began trying to reconnect automatically. Hmm, this never happens at AWS. My eyes widened: interesting, a slightly different failure result. If you have been in the industry long enough, you know that new failure effects can add to the diagnostic story and the ultimate resolution, so I saw this failure as a positive event. I stepped back and killed VS Code's attempts to reconnect to the dead droplet. I waited, then flipped over to the stats page for that droplet. The CPU had been taken to 100% and stayed pegged during my VS Code session, while RAM usage had not changed, with plenty more to give. Following the death of my VS Code session, the CPU dropped back down. Interesting: was the droplet still operable at this point? In AWS it would not be, or would it? So I did what any of us normally does while expecting a different result: I attached my VS Code again. It worked; the droplet was running. Then, after about 5 minutes, "POW", dead, and shortly after, "disconnected". I checked the stats: same event in the logs. No errors, just the CPU maxed out to its limit. So this has given me some insight to color this in a little more.

If you have any experience with AWS, you know how slow it is to change state on an instance. We have all become accustomed to it, and it is mostly a moot point; we all understand boot cycles. But AWS has the longest stop events; that is just the way they do it, and DigitalOcean is simply faster at this. Because I needed to get things up and running despite a failing VS Code tool, I never waited long enough to see whether my AWS instance would recover. I might try that later, but it has no value going forward. When any instance/droplet maxes out its allocated CPU, the various cloud providers let us run hot for a short period to allow for spikes. When that pressure could impact neighboring customers, though, they clamp down and throttle your CPU within their usage policy, which is fair. So the instance/droplet drops away due to resource constraints, but will come back once it can sort out all the CPU overhead. Nothing is broken, and nothing is wrong with AWS or DigitalOcean; it is just common service-provider policy dealing with VS Code activating something that is out of control. I don't know what is wrong; it is above my pay grade. I would place money on some background file-watcher process getting hung up on files/dirs that Linux will not let VS Code watch, or files/dirs it should not be watching. That is just a guess, and I am hoping the experts can see more than I can and sort this out.

steelkorbin • Dec 14 '20

Add me to the list: I can reliably crash an EC2 instance of any spec in a matter of minutes just by using VS Code Remote-SSH. I tried Lightsail too, which is basically an EC2 instance anyway. Interestingly, though, I can happily code PHP-based projects for hours with no problem using Remote-SSH; it's when I'm doing React-based projects that the server crashes. I cannot do anything other than forcefully shut it down on AWS and restart it. Could it be an issue with Node.js?

markrosoftuk • Dec 14 '20