firefox-translations-training icon indicating copy to clipboard operation
firefox-translations-training copied to clipboard

Interactive task's shell disconnects periodically

Open eu9ene opened this issue 1 year ago • 13 comments

Another issue I noticed is disconnecting of the shell when I switch a tab or window. Is it possible to connect with a regular ssh client?

eu9ene avatar Jan 11 '24 17:01 eu9ene

I checked with the Taskcluster folks -- interactive shells run through the same proxy as live logs. There's no way to connect to them with an ssh client. This is most likely caused by the same issue as #200.

bhearsum avatar Jan 12 '24 17:01 bhearsum

I use tmux and just refresh the page and then attach to tmux session again. I think it's good enough. We should document this behavior though.

eu9ene avatar Jan 24 '24 22:01 eu9ene

It's interesting that you were able to reconnect - the new image that fixes the known issue with disconnects hasn't been rolled out yet. This suggests that there is indeed a different cause of these. Once the new image is rolled out I'll try to reproduce and get to the bottom of it.

The next time you run an interactive task it would very helpful to keep the devtools open and save the Network tab output if you run into the issue. I imagine there's going to be one or more request that's hitting an error, and knowing what that is would go a long way to isolating it.

bhearsum avatar Jan 25 '24 00:01 bhearsum

This is waiting on steps to reproduce.

gregtatum avatar Feb 15 '24 16:02 gregtatum

I did a tree command with a lot of output, and it disconnected. I didn't get any network activity in the network monitor.

image

gregtatum avatar Feb 20 '24 20:02 gregtatum

I did a tree command with a lot of output, and it disconnected. I didn't get any network activity in the network monitor.

Thank you for capturing this! It certainly looks like the websock connection went away (I'm assuming Firefox is not logging that activity for some reason...). It's very interesting that it happened when there was a bunch of flurry of output though.

@gregtatum - do you happen to have the task id for this still? I'd like to try to trace this in the logs.

@lotas - Any insight you might have would be appreciated. As far as I can tell, the websocktunnel VM hasn't been restarted since January 31st. I also don't see any log messages that suggest the wst process in the VM has restarted.

bhearsum avatar Feb 21 '24 00:02 bhearsum

Task ID: dVns0FsuTG2VepHBE4gnyw

gregtatum avatar Feb 21 '24 03:02 gregtatum

Indeed, WST wasn't restarted around that time.

Nothing particularly interesting in the logs, although I can see a small spike in CPU/MEM and errors around the time (20:56 taken from screenshot) from websocktunnel:

push broadcasted : stream 14
stream 14: read waiting
(repeated ~380 times)

which are coming from https://github.com/taskcluster/taskcluster/blob/main/tools/websocktunnel/wsmux/stream.go#L214-L221 and https://github.com/taskcluster/taskcluster/blob/main/tools/websocktunnel/wsmux/stream.go#L303-L339

Doesn't look like a problem for me though, just transferring large chunks of data probably?

On the worker side:

Error occured: read tcp 127.0.0.1:53654->127.0.0.1:42540: read: connection reset by peer
Feb 20 12:56:03 translations-1-b-linux-v100-gpu-4-1tb--uter4ppsrwexafszjdlfgq start-worker 2024/02/20 20:56:02 Error while reporting error to client

2024/02/20 20:57:44 Error occured: websocket: close 1006 (abnormal closure): unexpected EOF

There are several of those, so something is happening between worker and WST that results in socket being closed.

@gregtatum

I didn't get any network activity in the network monitor.

I always thought that's a feature of network inspector and websocket connections - you need to open it before connection is established, otherwise it wouldn't show existing WS connections (open inspector, reload page)


something unrelated from worker logs, but probably worth checking @bhearsum
 2024/02/20 21:33:46 Saving file file-caches.json (absolute path: /file-caches.json)
 2024/02/20 21:33:46 Could not persist features: open file-caches.json: permission denied
 2024/02/20 21:33:46 goroutine 1 [running]:
 runtime/debug.Stack()
         /task_170559252012250/go/go/src/runtime/debug/stack.go:24 +0x5e
 main.HandleCrash({0xa19780?, 0xc0003fb200})
         /task_170559252012250/taskcluster/workers/generic-worker/main.go:335 +0x34
 main.RunWorker.func1()
         /task_170559252012250/taskcluster/workers/generic-worker/main.go:357 +0x2f
 panic({0xa19780?, 0xc0003fb200?})
         /task_170559252012250/go/go/src/runtime/panic.go:920 +0x270
 main.RunWorker.func2(0x1?)
         /task_170559252012250/taskcluster/workers/generic-worker/main.go:391 +0x2c
 main.RunWorker()
         /task_170559252012250/taskcluster/workers/generic-worker/main.go:498 +0x11ab
 main.main()
         /task_170559252012250/taskcluster/workers/generic-worker/main.go:159 +0x565
 2024/02/20 21:33:46  *********** PANIC occurred! ***********
 2024/02/20 21:33:46 open /tasks-resolved-count.txt: permission denied
 : PWD=/ ; USER=root ; COMMAND=/sbin/shutdown -h now 'generic-worker internal error'
 2024/02/20 21:33:46 WARNING: Could not get sentry DSN:
 CALL SUMMARY
 ============
 Method: GET
 Service: firefox-ci-tc.services.mozilla.com:
 Response Body:
 {
   "code": "InsufficientScopes",
   "message": "Client ID worker/fxci-level1-gcp/translations-1/b-linux-v100-gpu-4-1tb-standard/us-west1/8914644439234294299 does not have sufficient scopes and is missing the following scopes:\n\n```\nauth:sentry:generic-worker\n```\n\nThis request requires the client to satisfy the following scope expression:\n\n```\nauth:sentry:generic-worker\n```\n\n---\n\n* method:     sentryDSN\n* errorCode:  InsufficientScopes\n* statusCode: 403\n* time:       2024-02-20T21:33:46.680Z",
   "requestInfo": {
     "method": "sentryDSN",
     "params": {
       "project": "generic-worker"
     },
     "payload": {},
     "time": "2024-02-20T21:33:46.680Z"
   }
 }

lotas avatar Feb 21 '24 11:02 lotas

I think I managed to reproduce, or something similar, this by running head -n1 /dev/urandom. Unfortunately, the Firefox Dev Tools didn't give me anything useful. Now that I can repro at will I'll try to do in a local set up so that I can see what's going on with generic-worker and the wst sides.

bhearsum avatar Feb 21 '24 16:02 bhearsum

@lotas I did have it open while the error happened, but I could have opened after loading the page, so maybe it missed the websocket connection.

gregtatum avatar Feb 21 '24 16:02 gregtatum

@lotas I did have it open while the error happened, but I could have opened after loading the page, so maybe it missed the websocket connection.

ah ok, but I guess websocket connection would also show something like "connection was closed" at most. Hopefully Ben would be able to reproduce it locally again.

@bhearsum tc local env so far didn't have wst running .. I will try adding it to the compose setup, but I don't guarantee i can make it up and running quickly enough

lotas avatar Feb 21 '24 16:02 lotas

Did a bunch of digging here and filed https://github.com/taskcluster/taskcluster/issues/6848. As far as I can tell, any non-UTF8 character in the output will cause this. I think it's a bug in the shell UI - not any of the underlying services.

It's also possible this is a different bug I found with the same symptoms. @gregtatum - if it's possible to get the raw output from that tree command, that would help confirm that.

bhearsum avatar Feb 21 '24 20:02 bhearsum

awesome findings! We'll check

lotas avatar Feb 21 '24 21:02 lotas

We ended up fixing the non-UTF-8 bug I found in https://github.com/taskcluster/taskcluster/pull/6849.

As far as I know we haven't hit any interactive task issues since. Let's re-open this or open a new issue for them if/when we do.

bhearsum avatar May 21 '24 18:05 bhearsum