
[WIP][ws-manager] Re-create workspace pods on rejection

Open geropl opened this issue 1 year ago • 1 comment

Description

This PR enables ws-manager to re-create a workspace pod when it detects that the pod is stuck.

The root cause is an unfixed race condition in kubelet: node labels are out of sync, and suddenly the Pod, although already scheduled, no longer fits onto the node and is stuck. It shows up specifically on AWS (and also on GCP!) and, from our limited investigation, has a ~1% chance of occurring.
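For illustration, a hedged sketch (not the actual status.go change) of how such a rejection surfaces in the pod status: the kubelet fails the already scheduled pod at admission time, before any container starts.

```go
// Hedged sketch, not the actual status.go code: when the kubelet refuses to
// admit an already scheduled pod (e.g. because node labels are out of sync),
// it fails the pod with a reason such as "NodeAffinity" or "OutOfcpu" before
// any container starts. The exact reason strings below are an assumption.
package manager

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

func isPodRejected(pod *corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodFailed {
		return false
	}
	return pod.Status.Reason == "NodeAffinity" ||
		strings.HasPrefix(pod.Status.Reason, "OutOf")
}
```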

This change includes:

  • status.go detects the specific failure reason and marks the workspace with a new condition, PodRejected
    • the usual mechanism for a failed workspace pod is employed: it is deleted
  • workspace_controller.go gains a new branch (see the sketch after this list): if a workspace has the PodRejected: true condition and 0 pods, it:
    • aborts with a failed workspace if the retry limit is exceeded
    • resets workspace.status
    • resets metrics
    • increments PodRecreated
      • this in turn makes the controller run the regular "start workspace" phase again on its next pass
    • re-queues the workspace (with a configurable timeout)
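A minimal sketch of that controller branch, with stand-in types and names rather than the real workspace_controller.go:

```go
// Hedged sketch of the PodRejected branch described above; the types and
// field names are stand-ins, not gitpod's real Workspace CRD or reconciler.
package manager

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

type workspaceStatus struct {
	PodRejected  bool   // condition set by status.go when the kubelet rejected the pod
	PodRecreated int    // counter incremented on every re-creation attempt
	Phase        string // "Failed" once the retry limit is exceeded
}

type podRejectionConfig struct {
	MaxRetries     int
	RequeueTimeout time.Duration
}

// handlePodRejected runs once the rejected pod has been deleted: it resets the
// status so the next reconcile enters the regular "start workspace" phase,
// resets metrics, bumps the retry counter, and re-queues the workspace.
func handlePodRejected(_ context.Context, status *workspaceStatus, podCount int, cfg podRejectionConfig, resetMetrics func()) (ctrl.Result, error) {
	if !status.PodRejected || podCount != 0 {
		return ctrl.Result{}, nil
	}
	if status.PodRecreated >= cfg.MaxRetries {
		status.Phase = "Failed" // above the retry limit: abort with a failed workspace
		return ctrl.Result{}, nil
	}
	recreated := status.PodRecreated + 1
	*status = workspaceStatus{PodRecreated: recreated}
	resetMetrics()
	return ctrl.Result{RequeueAfter: cfg.RequeueTimeout}, nil
}
```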

TODO:

  • [x] add happy-path test
  • [ ] manual tests
    • [x] regular workspace starts
    • [ ] prebuild workspace starts
    • [ ] regular workspace + forced "pod rejection"
  • [ ] add "hit max retries" test
  • [x] add "WorkspaceStatus" test
  • [ ] do a load test, as we saw the CPU variant on GCP as well (docs and docs)
  • [x] discuss with folks to make sure we don't miss any edge cases

Related Issue(s)

Fixes ENT-530

How to test

Check that the tests are sensible and :green_circle:

Manual: regular workspace

Manual: prebuild workspace

Documentation

Preview status

gitpod:summary

Build Options

Build
  • [ ] /werft with-werft Run the build with werft instead of GHA
  • [ ] leeway-no-cache
  • [ ] /werft no-test Run Leeway with --dont-test
Publish
  • [ ] /werft publish-to-npm
  • [ ] /werft publish-to-jb-marketplace
Installer
  • [ ] analytics=segment
  • [ ] with-dedicated-emulation
  • [ ] workspace-feature-flags Add desired feature flags to the end of the line above, space separated
Preview Environment / Integration Tests
  • [ ] /werft with-local-preview If enabled this will build install/preview
  • [ ] /werft with-preview
  • [ ] /werft with-large-vm
  • [x] /werft with-gce-vm If enabled this will create the environment on GCE infra
  • [x] /werft preemptible Saves cost. Untick this only if you're really sure you need a non-preemptible machine.
  • [ ] with-integration-tests=all Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh. If enabled, with-preview and with-large-vm will be enabled.
  • [ ] with-monitoring

/hold

geropl · Sep 26 '24 08:09

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] · Oct 18 '24 15:10

Still work in progress

geropl · Oct 24 '24 09:10

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] · Nov 04 '24 15:11

@iQQBot I added the Loom for the load test (incl. the rejector running, i.e. "breaking" every workspace pod once), and addressed all the other comments. Would be great if you could review, and do a manual smoke test on the preview (once it's there again) to confirm that the "normal case" still works as expected.

Two things I still need to test (tomorrow):

  • have dev/rejector running against the preview
  • manually (+ via prebuild) start workspaces
  • check that the UX does not break :+1:

Once that's done, we can merge. :slightly_smiling_face:

geropl · Nov 14 '24 16:11

Done testing! :partying_face: :tada:

Added a small logging improvement, otherwise just need :heavy_check_mark: to :ship: !

geropl · Nov 15 '24 10:11

A question: If content-init has already started and the pod is re-scheduled onto the same node, and the content is big enough, what will happen? Is there a possibility of a data overlap issue?

The approach in this PR is the "wiping mode" as you found out, where first all running handlers etc. are stopped, and finally all content directories are removed. The operations are synchronized on multiple levels (e.g. one specifically for content-init), so this should* not happen.

*: saying "should", because the ws-daemon code was surprisingly complex, as we have three layers of feedback loops (k8s workspace, k8s pod, containerd) we have to synchronize. There still might be a hole somewhere, but after the extensive testing, I'm absolutely sure we are talking about 0.01% here, instead of the 1% we have atm.
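To make "synchronized on multiple levels" a bit more concrete, here is a minimal sketch of the idea (invented names, not the real ws-daemon code): a per-workspace lock serializes content-init against the wipe, which stops handlers first and removes the content directory last.

```go
// Hedged sketch, not ws-daemon's real structure: a per-workspace mutex
// serializes content-init and the wipe, so a wipe that races a late
// content-init either waits for it or runs strictly after it.
package daemon

import (
	"context"
	"os"
	"sync"
)

type workspaceState struct {
	mu          sync.Mutex
	contentDone bool
	contentDir  string
}

func (w *workspaceState) initContent(ctx context.Context, restore func(ctx context.Context, dir string) error) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if err := restore(ctx, w.contentDir); err != nil {
		return err
	}
	w.contentDone = true
	return nil
}

// wipe stops handlers first and removes the content directory last; taking
// the same mutex guarantees it never interleaves with initContent.
func (w *workspaceState) wipe(stopHandlers func()) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	stopHandlers()
	w.contentDone = false
	return os.RemoveAll(w.contentDir)
}
```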

geropl · Nov 15 '24 12:11

/unhold

geropl · Nov 15 '24 12:11

The approach in this PR is the "wiping mode" as you found out, where first all running handlers etc. are stopped, and finally all content directories are removed. The operations are synchronized on multiple levels (e.g. one specifically for content-init), so this should* not happen.

content-init is a special case that executes in another process and has no related context to control (i.e. the context is the context of ws-daemon).

iQQBot · Nov 15 '24 12:11

content-init is a special case that executes in another process and has no related context to control (i.e. the context is the context of ws-daemon).

But content-init happens synchronously. And all workspace-state-related changes are handled by workspace_controller.go, using the k8s controller framework, which, as I understand it, handles one event at a time (i.e. it takes care of the synchronization).
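As background for that last point, a minimal wiring sketch (stand-in reconciler and watched type, not gitpod's code): controller-runtime's workqueue never hands the same object key to two workers at once, and with the default MaxConcurrentReconciles of 1 all reconciles run strictly one at a time.

```go
package manager

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// demoReconciler is a stand-in, not gitpod's reconciler: it only exists to
// show the wiring. The workqueue deduplicates keys and never reconciles the
// same object concurrently; MaxConcurrentReconciles of 1 (the default)
// additionally serializes all events through a single worker.
type demoReconciler struct{}

func (r *demoReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func setupDemoController(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}). // illustration only; the real controller watches the Workspace CRD
		WithOptions(controller.Options{MaxConcurrentReconciles: 1}).
		Complete(&demoReconciler{})
}
```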

geropl · Nov 15 '24 12:11