
[WIP][ws-manager] Re-create workspace pods on rejection

Open geropl opened this issue 1 year ago • 1 comment

Description

This PR enables ws-manager to re-create a workspace pod when it detects that the pod is stuck.

The root cause is an unfixed race condition in kubelet: node labels are out of sync, and suddenly the Pod, although already scheduled, no longer fits onto the node and is stuck. It shows up specifically on AWS (and also on GCP!) and, from our limited investigation, has a ~1% chance of occurring.
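For illustration, a hedged sketch (not the actual status.go change) of how such a rejection surfaces in the pod status: the kubelet fails the already scheduled pod at admission time, before any container starts.

```go
// Hedged sketch, not the actual status.go code: when the kubelet refuses to
// admit an already scheduled pod (e.g. because node labels are out of sync),
// it fails the pod with a reason such as "NodeAffinity" or "OutOfcpu" before
// any container starts. The exact reason strings below are an assumption.
package manager

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

func isPodRejected(pod *corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodFailed {
		return false
	}
	return pod.Status.Reason == "NodeAffinity" ||
		strings.HasPrefix(pod.Status.Reason, "OutOf")
}
```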

This change includes:

  • status.go detects the specific failure reason and marks the workspace with a new condition, PodRejected
    • the usual mechanism for a failed workspace pod is employed: it is deleted
  • workspace_controller.go gains a new branch (see the sketch after this list): if a workspace has the PodRejected: true condition and 0 pods, it:
    • aborts with a failed workspace if the retry limit is exceeded
    • resets workspace.status
    • resets metrics
    • increments PodRecreated
      • this in turn makes the controller run the regular "start workspace" phase again on its next pass
    • re-queues the workspace (with a configurable timeout)
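A minimal sketch of that controller branch, with stand-in types and names rather than the real workspace_controller.go:

```go
// Hedged sketch of the PodRejected branch described above; the types and
// field names are stand-ins, not gitpod's real Workspace CRD or reconciler.
package manager

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

type workspaceStatus struct {
	PodRejected  bool   // condition set by status.go when the kubelet rejected the pod
	PodRecreated int    // counter incremented on every re-creation attempt
	Phase        string // "Failed" once the retry limit is exceeded
}

type podRejectionConfig struct {
	MaxRetries     int
	RequeueTimeout time.Duration
}

// handlePodRejected runs once the rejected pod has been deleted: it resets the
// status so the next reconcile enters the regular "start workspace" phase,
// resets metrics, bumps the retry counter, and re-queues the workspace.
func handlePodRejected(_ context.Context, status *workspaceStatus, podCount int, cfg podRejectionConfig, resetMetrics func()) (ctrl.Result, error) {
	if !status.PodRejected || podCount != 0 {
		return ctrl.Result{}, nil
	}
	if status.PodRecreated >= cfg.MaxRetries {
		status.Phase = "Failed" // above the retry limit: abort with a failed workspace
		return ctrl.Result{}, nil
	}
	recreated := status.PodRecreated + 1
	*status = workspaceStatus{PodRecreated: recreated}
	resetMetrics()
	return ctrl.Result{RequeueAfter: cfg.RequeueTimeout}, nil
}
```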

TODO:

  • [x] add happy-path test
  • [ ] manual tests
    • [x] regular workspace starts
    • [ ] prebuild workspace starts
    • [ ] regular workspace + forced "pod rejection"
  • [ ] add "hit max retries" test
  • [x] add "WorkspaceStatus" test
  • [ ] do a load test, as we saw the CPU variant on GCP as well (docs and docs)
  • [x] discuss with folks to make sure we don't miss any edge cases

Related Issue(s)

Fixes ENT-530

How to test

Check that the tests are sensible and :green_circle:

Manual: regular workspace

Manual: prebuild workspace

Documentation

Preview status

gitpod:summary

Build Options

Build
  • [ ] /werft with-werft Run the build with werft instead of GHA
  • [ ] leeway-no-cache
  • [ ] /werft no-test Run Leeway with --dont-test
Publish
  • [ ] /werft publish-to-npm
  • [ ] /werft publish-to-jb-marketplace
Installer
  • [ ] analytics=segment
  • [ ] with-dedicated-emulation
  • [ ] workspace-feature-flags Add desired feature flags to the end of the line above, space separated
Preview Environment / Integration Tests
  • [ ] /werft with-local-preview If enabled this will build install/preview
  • [ ] /werft with-preview
  • [ ] /werft with-large-vm
  • [x] /werft with-gce-vm If enabled this will create the environment on GCE infra
  • [x] /werft preemptible Saves cost. Untick this only if you're really sure you need a non-preemptible machine.
  • [ ] with-integration-tests=all Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh. If enabled, with-preview and with-large-vm will be enabled.
  • [ ] with-monitoring

/hold

geropl · Sep 26 '24 08:09

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] · Oct 18 '24 15:10

Still work in progress

geropl · Oct 24 '24 09:10

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] · Nov 04 '24 15:11

@iQQBot I added the Loom for the load test (incl. the rejector running, i.e. "breaking" every workspace pod once), and addressed all the other comments. Would be great if you could review, and do a manual smoke test on the preview (once it's there again) to confirm that the "normal case" still works as expected.

Two things I still need to test (tomorrow):

  • have dev/rejector running against the preview
  • manually (+ via prebuild) start workspaces
  • check that the UX does not break :+1:

Once that's done, we can merge. :slightly_smiling_face:

geropl · Nov 14 '24 16:11

Done testing! :partying_face: :tada:

Added a small logging improvement, otherwise just need :heavy_check_mark: to :ship: !

geropl · Nov 15 '24 10:11

A question: If content-init has already started and the pod is re-scheduled onto the same node, and the content is big enough, what will happen? Is there a possibility of a data overlap issue?

The approach in this PR is the "wiping mode" as you found out, where first all running handlers etc. are stopped, and finally all content directories are removed. The operations are synchronized on multiple levels (e.g. one specifically for content-init), so this should* not happen.

*: saying "should", because the ws-daemon code was surprisingly complex, as we have three layers of feedback loops (k8s workspace, k8s pod, containerd) we have to synchronize. There still might be a hole somewhere, but after the extensive testing, I'm absolutely sure we are talking about 0.01% here, instead of the 1% we have atm.
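To make "synchronized on multiple levels" a bit more concrete, here is a minimal sketch of the idea (invented names, not the real ws-daemon code): a per-workspace lock serializes content-init against the wipe, which stops handlers first and removes the content directory last.

```go
// Hedged sketch, not ws-daemon's real structure: a per-workspace mutex
// serializes content-init and the wipe, so a wipe that races a late
// content-init either waits for it or runs strictly after it.
package daemon

import (
	"context"
	"os"
	"sync"
)

type workspaceState struct {
	mu          sync.Mutex
	contentDone bool
	contentDir  string
}

func (w *workspaceState) initContent(ctx context.Context, restore func(ctx context.Context, dir string) error) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if err := restore(ctx, w.contentDir); err != nil {
		return err
	}
	w.contentDone = true
	return nil
}

// wipe stops handlers first and removes the content directory last; taking
// the same mutex guarantees it never interleaves with initContent.
func (w *workspaceState) wipe(stopHandlers func()) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	stopHandlers()
	w.contentDone = false
	return os.RemoveAll(w.contentDir)
}
```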

geropl · Nov 15 '24 12:11

/unhold

geropl · Nov 15 '24 12:11

The approach in this PR is the "wiping mode" as you found out, where first all running handlers etc. are stopped, and finally all content directories are removed. The operations are synchronized on multiple levels (e.g. one specifically for content-init), so this should* not happen.

content-init is a special case that executes in another process and has no related context to control (i.e. the context is the context of ws-daemon).

iQQBot · Nov 15 '24 12:11

content-init is a special case that executes in another process and has no related context to control (i.e. the context is the context of ws-daemon).

But content-init happens synchronously. And all workspace-state-related changes are handled by workspace_controller.go, using the k8s controller framework, which, as I understand it, handles one event at a time (i.e. it takes care of the synchronization).
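As background for that last point, a minimal wiring sketch (stand-in reconciler and watched type, not gitpod's code): controller-runtime's workqueue never hands the same object key to two workers at once, and with the default MaxConcurrentReconciles of 1 all reconciles run strictly one at a time.

```go
package manager

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// demoReconciler is a stand-in, not gitpod's reconciler: it only exists to
// show the wiring. The workqueue deduplicates keys and never reconciles the
// same object concurrently; MaxConcurrentReconciles of 1 (the default)
// additionally serializes all events through a single worker.
type demoReconciler struct{}

func (r *demoReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func setupDemoController(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}). // illustration only; the real controller watches the Workspace CRD
		WithOptions(controller.Options{MaxConcurrentReconciles: 1}).
		Complete(&demoReconciler{})
}
```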

geropl · Nov 15 '24 12:11