[WIP][ws-manager] Re-create workspace pods on rejection
Description
This PR enables ws-manager to re-create a workspace pod when it detects that the pod is stuck.
The root cause is an unfixed race condition in kubelet: node labels get out of sync, and suddenly the Pod, although already scheduled, no longer fits onto the node and gets stuck. It specifically shows up on AWS (and also GCP!), and has a ~1% chance of occurring (based on our limited investigation).
This change includes (a rough sketch of this flow follows the list):
- status.go detects the specific failure reason, and marks the workspace with a new condition `PodRejected`
- the usual mechanic for a failed workspace pod is employed: it is deleted
- workspace_controller.go has a new condition: if a workspace has the `PodRejected: true` condition and `0` pods, it:
  - if above the retry limit: aborts with "failed"
  - resets `workspace.status`
  - resets metrics
  - increments `PodRecreated`
  - this in turn triggers the controller on the next run to go through the regular "start workspace" phase again
  - re-queues the workspace (with a configurable timeout)
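A minimal sketch of that flow, assuming illustrative type, field, constant, and reason names (the real identifiers, retry limit, and backoff in ws-manager may differ):

```go
// Hypothetical sketch only: Workspace/WorkspaceStatus and all constants below
// are illustrative stand-ins, not the actual ws-manager types or values.
package sketch

import (
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

type WorkspaceStatus struct {
	Phase        string
	PodRejected  bool // condition set by status.go
	PodRecreated int  // how often the pod has been re-created so far
}

type Workspace struct {
	Status WorkspaceStatus
}

const (
	maxPodRecreations    = 3                // assumption: the retry limit is configurable
	podRecreationBackoff = 10 * time.Second // assumption: the re-queue timeout is configurable
)

// podWasRejected approximates the status.go side: a kubelet rejection surfaces
// as a Failed pod with a reason such as "OutOfcpu" or "NodeAffinity" (the exact
// reason strings matched in status.go are illustrative here).
func podWasRejected(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodFailed &&
		(strings.HasPrefix(pod.Status.Reason, "OutOf") || pod.Status.Reason == "NodeAffinity")
}

// handlePodRejection approximates the workspace_controller.go side: once the
// rejected pod has been deleted (0 pods left), either abort or reset the
// workspace so the next reconcile goes through the regular "start workspace"
// phase again.
func handlePodRejection(ws *Workspace, podCount int) ctrl.Result {
	if !ws.Status.PodRejected || podCount != 0 {
		return ctrl.Result{}
	}
	if ws.Status.PodRecreated >= maxPodRecreations {
		ws.Status.Phase = "Failed" // above the retry limit: give up
		return ctrl.Result{}
	}
	recreated := ws.Status.PodRecreated + 1
	ws.Status = WorkspaceStatus{PodRecreated: recreated} // reset status, keep the counter
	// per-workspace metrics would be reset here as well
	return ctrl.Result{RequeueAfter: podRecreationBackoff}
}
```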
TODO:
- [x] add happy-path test
- [ ] manual tests
- [x] regular workspace starts
- [ ] prebuild workspace starts
- [ ] regular workspace + forced "pod rejection"
- [ ] add "hit max retries" test
- [x] add "WorkspaceStatus" test
- [ ] do a load test, as we saw the CPU variant on GCP as well (docs and docs)
- [x] discuss with folks to make sure we don't miss any edge cases
Related Issue(s)
Fixes ENT-530
How to test
Check tests are sensible and :green_circle:
Manual: regular workspace
Manual: prebuild workspace
Documentation
Preview status
gitpod:summary
Build Options
Build
- [ ] /werft with-werft Run the build with werft instead of GHA
- [ ] leeway-no-cache
- [ ] /werft no-test
Run Leeway with `--dont-test`
Publish
- [ ] /werft publish-to-npm
- [ ] /werft publish-to-jb-marketplace
Installer
- [ ] analytics=segment
- [ ] with-dedicated-emulation
- [ ] workspace-feature-flags Add desired feature flags to the end of the line above, space separated
Preview Environment / Integration Tests
- [ ] /werft with-local-preview
If enabled this will build `install/preview`
- [ ] /werft with-preview
- [ ] /werft with-large-vm
- [x] /werft with-gce-vm If enabled this will create the environment on GCE infra
- [x] /werft preemptible Saves cost. Untick this only if you're really sure you need a non-preemptible machine.
- [ ] with-integration-tests=all
Valid options are `all`, `workspace`, `webapp`, `ide`, `jetbrains`, `vscode`, `ssh`. If enabled, `with-preview` and `with-large-vm` will be enabled.
- [ ] with-monitoring
/hold
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Still work in progress
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
New dependencies detected. Learn more about Socket for GitHub ↗︎
| Package | New capabilities | Transitives | Size | Publisher |
|---|---|---|---|---|
| golang/k8s.io/[email protected] | unsafe | 0 | 23.9 MB | |
| golang/k8s.io/[email protected] | environment, filesystem, network, shell, unsafe | 0 | 4.17 MB | |
| golang/k8s.io/[email protected] | environment, filesystem, network, shell, unsafe | 0 | 13.9 MB | |
@iQQBot I added the Loom for the load test (incl. the rejector running, so "breaking" every workspace pod once), and addressed all the other comments. Would be great if you could review, and do a manual smoke test on the preview (once it's there again) to confirm the "normal case" still works as expected.
Two things I still need to test (tomorrow):
- have `dev/rejector` running against the preview
- manually (+ via prebuild) start workspaces
  - check that the UX does not break :+1:
Once that's done, we can merge. :slightly_smiling_face:
Done testing! :partying_face: :tada:
Added a small logging improvement, otherwise just need :heavy_check_mark: to :ship: !
A question: if content-init has already started and the pod is re-scheduled onto the same node, and the content is big enough, what will happen? Is there a possibility of a data overlap issue?
The approach in this PR is the "wiping mode" as you found out, where first all running handlers etc. are stopped, and finally all content directories are removed. The operations are synchronized on multiple levels (e.g. one for content-init), so this should* not happen.
*: saying "should", because the ws-daemon code was surprisingly complex, as we have three layers of feedback loops (k8s workspace, k8s pod, containerd) we have to synchronize. There still might be a hole somewhere, but after the extensive testing, I'm absolutely sure we are talking about 0.01% here, instead of the 1% we have atm.
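To make the synchronization point a bit more concrete, here is a purely illustrative sketch of the per-workspace serialization idea; the helper names are hypothetical and this is not the actual ws-daemon code:

```go
// Purely illustrative: serialize content-init and the "wiping mode" per
// workspace instance so they cannot overlap. Helper names are hypothetical.
package sketch

import "sync"

// one mutex per workspace instance ID
var workspaceLocks sync.Map

func lockFor(instanceID string) *sync.Mutex {
	m, _ := workspaceLocks.LoadOrStore(instanceID, &sync.Mutex{})
	return m.(*sync.Mutex)
}

// initContent and wipeWorkspace take the same per-workspace lock, so a wipe
// triggered by a pod rejection waits for a running content-init (and vice
// versa) instead of racing it.
func initContent(instanceID string, doInit func() error) error {
	mu := lockFor(instanceID)
	mu.Lock()
	defer mu.Unlock()
	return doInit()
}

func wipeWorkspace(instanceID string, stopHandlers, removeContent func() error) error {
	mu := lockFor(instanceID)
	mu.Lock()
	defer mu.Unlock()
	if err := stopHandlers(); err != nil { // first stop all running handlers
		return err
	}
	return removeContent() // then remove the content directories
}
```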
/unhold
> The approach in this PR is the "wiping mode" as you found out, where first all running handlers etc. are stopped, and finally all content directories are removed. The operations are synchronized on multiple levels (e.g. one for content-init), so this should* not happen.

content-init is a special case that executes in another process and has no related context to control (i.e. the context is the context of ws-daemon).

> content-init is a special case that executes in another process and has no related context to control (i.e. the context is the context of ws-daemon).

But content-init happens synchronously. And all workspace-state related changes are handled by workspace_controller.go, using the k8s controller framework, which (as I understand it) handles one event at a time per object, i.e. it takes care of the synchronization.
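For reference, a minimal wiring sketch of where that per-object serialization comes from in controller-runtime; the Workspace type and reconciler below are stand-ins, not the actual ws-manager code:

```go
// Illustrative only: a stand-in Workspace CRD type and reconciler, to show the
// controller-runtime wiring. The workqueue behind the controller never hands
// the same object to two workers at once, so events for one Workspace are
// handled sequentially even if MaxConcurrentReconciles were raised above 1.
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// Workspace stands in for the real CRD type.
type Workspace struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
}

func (w *Workspace) DeepCopyObject() runtime.Object {
	out := *w
	w.ObjectMeta.DeepCopyInto(&out.ObjectMeta)
	return &out
}

type WorkspaceReconciler struct{}

// Reconcile sees at most one event for a given Workspace at a time.
func (r *WorkspaceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func SetupWithManager(mgr ctrl.Manager, r *WorkspaceReconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&Workspace{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 1}).
		Complete(r)
}
```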