KOTS: stop running workspaces prior to upgrading existing workspace for single cluster ref arch
Is your feature request related to a problem? Please describe
We do not support live upgrades for the single cluster ref arch while workspaces are running.
Describe the behaviour you'd like
Before KOTS begins a deployment:
- Prompt the user to confirm it is okay to proceed with the deploy to an existing cluster, and note that this should be done during an outage window planned with their business.
- Stop workspaces, wait for them to backup/terminate; `kubectl delete pods -l component=workspace` may suffice (see the sketch below)
- Then deploy Gitpod to the cluster (the assumption is KOTS deletes existing resources and then recreates them)
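A minimal sketch of what that stop-and-wait step could look like with client-go. The `component=workspace` label selector comes from the command above; the package name, function name, namespace handling, and timeout are illustrative assumptions, not the actual KOTS/installer code.

```go
package upgrade

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// stopWorkspaces deletes all workspace pods and blocks until they are gone,
// giving ws-daemon the chance to back each workspace up before the upgrade.
func stopWorkspaces(ctx context.Context, client kubernetes.Interface, namespace string) error {
	selector := metav1.ListOptions{LabelSelector: "component=workspace"}

	// Equivalent of `kubectl delete pods -l component=workspace`: graceful
	// deletion, no --force and no zero grace period.
	if err := client.CoreV1().Pods(namespace).DeleteCollection(ctx, metav1.DeleteOptions{}, selector); err != nil {
		return fmt.Errorf("deleting workspace pods: %w", err)
	}

	// Poll until every workspace pod (regular, prebuild, imagebuild) has terminated.
	return wait.PollImmediate(5*time.Second, 10*time.Minute, func() (bool, error) {
		pods, err := client.CoreV1().Pods(namespace).List(ctx, selector)
		if err != nil {
			return false, err
		}
		return len(pods.Items) == 0, nil
	})
}
```

Keeping the deletion graceful matters here; forcing it would defeat the point of waiting for the backup.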
Additionally, as part of the monthly release cycle, a self-hosted test should be added, so that the upgrade flow with running workspaces is included as part of the testing.
Describe alternatives you've considered
N/A, this removes friction from the upgrade experience.
Additional context
The deploy process should not start in a live cluster while workspaces are running.
As of the August KOTS release, when a deploy is done to an existing cluster, the existing resources are deleted. However, because ws-daemon was deleted, the workspaces could not back up, and thus could not be deleted. Therefore, it is imperative that we wait for workspace pods to be deleted (including imagebuild and prebuild) before deleting the Gitpod installation.
Customers that experience this issue will incur data loss, and to clean up the stuck pods they must remove the related finalizer from the regular and prebuild workspaces.
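For that manual clean-up step, a hedged sketch of what removing the finalizers looks like with client-go (reusing the setup from the sketch above plus `k8s.io/apimachinery/pkg/types`; the pod name is a placeholder, and clearing all finalizers is shown only for brevity):

```go
// clearFinalizers unblocks a pod stuck in Terminating after its backup was lost.
// Last-resort clean-up only: the workspace data is already gone at this point.
func clearFinalizers(ctx context.Context, client kubernetes.Interface, namespace, podName string) error {
	// JSON merge patch that drops every finalizer from the pod's metadata.
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	_, err := client.CoreV1().Pods(namespace).Patch(ctx, podName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```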
@lucasvaltl @corneliusludmann may we ask for your help in treating this as a priority for the September release?
cc: @aledbf @atduarte
Prompt the user to confirm it is okay to proceed with the deploy to an existing cluster, and note that this should be done during an outage window planned with their business.
I'm afraid we are quite limited regarding the KOTS UX and cannot ask the user. @MrSimonEmms any ideas?
We cannot add a "this is the impact" type message, but there is always a confirmation before the deployment is made (unless they have auto-deployments configured). Documenting the impact in the Gitpod docs is the only option.
Am I right in thinking that the reason for stopping the workspaces is to enforce the workspaces to backup to the storage?
Suggestions
- I'd also suggest that, rather than using `kubectl delete`, this is written as part of the Golang binary. I've just spent a lot of time removing as much as we can from the bash script, so we should be wary of that.
Questions
- What happens to a workspace that's started before the upgrade process is completed? I can imagine that, as soon as they see the workspace stopping, users will almost instantly trigger a new workspace regardless of whether the upgrade process has finished. If it's the same workspace, is there any danger of those backed-up files being lost?
My idea here for an absolute skateboard would be to add a preflight check (should be top of the list of preflight checks in the UI) to check for running workspaces. If workspaces are running, the check should fail and point to the (new) documentation page around stopping workspaces in this PR.
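The logic behind such a check is small; a sketch of it in Go follows (KOTS pre-flight checks are normally declared in the release's preflight spec rather than written in Go, so treat this only as the condition being checked; the function name is illustrative and the setup matches the earlier sketch):

```go
// checkNoRunningWorkspaces fails while any workspace, prebuild, or imagebuild pod
// still exists, so the admin is pointed at the stop-workspaces documentation
// before the upgrade can proceed.
func checkNoRunningWorkspaces(ctx context.Context, client kubernetes.Interface, namespace string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: "component=workspace"})
	if err != nil {
		return err
	}
	if n := len(pods.Items); n > 0 {
		return fmt.Errorf("%d workspace pod(s) still running - stop all workspaces before upgrading", n)
	}
	return nil
}
```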
Am I right in thinking that the reason for stopping the workspaces is to enforce the workspaces to backup to the storage?
Yes. Otherwise, the workspaces will continue to run, KOTS will delete the gitpod installation (including ws-daemon), and those running workspaces will never have their data backed up, resulting in data loss and :crying_cat_face: users.
What happens to a workspace that's started before the upgrade process is completed? I can imagine that, as soon as they see the workspace stopping, users will almost instantly trigger a new workspace regardless of whether the upgrade process has finished. If it's the same workspace, is there any danger of those backed-up files being lost?
I'm working on a test for this, @MrSimonEmms, where basically we want to prevent users from starting workspaces during outage windows for updates.
Options:
- Ideally we'd use `gpctl` to update the cluster score to 0, or cordon it, so we do not try sending workspace starts to it
- Another option may be to `kubectl scale --replicas=0 deployment/ws-manager -n gitpod` (sketched below), but the UX is poor here because it doesn't fail fast; however, it might be a good short-term solution
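For reference, the second option roughly translates to the following with client-go (same assumed setup as the earlier sketches; deployment name and namespace follow the `kubectl scale` command above):

```go
// scaleDownWSManager sets ws-manager to zero replicas so no new workspaces can be
// scheduled while the upgrade runs. Scale it back up once the deploy has finished.
func scaleDownWSManager(ctx context.Context, client kubernetes.Interface, namespace string) error {
	scale, err := client.AppsV1().Deployments(namespace).GetScale(ctx, "ws-manager", metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas = 0
	_, err = client.AppsV1().Deployments(namespace).UpdateScale(ctx, "ws-manager", scale, metav1.UpdateOptions{})
	return err
}
```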
For awareness, I've created https://github.com/gitpod-io/gitpod/issues/13150, because we cannot easily test in our preview environments, due to the cluster name showing up as an empty string.
If workspaces are running, the check should fail and point to the (new) documentation page around stopping workspaces in https://github.com/gitpod-io/website/pull/2766.
@lucasvaltl That will help for workspaces that are running before the upgrade is attempted, however, we also need to put the Gitpod installation into a state where it doesn't allow users to try starting workspaces...otherwise they'll have a poor experience during the upgrade.
Thanks for the clarification @kylos101. I agree with @lucasvaltl's earlier comment of having a 🛹 and then bringing this additional stuff into it. From experience, upgrades tend to only take a couple of minutes to run - if it's done immediately before the helm upgrade command, a user will likely not be able to start a workspace quickly enough for it to be a problem in most cases.
@MrSimonEmms do we prompt the user to see which ref arch they're using? If they're using the single cluster ref arch, and there are running workspaces, it would be great if the deploy process can hard fail, sharing that workspaces are currently running.
In other words, my understanding is that the pre-flight checks are soft, and can be ignored. I'd hate for an administrator to shoot themselves in the foot, and cause users to lose data.
@kylos101 No, the only prompt is a big "deploy" button - they can choose to skip the pre-flight checks, where there's another "we don't recommend this - it may break things" alert. Again, we don't have any control over this content or whether they can skip it.
The idea is that the pre-flight checks are idempotent and that a change only happens when they click "deploy".
@lucasvaltl That will help for workspaces that are running before the upgrade is attempted, however, we also need to put the Gitpod installation into a state where it doesn't allow users to try starting workspaces...otherwise they'll have a poor experience during the upgrade.
@kylos101 Fair! What I proposed at least lessens the pain. If we can also get the installation into a state where new workloads cannot be started - all the better. Was just not sure if we can get something done for this in a reasonable timeframe :)
@kylos101 this command will also stop any running image builds - I presume that is a desired effect of this?
@kylos101 I've had a play and created a draft PR at #13125. Unfortunately, on the app I'm testing, the workspace pod seems to be stuck on Terminating.
I presume that if I were to use `--force` or `--grace-period`, there's a danger that the workspaces will not back up properly. Is there any safe way I can avoid the workspace termination getting stuck?
@kylos101 this command will also stop any running image builds - I presume that is a desired effect of this?
@MrSimonEmms Yes sir, that is the desired effect.
I presume that if I were to use `--force` or `--grace-period`, there's a danger that the workspaces will not back up properly. Is there any safe way I can avoid the workspace termination getting stuck?
It is not desirable to use `--force` or `--grace-period`.
What type of workspace were you testing with? Regular, prebuild, imagebuild?
@MrSimonEmms Yes sir, that is the desired effect.
@kylos101 thanks for clarifying.
What type of workspace were you testing with? Regular, prebuild, imagebuild?
Regular workspace this time, but I've seen that behaviour on all types of workspace.
It's one of those funny things that I've found over the years that if you run `kubectl delete pods <workspace>` it often hangs. It's not a problem when you're on a test instance and you just want to kill it, but it's a different thing if you're doing it programmatically on EVERY instance out there.
I've done some more investigation on this and can confirm that `gpctl workspaces stop` WILL work - eventually...
The problem is that gpctl only authenticates via a kubeconfig file. When running normally, that's fine. However inside a pod, we don't have a kubeconfig file as we're authenticating as a service account.
The refactored Installer has an `authClusterOrKubeconfig` function, which (as the name implies) allows authentication via the supplied kubeconfig file or via detection of the service account.
Once that's in, we can stop workspaces using gpctl
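For anyone following along, the kubeconfig-or-service-account fallback usually boils down to something like this with client-go (a sketch of the pattern only, not the actual Installer code; the function name is illustrative):

```go
import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// restConfig prefers an explicit kubeconfig and falls back to the pod's service
// account when running in-cluster, which is what an authClusterOrKubeconfig-style
// helper effectively does.
func restConfig(kubeconfigPath string) (*rest.Config, error) {
	if kubeconfigPath != "" {
		return clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	}
	return rest.InClusterConfig()
}
```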
And it would be very helpful if the `gpctl workspaces` command received a `--namespace` flag.
EDIT: I may have found a workaround which I'm testing
I figured out how to use the gpctl function in a service account authorised environment. The deletion of workspaces now happens immediately before deployment and is in a function controlled by @gitpod-io/engineering-workspace
it would be very helpful if the `gpctl workspaces` command received a `--namespace` flag
@MrSimonEmms Could you please open an issue for this? Or would you like to open a PR to enhance it? That would be great. Thank you.
@jenting I opened #13329 and #13330 last night, but closed them as not required because I found a workaround.
If you want to reopen them and work on them, please do, but it's not urgent any more.