[ws-daemon] Unexpected cannot find workspace during DisposeWorkspace
Bug description
Steps to reproduce
Workspace affected
No response
Expected behavior
If it's not really an error condition, then change this to be a warning. The warning would essentially be an indicator that we're calling dispose two or more times, and that the first call already did the disposal.
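A minimal sketch of that idea, assuming an in-memory store keyed by instance ID (the store type and log calls below are illustrative, not ws-daemon's actual code):

```go
// Hypothetical sketch: downgrade "cannot find workspace" to a warning on a repeat dispose.
package main

import (
	"log"
	"sync"
)

type store struct {
	mu         sync.Mutex
	workspaces map[string]struct{}
}

// dispose removes the workspace if present. A second call finds nothing,
// which is expected rather than exceptional, so it only warns.
func (s *store) dispose(instanceID string) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if _, ok := s.workspaces[instanceID]; !ok {
		// Already disposed (or never existed): warn instead of erroring.
		log.Printf("warn: cannot find workspace %s during dispose; likely disposed already", instanceID)
		return
	}
	delete(s.workspaces, instanceID)
	log.Printf("info: disposed workspace %s", instanceID)
}

func main() {
	s := &store{workspaces: map[string]struct{}{"abc": {}}}
	s.dispose("abc") // first call removes the workspace
	s.dispose("abc") // second call only warns
}
```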
Example repository
No response
Anything else?
When does this error happen? Is it really an error? Is there anything else we can log with it to determine the state of the workspace on the node (e.g. does /workspace still exist)?
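For the second question, one thing we could log is whether the workspace's content directory still exists on the node. A sketch, assuming a layout of workingArea/instanceID (that path layout is an assumption, not ws-daemon's actual one):

```go
// Hypothetical sketch of extra context to attach to the log entry:
// whether the workspace's content directory is still on disk.
package main

import (
	"log"
	"os"
	"path/filepath"
)

func logDisposeState(workingArea, instanceID string) {
	dir := filepath.Join(workingArea, instanceID) // assumed location of the workspace content
	_, err := os.Stat(dir)
	switch {
	case err == nil:
		log.Printf("cannot find workspace %s in store, but %s still exists on disk", instanceID, dir)
	case os.IsNotExist(err):
		log.Printf("cannot find workspace %s in store and %s is gone; likely already disposed", instanceID, dir)
	default:
		log.Printf("cannot find workspace %s in store; stat %s failed: %v", instanceID, dir, err)
	}
}

func main() {
	logDisposeState("/tmp/workspaces", "eb3d43eb-18bc-47fe-83b0-0adf388dcfb8")
}
```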
👋 @jenting , may I ask you to focus on https://github.com/gitpod-io/gitpod/issues/11713, move this back to scheduled, and unassign yourself for now? This way you can focus on #11713. If you think there's a need to do both at the same time, can you share why? I'm not seeing the relationship, and am trying to limit work in progress.
👋 @jenting alternatively, if you already started #11710, please move #11713 back to scheduled, etc.
From the Jaeger tracing, the observation is that we call finalizeWorkspaceContent more than once at almost the same time (see the dedupe sketch after the two cases below).
case 1.
case 2.
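One way to make the duplicate call harmless would be to dedupe finalization per instance, so only the first of two near-simultaneous callers does the work. This is a sketch, not ws-manager's actual code; the finalizer type is made up for illustration:

```go
// Sketch: run finalization at most once per instance, even if two callers race.
package main

import (
	"fmt"
	"sync"
)

type finalizer struct {
	mu   sync.Mutex
	once map[string]*sync.Once
}

func (f *finalizer) finalizeWorkspaceContent(instanceID string) {
	f.mu.Lock()
	o, ok := f.once[instanceID]
	if !ok {
		o = &sync.Once{}
		f.once[instanceID] = o
	}
	f.mu.Unlock()

	o.Do(func() {
		// The actual backup/dispose work would happen here exactly once.
		fmt.Println("finalizing content for", instanceID)
	})
}

func main() {
	f := &finalizer{once: map[string]*sync.Once{}}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two near-simultaneous calls, as seen in the traces
		wg.Add(1)
		go func() {
			defer wg.Done()
			f.finalizeWorkspaceContent("cc69aed8-7cd1-4055-8f90-777f2281cc3b")
		}()
	}
	wg.Wait()
}
```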
I am still seeing this in us60; search the workspace logs for instanceId eb3d43eb-18bc-47fe-83b0-0adf388dcfb8 from August 9 for details.
In this case, the workspace has already been erased from ws-daemon's store. It was terminated normally, but the disposal was reported as canceled because the context timed out; in other words, it failed due to a timing issue rather than a real error.
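That timing issue follows the usual context-deadline pattern: the disposal work itself may be fine, but the caller's context expires first and the call is reported as failed. A self-contained illustration (the durations are placeholders, not measured values):

```go
// Illustration: a dispose call fails with "context deadline exceeded"
// because the caller's context times out before the work finishes.
package main

import (
	"context"
	"fmt"
	"time"
)

func dispose(ctx context.Context) error {
	select {
	case <-time.After(200 * time.Millisecond): // the disposal work
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	if err := dispose(ctx); err != nil {
		fmt.Println("dispose failed:", err) // prints "dispose failed: context deadline exceeded"
	}
}
```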
We need to check the Jaeger tracing on the gen63 cluster to see whether it still happens.
Unfortunately, this problem still happened on gen63 😭 e.g. cc69aed8-7cd1-4055-8f90-777f2281cc3b
Looking at cc69aed8-7cd1-4055-8f90-777f2281cc3b:
The INITIALIZING phase started at, say, 0 seconds and lasted 6.5s.
While it was still running, the pod went from INITIALIZING to STOPPING at +3 seconds.
At that point we start to dispose of the workspace while initializeWorkspace is still running and finishing its work. In this specific example the initialization actually failed, because by the time it got to talking to ws-daemon, we had already disposed of the workspace.
Hence a bunch of cannot find workspace errors.
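One mitigation would be for the dispose path to cancel and wait for the in-flight initialization before it removes any state, so the two can no longer race. A sketch under assumed names (initializeWorkspace and disposeWorkspace here are stand-ins, not the real implementations; the durations are scaled down from the 6.5s/3s seen in the trace):

```go
// Sketch: dispose cancels the in-flight initialization and waits for it
// to finish before removing state, so init can never hit a gone workspace.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type workspace struct {
	cancelInit context.CancelFunc
	initDone   chan struct{}
}

func initializeWorkspace(ctx context.Context, ws *workspace) {
	defer close(ws.initDone)
	select {
	case <-time.After(650 * time.Millisecond): // initialization work (scaled from ~6.5s)
		fmt.Println("initialization finished")
	case <-ctx.Done():
		fmt.Println("initialization canceled:", ctx.Err())
	}
}

func disposeWorkspace(ws *workspace) {
	ws.cancelInit() // stop the in-flight initialization ...
	<-ws.initDone   // ... and wait for it to acknowledge, so it cannot race with disposal
	fmt.Println("disposing workspace")
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	ws := &workspace{cancelInit: cancel, initDone: make(chan struct{})}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		initializeWorkspace(ctx, ws)
	}()

	time.Sleep(300 * time.Millisecond) // the pod moves to STOPPING partway through (scaled from +3s)
	disposeWorkspace(ws)
	wg.Wait()
}
```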
Edit: Another issue is that finalizeWorkspaceContent may get called twice in quick succession, and both calls will call DisposeWorkspace, with the second one getting cannot find workspace. Seems like we need to first check whether the workspace still exists. :thinking:
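If the second call is unavoidable, the caller could at least treat the error as benign. A sketch that assumes ws-daemon surfaces the error as a gRPC NotFound status (that mapping is an assumption; disposeOnDaemon is a hypothetical stand-in for the DisposeWorkspace RPC):

```go
// Sketch: treat "cannot find workspace" on the second dispose call as already-done.
package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// disposeOnDaemon stands in for the real DisposeWorkspace RPC; here it
// pretends the workspace is already gone.
func disposeOnDaemon(ctx context.Context, instanceID string) error {
	return status.Errorf(codes.NotFound, "cannot find workspace %s", instanceID)
}

func finalize(ctx context.Context, instanceID string) error {
	err := disposeOnDaemon(ctx, instanceID)
	if status.Code(err) == codes.NotFound {
		// The first call already disposed it; nothing left to do.
		fmt.Println("workspace already disposed, skipping:", instanceID)
		return nil
	}
	return err
}

func main() {
	_ = finalize(context.Background(), "cc69aed8-7cd1-4055-8f90-777f2281cc3b")
}
```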
Seems like we have an architectural problem due to the ability to run phases in parallel. :thinking:
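For completeness, a sketch of what serializing phase handling per workspace could look like, so an INITIALIZING handler and a STOPPING handler for the same instance never overlap (purely illustrative structure, not the current ws-manager design):

```go
// Sketch: a per-instance mutex so phase handlers for one workspace run sequentially.
package main

import (
	"fmt"
	"sync"
)

type phaseRunner struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (r *phaseRunner) lockFor(instanceID string) *sync.Mutex {
	r.mu.Lock()
	defer r.mu.Unlock()
	l, ok := r.locks[instanceID]
	if !ok {
		l = &sync.Mutex{}
		r.locks[instanceID] = l
	}
	return l
}

// handlePhase runs one phase handler at a time for a given instance,
// so a STOPPING handler cannot overlap an INITIALIZING one.
func (r *phaseRunner) handlePhase(instanceID, phase string, fn func()) {
	l := r.lockFor(instanceID)
	l.Lock()
	defer l.Unlock()
	fmt.Println("handling phase", phase, "for", instanceID)
	fn()
}

func main() {
	r := &phaseRunner{locks: map[string]*sync.Mutex{}}
	var wg sync.WaitGroup
	for _, phase := range []string{"INITIALIZING", "STOPPING"} {
		phase := phase
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.handlePhase("cc69aed8", phase, func() {})
		}()
	}
	wg.Wait()
}
```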