
[ws-daemon] Unexpected cannot find workspace during DisposeWorkspace

Open aledbf opened this issue 2 years ago • 5 comments

Bug description

(Jaeger-UI screenshot)

Steps to reproduce

Workspace affected

No response

Expected behavior

If it's not really an error condition, then change this to be a warning. The warning would essentially indicate that we're calling dispose two or more times, and that the first call already performed the disposal.
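A minimal sketch of what demoting the duplicate-dispose path to a warning could look like. This is not the actual ws-daemon code; the `store` type and its `Dispose` method are hypothetical stand-ins for ws-daemon's in-memory workspace store:

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical stand-in for ws-daemon's in-memory workspace store.
type store struct {
	mu         sync.Mutex
	workspaces map[string]struct{}
}

// Dispose removes the workspace from the store. A second call for the
// same instance is not treated as an error: it only logs a warning,
// because the first call already performed the disposal.
func (s *store) Dispose(instanceID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.workspaces[instanceID]; !ok {
		// Previously this path would return a "cannot find workspace" error.
		fmt.Printf("warn: dispose called again for %s; disposal already done\n", instanceID)
		return nil
	}
	delete(s.workspaces, instanceID)
	fmt.Printf("info: disposed %s\n", instanceID)
	return nil
}

func main() {
	s := &store{workspaces: map[string]struct{}{"ws-1": {}}}
	_ = s.Dispose("ws-1") // first call performs the disposal
	_ = s.Dispose("ws-1") // second call only warns
}
```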

Example repository

No response

Anything else?

When does this error happen? Is it really an error? Is there anything else we can log with it to determine the state of the workspace on the node (does /workspace still exist?)

aledbf avatar Jul 28 '22 00:07 aledbf

👋 @jenting , may I ask you to focus on https://github.com/gitpod-io/gitpod/issues/11713, move this back to scheduled, and unassign yourself for now? This way you can focus on #11713. If you think there's a need to do both at the same time, can you share why? I'm not seeing the relationship, and am trying to limit work in progress.

kylos101 avatar Aug 02 '22 14:08 kylos101

👋 @jenting alternatively, if you already started #11710, please move #11713 back to scheduled, etc.

kylos101 avatar Aug 02 '22 20:08 kylos101

From the Jaeger tracing, the observation is that finalizeWorkspaceContent is called twice at almost the same time.

Case 1: (two screenshots of the Jaeger trace)

Case 2: (two screenshots of the Jaeger trace)

jenting avatar Aug 03 '22 04:08 jenting

I am still seeing this in us60; for details, search the workspace logs from August 9 for instanceId eb3d43eb-18bc-47fe-83b0-0adf388dcfb8.

(screenshot)

kylos101 avatar Aug 10 '22 02:08 kylos101

In this case, the workspace had already been erased from ws-daemon's store. In other words, it terminated normally, but the disposal call was canceled by a timeout in the context, so it failed due to a timing issue. (screenshot)

> I am still seeing in us60, search workspace logs for instanceId eb3d43eb-18bc-47fe-83b0-0adf388dcfb8 for detail from August 9.

utam0k avatar Aug 10 '22 09:08 utam0k

We need to check the Jaeger tracing on the gen63 cluster to see whether it still happens.

jenting avatar Aug 25 '22 08:08 jenting

Unfortunately, this problem still happens on gen63 😭 e.g. cc69aed8-7cd1-4055-8f90-777f2281cc3b

utam0k avatar Aug 26 '22 03:08 utam0k

Looking at cc69aed8-7cd1-4055-8f90-777f2281cc3b: the INITIALIZING phase started at, say, 0 seconds and lasted 6.5s. While it was still running, the pod went from INITIALIZING to STOPPING at +3 seconds. At that point we start to dispose the workspace while initializeWorkspace is still running and finishing its work. In this specific example the disposal actually failed, because by the time it got to talking to ws-daemon, we had already disposed the workspace. Hence a bunch of cannot find workspace errors.

Edit: Another issue is that finalizeWorkspaceContent may get called twice one after another, and both calls invoke DisposeWorkspace, with the second call getting cannot find workspace. Seems like we need to first check whether the workspace even still exists. :thinking:

Seems like we have an architectural problem due to the ability to run phases in parallel. :thinking:

sagor999 avatar Aug 26 '22 20:08 sagor999