[ws-daemon] cannot find workspace during WaitForInit

Open aledbf opened this issue 2 years ago • 5 comments

Bug description

Screenshot from 2022-07-27 22-04-19

Steps to reproduce

Workspace affected

No response

Expected behavior

Log the error when it is encountered by the service (instead of just returning it), not only when consumers encounter it. Include the instance ID in the log message so that we have that context.

Example repository

No response

Anything else?

This is just the first step for now: get some insight into whether this condition is happening, with related context, so we can react later.

aledbf avatar Jul 28 '22 02:07 aledbf

@jenting 👋 hey bud, I'm not sure if you have an open PR for this, even draft is okay, can you link it to this issue?

kylos101 avatar Aug 02 '22 14:08 kylos101

> @jenting 👋 hey bud, I'm not sure if you have an open PR for this, even draft is okay, can you link it to this issue?

Nope, I haven't found the root cause. Unassigning myself.

jenting avatar Aug 03 '22 01:08 jenting

We added special handling in ws-manager for the gRPC NotFound error.

https://github.com/gitpod-io/gitpod/blob/e40e43d76120e5de702522e6b816f28b86a219c6/components/ws-manager/pkg/manager/monitor.go#L684-L698

jenting avatar Aug 03 '22 07:08 jenting

Looked through every instance of this in the last 12 hours, and > 90% of workspaces recover because we retry the initialization, as pointed out by @jenting above.

Furisto avatar Aug 03 '22 12:08 Furisto

Still in us60:

Image

Image

kylos101 avatar Aug 10 '22 01:08 kylos101

Still in us63

jenting avatar Aug 24 '22 13:08 jenting

One possible cause of this: https://github.com/gitpod-io/gitpod/issues/12357

sagor999 avatar Aug 24 '22 19:08 sagor999

@sagor999 assigning you because https://github.com/gitpod-io/gitpod/pull/12360 is not deployed yet

kylos101 avatar Aug 26 '22 16:08 kylos101

We still see this in the us64 cluster.

jenting avatar Sep 07 '22 07:09 jenting

Moved back to breakdown, since we're still seeing this in us64. We should talk in refinement about a strategy for how to proceed, and update the issue description, before moving this back to scheduled.

kylos101 avatar Sep 09 '22 03:09 kylos101

This is not an error any more. It is simply a byproduct of how gRPC logs its errors in tracing. When we call /wsdaemon.WorkspaceContentService/WaitForInit, we need to return the correct error: NotFound. This is normal behaviour. Unfortunately, it gets logged as an error even though NotFound is not considered an error here, since we have already disposed of the workspace. It is handled correctly by finalizeWorkspaceContent, which might get called multiple times, since we call it every time the pod sees any update to its state. I guess once we switch to wsman Mk2, we will be able to store the state in its own CRD object instead of storing it on the pod.

sagor999 avatar Sep 12 '22 21:09 sagor999

@sagor999 thank you for reopening! I removed the PR from the Workspace project, as the issue is already there and In-Progress.

kylos101 avatar Oct 04 '22 05:10 kylos101

Thank you @jenting for linking this issue and PR! :smile:

kylos101 avatar Oct 04 '22 05:10 kylos101