
The container could not be located when the pod was terminated

Open · utam0k opened this issue 2 years ago

Bug description

logs

This error message (The container could not be located when the pod was terminated) comes from the kubelet: https://github.com/kubernetes/kubernetes/blob/4aa451e8458a7cbf78ed464e9e47e87d424541ce/pkg/kubelet/kubelet_pods.go#L1810-L1817

Potentially related: https://github.com/kubernetes/kubernetes/issues/104107
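
For readers tracing this from the cluster side, here is a minimal, editorial sketch (not Gitpod or Kubernetes source) of how this kubelet-generated status can be spotted with client-go. The namespace, kubeconfig handling, and substring matching are assumptions for illustration only:

```go
// Editorial sketch: list pods and report containers whose terminated state
// carries the kubelet message discussed above. Namespace and kubeconfig
// handling are illustrative assumptions, not Gitpod code.
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "default" is a placeholder namespace; workspace pods live elsewhere.
	pods, err := client.CoreV1().Pods("default").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			term := cs.State.Terminated
			if term == nil {
				continue
			}
			// The kubelet records this message when it can no longer find the
			// container for a pod that is being terminated.
			if strings.Contains(term.Message, "could not be located when the pod was terminated") {
				fmt.Printf("pod %s: container %s terminated, reason=%q exitCode=%d\n",
					pod.Name, cs.Name, term.Reason, term.ExitCode)
			}
		}
	}
}
```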

Steps to reproduce

I don't know

Workspace affected

No response

Expected behavior

This error message does not appear in production.

Example repository

No response

Anything else?

No response

utam0k avatar Aug 10 '22 02:08 utam0k

@utam0k good find, I see this with gen60, too.

kylos101 avatar Aug 10 '22 02:08 kylos101

Still seeing this in gen63, cc @sagor999 for posterity. I realize you're working on other things, but I'm not sure if this is related. gen63 Logs.

kylos101 avatar Aug 25 '22 18:08 kylos101

Hm. I looked at one instance that had this error: 9e368af4-a27a-48d8-9c37-19dffb785c11 and inspected traces for it (eu63), but it did not show any errors. Furthermore, it showed that the workspace stopped correctly (according to traces). :thinking:

sagor999 avatar Aug 26 '22 00:08 sagor999

I found we can reproduce this issue with our integration test on the preview environment.

utam0k avatar Sep 01 '22 07:09 utam0k

I found we can reproduce this issue with our integration test on the preview environment.

How so @utam0k ?

kylos101 avatar Sep 04 '22 19:09 kylos101

I found we can reproduce this issue with our integration test on the preview environment.

@utam0k can you share? 🙏

kylos101 avatar Sep 09 '22 03:09 kylos101

I found we can reproduce this issue with our integration test on the preview environment.

@utam0k can you share? 🙏

Just run an integration test. However, I will include a fix to ignore this issue in the integration test.

utam0k avatar Sep 09 '22 05:09 utam0k

@utam0k do you mean temporarily ignoring it? For example, so we can incrementally have tests passing in the main branch while a fix is done for the failing test? Please share a permalink referencing the part you plan to ignore.

kylos101 avatar Sep 11 '22 23:09 kylos101

@kylos101 Sorry for the lack of information. I'm going to put this if statement into the main branch. Please note that the link I shared is not in the main branch. https://github.com/gitpod-io/gitpod/blob/4e824503da137d43ae901f37b972400254ca5b68/test/pkg/integration/workspace.go#L522-L523

utam0k avatar Sep 12 '22 23:09 utam0k

Got this error after a workspace timed out, running on a self-hosted preview env on Azure with a build from today.

jldec avatar Sep 30 '22 13:09 jldec

FYI, I tried to reproduce what @jldec experienced above on my (newly created, via Terraform) GCP environment. I set the global timeoutDefault to 5m and had 2 different workspaces time out successfully after the 5 minutes. That of course does not mean this never happens, but it does show the issue is flaky 🤔

lucasvaltl avatar Sep 30 '22 15:09 lucasvaltl

Another occurrence (Internal)

kylos101 avatar Oct 05 '22 20:10 kylos101

Ran into this again. Stopped two workspaces at the same time on an AWS Self-Hosted environment (2022.09rc6). One was fine, the other showed the "The container could not be located when the pod was terminated" error. @nandajavarma will link a support bundle with the logs. The workspace in question was: lucasvaltl-getrobotisla-clmsdohpu5r.ws.release-aws.[...]

lucasvaltl avatar Oct 07 '22 12:10 lucasvaltl

Here is the link to the support bundle for the release-aws cluster

nandajavarma avatar Oct 07 '22 12:10 nandajavarma

From all the reports above, the issue happened while the workspace was stopping. I'm wondering if the sequence is something like this (a sketch of how one might check this follows after this comment):

  • the ws-daemon backs up the content to GCS
  • the ws-daemon requests lots of resources from the node while backing up the content
  • the kubelet evicts the workspace pod because the node is out of resources
  • the workspace pod transitions into an unknown state, and the error message is reported
  • the ws-manager catches the workspace pod termination and shows the error message to the user

jenting avatar Oct 13 '22 08:10 jenting
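
A minimal editorial sketch of how this hypothesis could be checked, under the assumption that a kubelet resource-pressure eviction leaves the pod in phase Failed with reason "Evicted" and the node reporting pressure conditions. The function name, namespace handling, and clientset wiring are illustrative, not Gitpod code:

```go
package diagnose

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// wasEvictedForPressure reports whether a failed workspace pod was evicted by
// the kubelet and whether its node shows memory or disk pressure, which would
// support the backup-pressure hypothesis above.
func wasEvictedForPressure(ctx context.Context, client kubernetes.Interface, ns, podName string) (bool, error) {
	pod, err := client.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	// Kubelet-evicted pods end up in phase Failed with reason "Evicted".
	if pod.Status.Phase != corev1.PodFailed || pod.Status.Reason != "Evicted" {
		return false, nil
	}
	node, err := client.CoreV1().Nodes().Get(ctx, pod.Spec.NodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	// Check the node the workspace ran on for resource-pressure conditions.
	for _, cond := range node.Status.Conditions {
		if (cond.Type == corev1.NodeMemoryPressure || cond.Type == corev1.NodeDiskPressure) &&
			cond.Status == corev1.ConditionTrue {
			return true, nil
		}
	}
	return false, nil
}
```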

This is still an issue with gen73, recent logs.

@jenting one way to resolve this is to ship PVC and avoid backup with ws-daemon altogether; you're right that stopping workspaces incurs a heavy resource hit with ws-daemon.

I am going to remove this from our project for now (putting it back in the inbox) so that there is less work "in progress" and we can focus. :wave:

kylos101 avatar Oct 28 '22 17:10 kylos101

I've had the same error just now. When I tried to restart the workspace (gitpodio-gitpod-zjkqtz6c188) I got:

cannot pull image: rpc error: code = NotFound desc = failed to pull and unpack image "reg.ws-us73.gitpod.io:20000/remote/d584ec9e-7038-401f-b93b-51b164fcdb33:latest": 
failed to resolve reference "reg.ws-us73.gitpod.io:20000/remote/d584ec9e-7038-401f-b93b-51b164fcdb33:latest": reg.ws-us73.gitpod.io:20000/remote/d584ec9e-7038-401f-b93b-51b164fcdb33:latest: not found

Tried multiple times and I'm not able to start my workspace.

svenefftinge avatar Oct 31 '22 16:10 svenefftinge

Thanks for the report. But judging from the error log, this is a different symptom.

jenting avatar Nov 01 '22 04:11 jenting

@easyCZ just experienced this after 10 minutes in https://gitpodio-gitpod-cqng3518h09.ws-eu77.gitpod.io/, he was using VS Code Desktop for the IDE via the unstable channel.

workspace logs · webapp logs

kylos101 avatar Nov 28 '22 15:11 kylos101

@easyCZ it looks like:

  1. the IDE loop ended
  2. But I'm unsure why. Perhaps this is a starting point for debugging? @akosyakov @utam0k wdyt?
  3. FYI, I checked journalctl logs on the node too, but nothing promising was evident for this workspace. :cry:

kylos101 avatar Nov 28 '22 15:11 kylos101

FYI: this error can also be seen, rarely, in integration tests. We now ignore it so the integration tests pass stably. https://github.com/gitpod-io/gitpod/blob/116ea559bc0979227467c9e30c717e786c2bee97/test/pkg/integration/workspace.go#L733

My first concern is whether the backup is complete. How about checking that first? I feel that would change the priority of this issue.

utam0k avatar Nov 28 '22 23:11 utam0k
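
For readers following along, a hedged paraphrase of the kind of check the linked integration-test code performs (the actual code lives in test/pkg/integration/workspace.go; the function and constant names here are illustrative, not the real identifiers):

```go
package integration

import "strings"

// Message the kubelet reports when it loses track of a terminated container.
const containerNotLocatedMsg = "The container could not be located when the pod was terminated"

// isKnownFlakyStopError treats the kubelet's "container could not be located"
// message as a known flake so it does not fail an otherwise successful
// workspace stop in the integration tests.
func isKnownFlakyStopError(failedConditionMsg string) bool {
	return strings.Contains(failedConditionMsg, containerNotLocatedMsg)
}
```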

@utam0k @kylos101 Hey! @WVerlaek, @jenting, and I are doing the refinement and have a couple of questions (mentioned in the description). Do you know the answer to any of them?

atduarte avatar Jan 03 '23 08:01 atduarte

We see the log "The container could not be located when the pod was deleted. The container used to be Running" in the GCP logs (under the jsonPayload.conditions.failed field).

jenting avatar Jan 04 '23 05:01 jenting

This issue increases the metric workspace_stops_total{reason="failed"}, which affects our stop-workspace SLO.

We can find occurrences by comparing the Grafana dashboard with the GCP logs.

[screenshots: Grafana workspace_stops_total panel and the corresponding GCP log entries]

Note: we have duplicated log entries, so we should treat entries within the same time period (within one second) as a single failed workspace stop. The good news is that the backup is complete, which means no data loss (jsonPayload.conditions.finalBackupComplete=true).

jenting avatar Jan 04 '23 07:01 jenting
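
As an illustration of why this feeds the SLO, a minimal sketch of how a counter like workspace_stops_total is typically registered and incremented with the Prometheus Go client. This is not the actual ws-manager code; everything except the metric name is an assumption:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// workspaceStopsTotal mirrors the metric referenced above: every failed stop
// increments the reason="failed" series, which feeds the stop-workspace SLO.
var workspaceStopsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "workspace_stops_total",
	Help: "Total number of workspace stops, labelled by reason.",
}, []string{"reason"})

func init() {
	prometheus.MustRegister(workspaceStopsTotal)
}

// recordFailedStop is an illustrative helper; duplicated log entries within the
// same second should be collapsed into one failure before calling it.
func recordFailedStop() {
	workspaceStopsTotal.WithLabelValues("failed").Inc()
}
```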

@jenting can you update the description for https://github.com/gitpod-io/gitpod/issues/12021? It seems like you verified there is no data loss?

Also, regarding Milan's case, that is old, back when there was trouble in the IDE space causing workspaces to start slowly. I am skeptical it will provide value.

Do you have any traces for recent occurrences of this issue that help "paint a picture", so we can see more easily how to recreate it?

kylos101 avatar Jan 05 '23 15:01 kylos101

I checked the GCP logs over the last seven days.

There is only one instance without jsonPayload.conditions.finalBackupComplete=true. Checking the detailed GCP log, I am skeptical that it's a data-loss case because the start-workspace and stop-workspace requests are within a minute of each other. The workspace never seems to have gone into a Running state.

jenting avatar Jan 09 '23 05:01 jenting

Do you have any traces for recent occurrences of this issue that help "paint a picture", so we can see more easily how to recreate it?

The traces [1], [2], and [3] indicate the workspaces stopped correctly 🤔

jenting avatar Jan 10 '23 06:01 jenting

There are several reasons the kubelet can evict pods. We tolerate the disk-pressure and memory-pressure taints indefinitely, and the network-unavailable taint for 30 seconds. Refer to https://github.com/gitpod-io/gitpod/blob/1512a765b1f30c6ba9b1caae8fe5931fcb628145/components/ws-manager/pkg/manager/create.go#L544-L563

Tolerations of 300 seconds for the node-not-ready and node-unreachable taints are also applied to the workspace pod. A sketch of such a toleration block follows below.

jenting avatar Jan 10 '23 07:01 jenting
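
To make the toleration setup above concrete, a sketch of what such a block looks like using client-go types. The values mirror the comment above; the exact keys, effects, and durations in ws-manager are in the linked create.go and may differ, and the 300-second not-ready/unreachable tolerations may simply be the Kubernetes DefaultTolerationSeconds admission defaults:

```go
package manager

import corev1 "k8s.io/api/core/v1"

// workspaceTolerations sketches the tolerations described above: disk and
// memory pressure are tolerated indefinitely (no TolerationSeconds), while a
// node with an unavailable network is tolerated for 30 seconds before the
// pod is evicted. Effects here are assumed to be NoExecute.
func workspaceTolerations() []corev1.Toleration {
	networkUnavailableSeconds := int64(30)
	return []corev1.Toleration{
		{
			Key:      "node.kubernetes.io/disk-pressure",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoExecute,
		},
		{
			Key:      "node.kubernetes.io/memory-pressure",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoExecute,
		},
		{
			Key:               "node.kubernetes.io/network-unavailable",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &networkUnavailableSeconds,
		},
	}
}
```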