
The container could not be located when the pod was terminated

Open · utam0k opened this issue 2 years ago

Bug description

logs

This error message (The container could not be located when the pod was terminated) comes from the kubelet: https://github.com/kubernetes/kubernetes/blob/4aa451e8458a7cbf78ed464e9e47e87d424541ce/pkg/kubelet/kubelet_pods.go#L1810-L1817

Potentially related: https://github.com/kubernetes/kubernetes/issues/104107
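
For readers tracing this from the cluster side, here is a minimal, editorial sketch (not Gitpod or Kubernetes source) of how this kubelet-generated status can be spotted with client-go. The namespace, kubeconfig handling, and substring matching are assumptions for illustration only:

```go
// Editorial sketch: list pods and report containers whose terminated state
// carries the kubelet message discussed above. Namespace and kubeconfig
// handling are illustrative assumptions, not Gitpod code.
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "default" is a placeholder namespace; workspace pods live elsewhere.
	pods, err := client.CoreV1().Pods("default").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			term := cs.State.Terminated
			if term == nil {
				continue
			}
			// The kubelet records this message when it can no longer find the
			// container for a pod that is being terminated.
			if strings.Contains(term.Message, "could not be located when the pod was terminated") {
				fmt.Printf("pod %s: container %s terminated, reason=%q exitCode=%d\n",
					pod.Name, cs.Name, term.Reason, term.ExitCode)
			}
		}
	}
}
```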

Steps to reproduce

I don't know

Workspace affected

No response

Expected behavior

This error message does not appear in production.

Example repository

No response

Anything else?

No response

utam0k avatar Aug 10 '22 02:08 utam0k

@utam0k good find, I see this with gen60, too.

kylos101 avatar Aug 10 '22 02:08 kylos101

Still seeing this in gen63, cc @sagor999 for posterity. I realize you're working on other things, but I'm not sure if this is related. gen63 Logs.

kylos101 avatar Aug 25 '22 18:08 kylos101

Hm. I looked at one instance that had this error: 9e368af4-a27a-48d8-9c37-19dffb785c11 and inspected traces for it (eu63), but it did not show any errors. Furthermore, it showed that the workspace stopped correctly (according to traces). :thinking:

sagor999 avatar Aug 26 '22 00:08 sagor999

I found we can reproduce this issue with our integration test on the preview environment.

utam0k avatar Sep 01 '22 07:09 utam0k

I found we can reproduce this issue with our integration test on the preview environment.

How so @utam0k ?

kylos101 avatar Sep 04 '22 19:09 kylos101

I found we can reproduce this issue with our integration test on the preview environment.

@utam0k can you share? 🙏

kylos101 avatar Sep 09 '22 03:09 kylos101

I found we can reproduce this issue with our integration test on the preview environment.

@utam0k can you share? 🙏

Just run an integration test. However, I will include a fix to ignore this issue in the integration test.

utam0k avatar Sep 09 '22 05:09 utam0k

@utam0k do you mean temporarily ignoring it? For example, so we can incrementally have tests passing in the main branch while a fix is done for the failing test? Please share a permalink referencing the part you plan to ignore.

kylos101 avatar Sep 11 '22 23:09 kylos101

@kylos101 Sorry for the lack of information. I'm going to put this if statement into the main branch. Please note that the link I shared is not in the main branch. https://github.com/gitpod-io/gitpod/blob/4e824503da137d43ae901f37b972400254ca5b68/test/pkg/integration/workspace.go#L522-L523

utam0k avatar Sep 12 '22 23:09 utam0k

Got this error after a workspace timed out, running on a self-hosted preview env on Azure with a build from today.

jldec avatar Sep 30 '22 13:09 jldec

FYI, I tried to reproduce what @jldec experienced above on my (newly created, via Terraform) GCP environment. I set the global timeoutDefault to 5m and had 2 different workspaces time out successfully after the 5 minutes. That of course does not mean this never happens, but it does show the issue is flaky 🤔

lucasvaltl avatar Sep 30 '22 15:09 lucasvaltl

Another occurrence (Internal)

kylos101 avatar Oct 05 '22 20:10 kylos101

Ran into this again. Stopped two workspaces at the same time on an AWS Self-Hosted environment (2022.09rc6). One was fine, the other showed the "The container could not be located when the pod was terminated" error. @nandajavarma will link a support bundle with the logs. The workspace in question was: lucasvaltl-getrobotisla-clmsdohpu5r.ws.release-aws.[...]

lucasvaltl avatar Oct 07 '22 12:10 lucasvaltl

Here is the link to the support bundle for the release-aws cluster

nandajavarma avatar Oct 07 '22 12:10 nandajavarma

From all the reports above, the issue happened while the workspace was stopping. I'm wondering if the sequence is something like this (a sketch of how one might check this follows after this comment):

  • the ws-daemon backs up the content to GCS
  • the ws-daemon requests lots of resources from the node while backing up the content
  • the kubelet evicts the workspace pod because the node is out of resources
  • the workspace pod transitions into an unknown state, and the error message is reported
  • the ws-manager catches the workspace pod termination and shows the error message to the user

jenting avatar Oct 13 '22 08:10 jenting
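
A minimal editorial sketch of how this hypothesis could be checked, under the assumption that a kubelet resource-pressure eviction leaves the pod in phase Failed with reason "Evicted" and the node reporting pressure conditions. The function name, namespace handling, and clientset wiring are illustrative, not Gitpod code:

```go
package diagnose

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// wasEvictedForPressure reports whether a failed workspace pod was evicted by
// the kubelet and whether its node shows memory or disk pressure, which would
// support the backup-pressure hypothesis above.
func wasEvictedForPressure(ctx context.Context, client kubernetes.Interface, ns, podName string) (bool, error) {
	pod, err := client.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	// Kubelet-evicted pods end up in phase Failed with reason "Evicted".
	if pod.Status.Phase != corev1.PodFailed || pod.Status.Reason != "Evicted" {
		return false, nil
	}
	node, err := client.CoreV1().Nodes().Get(ctx, pod.Spec.NodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	// Check the node the workspace ran on for resource-pressure conditions.
	for _, cond := range node.Status.Conditions {
		if (cond.Type == corev1.NodeMemoryPressure || cond.Type == corev1.NodeDiskPressure) &&
			cond.Status == corev1.ConditionTrue {
			return true, nil
		}
	}
	return false, nil
}
```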

This is still an issue with gen73, recent logs.

@jenting one way to resolve this is to ship PVC and avoid backup with ws-daemon altogether; you're right that stopping workspaces incurs a heavy resource hit with ws-daemon.

I am going to remove this from our project for now (putting it back in the inbox) so that there is less work "in progress" and we can focus. :wave:

kylos101 avatar Oct 28 '22 17:10 kylos101

I've had the same error just now. When I tried to restart the workspace (gitpodio-gitpod-zjkqtz6c188) I got:

cannot pull image: rpc error: code = NotFound desc = failed to pull and unpack image "reg.ws-us73.gitpod.io:20000/remote/d584ec9e-7038-401f-b93b-51b164fcdb33:latest": 
failed to resolve reference "reg.ws-us73.gitpod.io:20000/remote/d584ec9e-7038-401f-b93b-51b164fcdb33:latest": reg.ws-us73.gitpod.io:20000/remote/d584ec9e-7038-401f-b93b-51b164fcdb33:latest: not found

Tried multiple times and I'm not able to start my workspace.

svenefftinge avatar Oct 31 '22 16:10 svenefftinge

Thanks for the report. But judging from the error log, this is a different symptom.

jenting avatar Nov 01 '22 04:11 jenting

@easyCZ just experienced this after 10 minutes in https://gitpodio-gitpod-cqng3518h09.ws-eu77.gitpod.io/, he was using VS Code Desktop for the IDE via the unstable channel.

workspace logs · webapp logs

kylos101 avatar Nov 28 '22 15:11 kylos101

@easyCZ it looks like:

  1. the IDE loop ended
  2. But I'm unsure why. Perhaps this is a starting point for debugging? @akosyakov @utam0k wdyt?
  3. FYI, I checked journalctl logs on the node too, but nothing promising was evident for this workspace. :cry:

kylos101 avatar Nov 28 '22 15:11 kylos101

FYI: this error can also be seen, rarely, in integration tests. We now ignore it so the integration tests pass stably. https://github.com/gitpod-io/gitpod/blob/116ea559bc0979227467c9e30c717e786c2bee97/test/pkg/integration/workspace.go#L733

My first concern is whether the backup is complete. How about checking that first? I feel that would change the priority of this issue.

utam0k avatar Nov 28 '22 23:11 utam0k
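
For readers following along, a hedged paraphrase of the kind of check the linked integration-test code performs (the actual code lives in test/pkg/integration/workspace.go; the function and constant names here are illustrative, not the real identifiers):

```go
package integration

import "strings"

// Message the kubelet reports when it loses track of a terminated container.
const containerNotLocatedMsg = "The container could not be located when the pod was terminated"

// isKnownFlakyStopError treats the kubelet's "container could not be located"
// message as a known flake so it does not fail an otherwise successful
// workspace stop in the integration tests.
func isKnownFlakyStopError(failedConditionMsg string) bool {
	return strings.Contains(failedConditionMsg, containerNotLocatedMsg)
}
```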

@utam0k @kylos101 Hey! @WVerlaek, @jenting, and I are doing the refinement and have a couple of questions (mentioned in the description). Do you know the answer to any of them?

atduarte avatar Jan 03 '23 08:01 atduarte

We see the log "The container could not be located when the pod was deleted. The container used to be Running" in the GCP logs (under the jsonPayload.conditions.failed field).

jenting avatar Jan 04 '23 05:01 jenting

This issue increases the metric workspace_stops_total{reason="failed"}, which affects our stop-workspace SLO.

We can find occurrences by comparing the Grafana dashboard with the GCP logs.

[screenshots: Grafana workspace_stops_total panel and the corresponding GCP log entries]

Note: we have duplicated log entries, so we should treat entries within the same time period (within one second) as a single failed workspace stop. The good news is that the backup is complete, which means no data loss (jsonPayload.conditions.finalBackupComplete=true).

jenting avatar Jan 04 '23 07:01 jenting
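
As an illustration of why this feeds the SLO, a minimal sketch of how a counter like workspace_stops_total is typically registered and incremented with the Prometheus Go client. This is not the actual ws-manager code; everything except the metric name is an assumption:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// workspaceStopsTotal mirrors the metric referenced above: every failed stop
// increments the reason="failed" series, which feeds the stop-workspace SLO.
var workspaceStopsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "workspace_stops_total",
	Help: "Total number of workspace stops, labelled by reason.",
}, []string{"reason"})

func init() {
	prometheus.MustRegister(workspaceStopsTotal)
}

// recordFailedStop is an illustrative helper; duplicated log entries within the
// same second should be collapsed into one failure before calling it.
func recordFailedStop() {
	workspaceStopsTotal.WithLabelValues("failed").Inc()
}
```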

@jenting can you update the description for https://github.com/gitpod-io/gitpod/issues/12021? It seems like you verified there is no data loss?

Also, regarding Milan's case, that is old, back when there was trouble in the IDE space causing workspaces to start slowly. I am skeptical it will provide value.

Do you have any traces for recent occurrences of this issue that help "paint a picture", so we can see more easily how to recreate it?

kylos101 avatar Jan 05 '23 15:01 kylos101

I checked the GCP logs over the last seven days.

There is only one instance without jsonPayload.conditions.finalBackupComplete=true. Checking the detailed GCP log, I am skeptical that it's a data-loss case because the start-workspace and stop-workspace requests are within a minute of each other. The workspace never seems to have gone into a Running state.

jenting avatar Jan 09 '23 05:01 jenting

Do you have any traces for recent occurrences of this issue that help "paint a picture", so we can see more easily how to recreate it?

The traces [1], [2], and [3] indicate the workspaces stopped correctly 🤔

jenting avatar Jan 10 '23 06:01 jenting

There are several reasons the kubelet can evict pods. We tolerate the disk-pressure and memory-pressure taints indefinitely, and the network-unavailable taint for 30 seconds. Refer to https://github.com/gitpod-io/gitpod/blob/1512a765b1f30c6ba9b1caae8fe5931fcb628145/components/ws-manager/pkg/manager/create.go#L544-L563

Tolerations of 300 seconds for the node-not-ready and node-unreachable taints are also applied to the workspace pod. A sketch of such a toleration block follows below.

jenting avatar Jan 10 '23 07:01 jenting
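
To make the toleration setup above concrete, a sketch of what such a block looks like using client-go types. The values mirror the comment above; the exact keys, effects, and durations in ws-manager are in the linked create.go and may differ, and the 300-second not-ready/unreachable tolerations may simply be the Kubernetes DefaultTolerationSeconds admission defaults:

```go
package manager

import corev1 "k8s.io/api/core/v1"

// workspaceTolerations sketches the tolerations described above: disk and
// memory pressure are tolerated indefinitely (no TolerationSeconds), while a
// node with an unavailable network is tolerated for 30 seconds before the
// pod is evicted. Effects here are assumed to be NoExecute.
func workspaceTolerations() []corev1.Toleration {
	networkUnavailableSeconds := int64(30)
	return []corev1.Toleration{
		{
			Key:      "node.kubernetes.io/disk-pressure",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoExecute,
		},
		{
			Key:      "node.kubernetes.io/memory-pressure",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoExecute,
		},
		{
			Key:               "node.kubernetes.io/network-unavailable",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &networkUnavailableSeconds,
		},
	}
}
```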