The container could not be located when the pod was terminated
Bug description
This error message (The container could not be located when the pod was terminated) comes from kubelet:
https://github.com/kubernetes/kubernetes/blob/4aa451e8458a7cbf78ed464e9e47e87d424541ce/pkg/kubelet/kubelet_pods.go#L1810-L1817
Potentially related: https://github.com/kubernetes/kubernetes/issues/104107
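For anyone triaging this in a cluster, here is a minimal client-go sketch that lists containers kubelet left in this state (the helper name and namespace argument are mine, not from the Gitpod codebase; as far as I can tell, kubelet reports such containers as Terminated with reason ContainerStatusUnknown):

```go
// Hypothetical triage helper: list containers whose status kubelet could not
// locate when their pod was terminated.
package triage

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func findUnlocatedContainers(ctx context.Context, client kubernetes.Interface, namespace string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			t := cs.State.Terminated
			if t != nil && strings.Contains(t.Message, "could not be located") {
				// kubelet sets this terminal state when it lost track of the container.
				fmt.Printf("%s/%s container %s: reason=%s exitCode=%d\n",
					pod.Namespace, pod.Name, cs.Name, t.Reason, t.ExitCode)
			}
		}
	}
	return nil
}
```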
Steps to reproduce
I don't know
Workspace affected
No response
Expected behavior
This error message does not appear in production.
Example repository
No response
Anything else?
No response
@utam0k good find, I see this with gen60, too.
Still seeing this in gen63, cc: @sagor999 for posterity. I realize you're working on other things, but I'm not sure if this is related. gen63 logs.
Hm. I looked at one instance that had this error (9e368af4-a27a-48d8-9c37-19dffb785c11) and inspected its traces (eu63), but they did not show any errors. Furthermore, they showed that the workspace stopped correctly (according to traces). :thinking:
I found we can reproduce this issue with our integration test on the preview environment.
How so, @utam0k?
@utam0k can you share? 🙏
Just run an integration test. However, I will include a fix to ignore this issue in the integration test.
@utam0k do you mean to temporarily ignore it? For example, so we can incrementally get tests passing on the main branch while a fix for the failing test is worked on? Please share a permalink referencing the part you plan to ignore.
@kylos101 Sorry for the lack of detail. I'm going to put this if statement into the main branch. Please note the link I shared is not on the main branch. https://github.com/gitpod-io/gitpod/blob/4e824503da137d43ae901f37b972400254ca5b68/test/pkg/integration/workspace.go#L522-L523
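Roughly, the guard looks like this (a paraphrased sketch only, not the exact code behind the link; the helper name is illustrative):

```go
package integration

import "strings"

// isIgnorableStopFailure reports whether a workspace "failed" condition is just
// the known kubelet race during pod termination, which the test tolerates as
// long as the final backup completed.
func isIgnorableStopFailure(failedCondition string) bool {
	return strings.Contains(failedCondition,
		"The container could not be located when the pod was terminated")
}
```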
Got this error after a workspace timed out on a self-hosted preview environment on Azure with a build from today.
FYI, I tried to reproduce what @jldec experienced above on my (newly created, via terraform) GCP environment. I set the global timeoutDefault to 5m and had 2 different workspaces time out successfully after the 5 minutes. That of course does not mean this never happens, but it does show that the behavior is flaky 🤔
Another occurrence (Internal)
Ran into this again. Stopped two workspaces at the same time on an AWS Self-Hosted environment (2022.09rc6). One was fine, the other showed the "The container could not be located when the pod was terminated" error. @nandajavarma will link a support bundle with the logs. The workspace in question was: lucasvaltl-getrobotisla-clmsdohpu5r.ws.release-aws.[...]
Here is the link to the support bundle for the release-aws cluster.
From all the reports above, the issue happened while the workspace was stopping. I'm wondering if the sequence is something like this (see the sketch after this list):
- ws-daemon backs up the workspace content to GCS
- ws-daemon requests a lot of resources from the node while backing up the content
- the kubelet evicts the workspace pod because the node is out of resources
- the workspace pod transitions into an unknown state, and the error message is reported
- ws-manager sees that the workspace pod terminated and shows the error message to the user
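If this theory holds, the eviction should show up as Kubernetes events. A hypothetical client-go check (function name and namespace argument are mine, not from the Gitpod codebase):

```go
// List "Evicted" events in the workspace namespace to confirm (or rule out)
// kubelet evictions around the time the error appears.
package triage

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func listEvictionEvents(ctx context.Context, client kubernetes.Interface, namespace string) error {
	events, err := client.CoreV1().Events(namespace).List(ctx, metav1.ListOptions{
		FieldSelector: "reason=Evicted",
	})
	if err != nil {
		return err
	}
	for _, ev := range events.Items {
		fmt.Printf("%v %s/%s: %s\n",
			ev.LastTimestamp, ev.InvolvedObject.Kind, ev.InvolvedObject.Name, ev.Message)
	}
	return nil
}
```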
This is still an issue with gen73; recent logs.
@jenting one way to resolve this is to ship PVC and avoid backup with ws-daemon altogether; you're right that stopping workspaces incurs a heavy resource hit with ws-daemon.
I am going to remove this from our project for now (putting it back in the inbox) so that there is less work "in progress" and we can focus. :wave:
I've had the same error just now. When I tried to restart the workspace (gitpodio-gitpod-zjkqtz6c188) I got:
cannot pull image: rpc error: code = NotFound desc = failed to pull and unpack image "reg.ws-us73.gitpod.io:20000/remote/d584ec9e-7038-401f-b93b-51b164fcdb33:latest":
failed to resolve reference "reg.ws-us73.gitpod.io:20000/remote/d584ec9e-7038-401f-b93b-51b164fcdb33:latest": reg.ws-us73.gitpod.io:20000/remote/d584ec9e-7038-401f-b93b-51b164fcdb33:latest: not found
Tried multiple times and I'm not able to start my workspace.
Thanks for the report, but judging from the error log, that is a different symptom.
@easyCZ just experienced this after 10 minutes in https://gitpodio-gitpod-cqng3518h09.ws-eu77.gitpod.io/; he was using VS Code Desktop as the IDE via the unstable channel.
@easyCZ it looks like:
- the IDE loop ended
- but I'm unsure why. Perhaps this is a start for debugging? @akosyakov @utam0k wdyt?
- FYI, I checked the journalctl logs on the node too, but nothing promising turned up for this workspace. :cry:
FYI: This error can also be seen rarely in integration tests. We now ignore this error so that the integration tests pass stably. https://github.com/gitpod-io/gitpod/blob/116ea559bc0979227467c9e30c717e786c2bee97/test/pkg/integration/workspace.go#L733
My first concern is whether the backup is complete. How about checking that first? I feel that would change the priority of this issue.
@utam0k @kylos101 Hey! @WVerlaek, @jenting, and I are doing the refinement and have a couple of questions (mentioned in the description). Do you know the answer to any of them?
We see the log "The container could not be located when the pod was deleted. The container used to be Running" in the GCP logs (under the jsonPayload.conditions.failed field).
This issue increases the metric workspace_stops_total{reason="failed"}, which affects our stop-workspace SLO. We can find occurrences by comparing Grafana with the GCP logs.


Note: we have duplicated logs, so we should treat logs within the same time period (within one second) as a single failed workspace stop. The good news is that the backup is complete (jsonPayload.conditions.finalBackupComplete=true), which means there is no data loss.
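For context on the metric side, a minimal client_golang sketch of how such a counter is typically defined and incremented (only the metric name and the reason label above are taken from our dashboards; everything else here is illustrative and not the ws-manager source):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// workspaceStopsTotal mirrors the workspace_stops_total{reason=...} metric from
// the dashboards; the label values below are illustrative.
var workspaceStopsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "workspace_stops_total",
	Help: "Total number of workspace stops, partitioned by stop reason.",
}, []string{"reason"})

func recordStop(failed bool) {
	reason := "regular" // illustrative; only reason="failed" is confirmed above
	if failed {
		// Stops that surface "The container could not be located when the pod was
		// terminated" land here, which is what degrades the stop-workspace SLO.
		reason = "failed"
	}
	workspaceStopsTotal.WithLabelValues(reason).Inc()
}
```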
@jenting can you update the description for https://github.com/gitpod-io/gitpod/issues/12021? It seems like you verified there is no data loss?
Also, regarding Milan's case, that is old, back when there was trouble in the IDE space causing workspaces to start slowly. I am skeptical it will provide value.
Do you have any traces for recent occurrences of this issue that help "paint a picture", so we can see more easily how to recreate it?
I checked the GCP logs over the last seven days. There is only one instance without jsonPayload.conditions.finalBackupComplete=true. Looking at the detailed GCP log, I am skeptical that it's a data-loss case, because the start-workspace and stop-workspace requests were within a minute of each other; the workspace never seems to have reached a Running state.
Do you have any traces for recent occurrences of this issue that help "paint a picture", so we can see more easily how to recreate it?
The traces [1], [2], and [3] indicate the workspaces stopped correctly 🤔
There are several reasons the kubelet might evict pods. We add disk-pressure and memory-pressure tolerations with no time limit, and a network-unavailable toleration of 30 seconds. Refer to https://github.com/gitpod-io/gitpod/blob/1512a765b1f30c6ba9b1caae8fe5931fcb628145/components/ws-manager/pkg/manager/create.go#L544-L563
Tolerations of 300 seconds for node-not-ready and node-unreachable are also applied to the workspace pod.
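For readability, here is a shape-only sketch of those tolerations as they would appear on the workspace pod spec (values paraphrased from the description above; the authoritative list is in create.go, linked above):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// workspaceTolerations approximates the tolerations described above; the exact
// keys, effects, and durations live in ws-manager's create.go.
func workspaceTolerations() []corev1.Toleration {
	thirtySeconds := int64(30)
	fiveMinutes := int64(300)
	return []corev1.Toleration{
		// tolerate node resource pressure indefinitely (no TolerationSeconds)
		{Key: "node.kubernetes.io/disk-pressure", Operator: corev1.TolerationOpExists},
		{Key: "node.kubernetes.io/memory-pressure", Operator: corev1.TolerationOpExists},
		// tolerate an unavailable network for 30 seconds
		{
			Key:               "node.kubernetes.io/network-unavailable",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &thirtySeconds,
		},
		// node not-ready / unreachable: tolerated for 300 seconds before eviction
		{
			Key:               "node.kubernetes.io/not-ready",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &fiveMinutes,
		},
		{
			Key:               "node.kubernetes.io/unreachable",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &fiveMinutes,
		},
	}
}
```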