Windows-Containers icon indicating copy to clipboard operation
Windows-Containers copied to clipboard

Pods stuck in "Terminating" state on Windows nodes when deleting the pod after docker restart

Open ilanitr opened this issue 2 years ago • 33 comments

We are using MCR version 20.10.9 (https://docs.mirantis.com/mcr/20.10/release-notes/20-10-9.html) and we have seen that after restarting the docker service and deleting a pod, the pod gets stuck at "Terminating" state. Describing the pod shows:

container ecddba6213ecd857171e6af11988979b5f0d16629509ad6f42385147ab24f2d9: driver "windowsfilter" failed to remove root filesystem: failed to detach VHD: failed to open virtual disk: The process cannot access the file because it is being used by another process.: rename C:\ProgramData\docker\windowsfilter\ecddba6213ecd857171e6af11988979b5f0d16629509ad6f42385147ab24f2d9 C:\ProgramData\docker\windowsfilter\ecddba6213ecd857171e6af11988979b5f0d16629509ad6f42385147ab24f2d9-removing: Access is denied.

Kubelet logs:

I0309 16:38:47.933233    2040 kuberuntime_container.go:661] "Killing container with a grace period override" pod="default/windows-log-test-6c9956b86d-t5jcp" podUID=9afb7084-8a8d-4239-8d7c-18176c80ae21 containerName="winloggertest" containerID="docker://8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2" gracePeriod=30
I0309 16:38:47.933773    2040 fake_topology_manager.go:47] "RemoveContainer" containerID="8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2"
E0309 16:38:47.945199    2040 remote_runtime.go:296] "RemoveContainer from runtime service failed" err="rpc error: code = Unknown desc = failed to remove container \"8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2\": Error response from daemon: container 8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2: driver \"windowsfilter\" failed to remove root filesystem: failed to detach VHD: failed to open virtual disk: The process cannot access the file because it is being used by another process.: rename C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2 C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2-removing: Access is denied." containerID="8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2"
E0309 16:38:47.945199    2040 kuberuntime_gc.go:146] "Failed to remove container" err="rpc error: code = Unknown desc = failed to remove container \"8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2\": Error response from daemon: container 8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2: driver \"windowsfilter\" failed to remove root filesystem: failed to detach VHD: failed to open virtual disk: The process cannot access the file because it is being used by another process.: rename C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2 C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2-removing: Access is denied." containerID="8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2"

Using kubectl delete pod --force did clean the pod from the cluster but according to kubelet logs, it keeps trying to remove the old container with no luck:

I0309 16:50:48.263635    2040 kuberuntime_container.go:661] "Killing container with a grace period override" pod="default/windows-log-test-6c9956b86d-t5jcp" podUID=9afb7084-8a8d-4239-8d7c-18176c80ae21 containerName="winloggertest" containerID="docker://8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2" gracePeriod=30
I0309 16:50:48.264169    2040 fake_topology_manager.go:47] "RemoveContainer" containerID="8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2"
E0309 16:50:48.275910    2040 remote_runtime.go:296] "RemoveContainer from runtime service failed" err="rpc error: code = Unknown desc = failed to remove container \"8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2\": Error response from daemon: container 8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2: driver \"windowsfilter\" failed to remove root filesystem: failed to detach VHD: failed to open virtual disk: The process cannot access the file because it is being used by another process.: rename C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2 C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2-removing: Access is denied." containerID="8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2"
E0309 16:50:48.275910    2040 kuberuntime_gc.go:146] "Failed to remove container" err="rpc error: code = Unknown desc = failed to remove container \"8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2\": Error response from daemon: container 8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2: driver \"windowsfilter\" failed to remove root filesystem: failed to detach VHD: failed to open virtual disk: The process cannot access the file because it is being used by another process.: rename C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2 C:\\ProgramData\\docker\\windowsfilter\\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2-removing: Access is denied." containerID="8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2"

Searching for handles on this path shows that System process holds sandbox.vhdx:

System	4	File	C:\ProgramData\docker\windowsfilter\8a26af90e71ec6eb8b8015253d6b76001d5efde3bd26e587eef9b8a8218ed2e2\sandbox.vhdx

Environment:

  • Kubernetes version: v1.21.5
  • OS version: Windows Server 2019 Datacenter 10.0.17763.2565
  • Docker version: docker://20.10.9

ilanitr avatar Mar 09 '22 17:03 ilanitr

Is this Kubernetes cluster running in AKS or elsewhere? It looks kind of similar to #106, although that issue appeared related to cleaning up a logfile and your issues does not. @kevpar or @dcantah do you have any insights here?

cwilhit avatar Mar 12 '22 00:03 cwilhit

The issue is reproduced on GKE on-prem cluster running in vSphere 7.0.

ilanitr avatar Mar 14 '22 16:03 ilanitr

This issue has been open for 30 days with no updates. , please provide an update or close this issue.

ghost avatar Apr 14 '22 16:04 ghost

We are still experiencing this issue. Could you reproduce this issue?

ilanitr avatar Apr 14 '22 18:04 ilanitr

Hi, just following up to let you know we are currently investigating a similar issue. Nothing more to report right now though.

cwilhit avatar Apr 21 '22 17:04 cwilhit

This issue has been open for 30 days with no updates. , please provide an update or close this issue.

ghost avatar May 22 '22 16:05 ghost

This issue has been open for 30 days with no updates. , please provide an update or close this issue.

ghost avatar Jun 22 '22 16:06 ghost

Hey @ilanitr, we're in the middle of a fix for this and tracking a similar issue internally (internal ID: 31661934)

fady-azmy-msft avatar Jun 29 '22 01:06 fady-azmy-msft

This issue has been open for 30 days with no updates. , please provide an update or close this issue.

ghost avatar Jul 29 '22 16:07 ghost

This issue has been open for 30 days with no updates. , please provide an update or close this issue.

ghost avatar Aug 30 '22 16:08 ghost

No updates to share yet, but we're making progress on the fix for this.

fady-azmy-msft avatar Aug 31 '22 16:08 fady-azmy-msft

This issue has been open for 30 days with no updates. , please provide an update or close this issue.

ghost avatar Oct 01 '22 16:10 ghost

This issue has been open for 30 days with no updates. , please provide an update or close this issue.

ghost avatar Oct 31 '22 16:10 ghost

This issue has been open for 30 days with no updates. , please provide an update or close this issue.

ghost avatar Dec 01 '22 16:12 ghost

We've already worked on a fix, but we don't have timelines to share yet of when this will be available.

fady-azmy-msft avatar Dec 01 '22 19:12 fady-azmy-msft

This issue has been open for 30 days with no updates. , please provide an update or close this issue.

ghost avatar Jan 01 '23 16:01 ghost

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

@fady-azmy-msft you were the last to comment on this a while ago - has there been any progress on this item?

jeffwmiles avatar Jul 19 '23 22:07 jeffwmiles

@jeffwmiles yes, the fix was validated internally and we expect the fix to roll out soon. I don't have a timeline to share unfortunately since this depends on our backporting priority and process but I will update the thread when it comes out.

fady-azmy-msft avatar Aug 16 '23 17:08 fady-azmy-msft

I took am looking forward to a fix.

braveness23 avatar Aug 16 '23 21:08 braveness23

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.

This issue has been open for 30 days with no updates. no assignees, please provide an update or close this issue.