AKS
Windows pod stuck in Terminating state with subPath in volumeMounts
Describe the bug
A pod on a Windows-based node is stuck in the Terminating state during deletion/replacement of the pod whenever I add two volume mounts with subPath. This does not happen if I add a single volume mount with subPath.
To Reproduce
Steps to reproduce the behavior:
- Create a deployment with the volume and volumeMounts below. Make sure to create the ConfigMaps as well.
volumeMounts:
- name: configs
  mountPath: C:\inetpub\wwwroot\web.config
  subPath: web.config
- name: cloudconfig
  mountPath: C:\CloudConfig
  subPath: credentials
volumes:
- name: configs
  configMap:
    name: test-report-executor-cm
- name: cloudconfig
  configMap:
    name: cloud-config
- Run the following command: kubectl replace -n namespace -f filename.yaml --force
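For reference, here is a minimal Deployment sketch that puts the fragments above together. The deployment name is a placeholder, the image is just the stock IIS image mentioned later in this thread, and both ConfigMaps must exist beforehand:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: subpath-repro            # placeholder name
  namespace: namespace           # matches the kubectl replace command above
spec:
  replicas: 1
  selector:
    matchLabels:
      app: subpath-repro
  template:
    metadata:
      labels:
        app: subpath-repro
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
      - name: web
        image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2022
        volumeMounts:
        - name: configs
          mountPath: C:\inetpub\wwwroot\web.config
          subPath: web.config
        - name: cloudconfig
          mountPath: C:\CloudConfig
          subPath: credentials
      volumes:
      - name: configs
        configMap:
          name: test-report-executor-cm
      - name: cloudconfig
        configMap:
          name: cloud-config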
Expected behavior
The pod should not be stuck in the Terminating state. The pod has now been stuck in the Terminating state for more than 5 days.
Environment (please complete the following information):
- CLI Version - v1.27.7
- Kubernetes version - 1.27.7
Additional context
kubelet logs:
E0612 15:37:32.338453 4496 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/configmap/X17e892a03e1c-configs podName:X nodeName:}" failed. No retries permitted until 2024-06-12 15:37:33.3384537 +0000 UTC m=+1476.528238001 (durationBeforeRetry 1s). Error: UnmountVolume.TearDown failed for volume "configs" (UniqueName: "kubernetes.io/configmap/X-configs") pod "XXXX4744-X" (UID: "X") : remove c:\var\lib\kubelet\pods\X\volumes\kubernetes.io~configmap\configs..2024_06_12_15_34_06.2642743558\Web.config: The process cannot access the file because it is being used by another process.
We have the same issue (Sitecore Windows containers). Our analysis of the problem so far:
- The container runs as user 'containeradministrator', and once the ConfigMaps for our certs are mounted, they can't be dismounted due to user permissions.
- We also only seem to hit this issue with more than one volumeMount.
To add to the above: our solution was to have one single mount plus a script that reads environment variables and writes the contents to the target location.
For example:
# Write the cert contents from environment variables, then import them into the machine stores
mkdir c:\inetpub\wwwroot\certs\
echo "$env:test1_crt" > c:\inetpub\wwwroot\certs\test1.crt
echo "$env:test2_crt" > c:\inetpub\wwwroot\certs\test2.crt
try {
    ForEach ($file in Get-ChildItem -Path c:\inetpub\wwwroot\certs\*.crt) {
        Import-Certificate -FilePath $file.FullName -CertStoreLocation Cert:\LocalMachine\Root
        Import-Certificate -FilePath $file.FullName -CertStoreLocation Cert:\LocalMachine\Ca
    }
}
catch {
    Write-Host "An error occurred:"
    Write-Host $_
}
By doing so, the pods would terminate as expected and our end result was still achieved.
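For completeness, one way to feed those environment variables is via env/valueFrom on the container spec. The comment above does not say where the values come from, so the ConfigMap name and keys below are assumptions:
env:
- name: test1_crt
  valueFrom:
    configMapKeyRef:
      name: certs-cm        # hypothetical ConfigMap holding the certificate text
      key: test1.crt
- name: test2_crt
  valueFrom:
    configMapKeyRef:
      name: certs-cm        # hypothetical
      key: test2.crt
Because the cert content reaches the container as environment variables rather than as a mounted subPath, no second subPath volume mount is needed.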
I can confirm this issue.
If you mount two volumes with a subPath (in our case one of them is a ConfigMap and the other one is an Azure Files share), the pod stays in the Terminating state indefinitely.
I wanted to pinpoint which subsystem was responsible for the issue, so I grabbed the default IIS image from MS:
mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2022
I tried mounting a ConfigMap with a subPath, and everything was OK. Then I mounted the same ConfigMap with a subPath but at another location, and the pod could not terminate.
volume_mount {
  name       = "env-config"
  mount_path = "c:/env.json"
  sub_path   = "env.json"
}
volume_mount {
  name       = "env-config"
  mount_path = "c:/env2.json"
  sub_path   = "env.json"
}
Kubelet logs:
E0712 07:36:38.858534 6376 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/configmap/9ebd5403-2617-4795-b3fe-7a483a723448-env-config podName:9ebd5403-2617-4795-b3fe-7a483a723448 nodeName:}" failed. No retries permitted until 2024-07-12 07:36:39.3577023 +0000 UTC m=+1705.940224101 (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "env-config" (UniqueName: "kubernetes.io/configmap/9ebd5403-2617-4795-b3fe-7a483a723448-env-config") pod "9ebd5403-2617-4795-b3fe-7a483a723448" (UID: "9ebd5403-2617-4795-b3fe-7a483a723448") : remove c:\var\lib\kubelet\pods\9ebd5403-2617-4795-b3fe-7a483a723448\volumes\kubernetes.io~configmap\env-config\..2024_07_12_07_11_11.3010365152\env.json: The process cannot access the file because it is being used by another process.
In this example I am using the same ConfigMap volume twice, but originally this was happening with a ConfigMap + an Azure File Share, which are totally unrelated storage types.
This issue seems to only affect Windows containers, as I tested the same setup with a Linux container and could not reproduce it.
It seems to be the same as https://github.com/kubernetes/kubernetes/issues/112630.
It is expected to be fixed in kubelet.
cc @jsturtevant
There is another workaround for anyone who wants to use it to import certs: mount the volumes to separate folders without subPath.
volumeMounts:
- name: certs-path-1
  mountPath: /certs/1
- name: certs-path-2
  mountPath: /certs/2
with volumes like:
volumes:
- name: certs-path-1
  configMap:
    name: trust-bundle-ex
    items:
    - key: certs-ex.pem
      path: certs-ex.pem
- name: certs-path-2
  configMap:
    name: trust-bundle
    items:
    - key: certs.pem
      path: certs.pem
Be aware that the files created in those locations ("/certs/1/certs-ex.pem") are not regular files (even though you can read their content via Get-Content or other tools); they are symlinks, so something like "Import-Certificate -FilePath /certs/1/certs-ex.pem -CertStoreLocation Cert:\LocalMachine\Root" does NOT work on them.
You have to use the real file behind that symlink: Import-Certificate -FilePath /certs/1/..data/certs-ex.pem -CertStoreLocation Cert:\LocalMachine\Root
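As a sketch of that, assuming the two mounts and file names from the workaround above and that each .pem holds a single certificate, the import could target the real files under the ..data directories directly:
# Import the real files behind the ConfigMap symlinks (paths assume the mounts above)
try {
    ForEach ($file in @("C:\certs\1\..data\certs-ex.pem", "C:\certs\2\..data\certs.pem")) {
        Import-Certificate -FilePath $file -CertStoreLocation Cert:\LocalMachine\Root
    }
}
catch {
    Write-Host "An error occurred:"
    Write-Host $_
}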
I do not get it: how could this be closed (in 30 days) when the issue was confirmed and work is done or ongoing, per @AbelHu's "It is expected to be fixed in kubelet"?
@jsturtevant or @kiashok, can you please provide an update?
This is still an issue and is not yet ready to be closed.
Hello @AbelHu and @kiashok, can you please provide an update?
Adding the "fixing" tag to block stale/close. This fix looks to be dependent on upstream fixes in Kubernetes, as per the earlier comment from @AbelHu.
Hey @AbelHu @kiashok, are there any updates on this issue? Is it still in the 'fixing' phase?