service-fabric
[BUG] - dockerd process randomly stops working preventing Service Fabric from running container applications
Describe the bug: We are experiencing an issue with Service Fabric containers on one of our nodes. During an application upgrade, we got the errors below. The dockerd daemon shows as running in Task Manager, but docker commands such as 'docker info' fail with 'ERROR: Cannot connect to the Docker daemon at npipe:////./pipe/docker_engine. Is the docker daemon running?' The issue never resolves on its own. We tried to kill dockerd.exe but got 'Access denied'. This had been working fine until now.
FinishDeactivateContainer: Error=0x80131500, Message=System.Fabric.FabricException (-2146233088) Deactivation failed for ContainerName='sf-150-63aa1a55-8c98-415d-bd0c-b0f51ac8e306_9c1a4270-c552-4608-be56-dfa20ec572be'. System.Fabric.FabricException (-2147017729) Docker operation timed out. RequestUri=http://docker_engine/containers/sf-150-63aa1a55-8c98-415d-bd0c-b0f51ac8e306_9c1a4270-c552-4608-be56-dfa20ec572be/kill
FinishDeactivateContainer: Error=0x80131500, Message=System.Fabric.FabricException (-2146233088) Deactivation failed for ContainerName='sf-157-b546cfed-8d72-4729-bcdf-1dc0c97af198_b20a5bc3-aa63-4bad-9080-427441c3f8f4'. System.Fabric.FabricException (-2147017729) Failed to connect to DockerService at named pipe 'docker_engine'. IsDockerServiceManagedBySF=TrueOrignalException=System.IO.IOException: The semaphore timeout period has expired.
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath) at System.IO.Pipes.NamedPipeClientStrea
Any ideas?
Area/Component: Container hosting
To Reproduce: This happened randomly; we have no way of reproducing it.
Expected behavior: Containers should run reliably, and dockerd should not stop working indefinitely.
Observed behavior: dockerd shows as running, but none of the docker commands work, so Service Fabric container applications don't run.
Service Fabric Runtime Version: 9.0.1028.9590
Environment:
- Azure
- OS: Windows Server 2019 Datacenter with Containers
Assignees: /cc @microsoft/service-fabric-triage @craftyhouse @sukanyamsft
We see the same thing running 2019 Datacenter Core with Containers in our SF clusters, although we are using dockerd directly and not via the Service Fabric containers integration. So I don't think this is specific to SF (but maybe?).
Right now, we mitigate the issue by automatically restarting dockerd on too many semaphore exceptions. This is painful because restarting also takes the containers down with it.
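For anyone who wants to try a similar mitigation, here is a minimal sketch of such a watchdog. It is not the actual service described above: instead of counting semaphore exceptions it substitutes a timed `docker info` probe, and every name and threshold in it (`FailureWindow`, `MAX_FAILURES`, `WINDOW_S`, the poll interval) is hypothetical. It also assumes dockerd is registered as a Windows service named `docker` and that the script runs with enough privileges to restart it, which may not hold when Service Fabric manages dockerd itself.

```python
import subprocess
import time
from collections import deque

# Hypothetical thresholds; tune for your own environment.
PROBE_TIMEOUT_S = 30   # how long to wait for `docker info`
MAX_FAILURES = 3       # failures within the window that trigger a restart
WINDOW_S = 600         # sliding window length in seconds


class FailureWindow:
    """Tracks probe failures inside a sliding time window."""

    def __init__(self, max_failures, window_s):
        self.max_failures = max_failures
        self.window_s = window_s
        self._stamps = deque()

    def record_failure(self, now):
        self._stamps.append(now)

    def should_restart(self, now):
        # Drop failures that have aged out of the window.
        while self._stamps and now - self._stamps[0] > self.window_s:
            self._stamps.popleft()
        return len(self._stamps) >= self.max_failures

    def reset(self):
        self._stamps.clear()


def docker_responds(timeout_s=PROBE_TIMEOUT_S):
    """Return True if `docker info` exits cleanly within the timeout."""
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def restart_docker():
    # Assumption: dockerd runs as a Windows service named "docker".
    # Restarting it takes the running containers down as well.
    subprocess.run(
        ["powershell", "-NoProfile", "-Command",
         "Restart-Service -Name docker -Force"],
        check=True,
    )


def watch(poll_s=60):
    """Probe forever; restart dockerd after repeated failures."""
    window = FailureWindow(MAX_FAILURES, WINDOW_S)
    while True:
        now = time.monotonic()
        if not docker_responds():
            window.record_failure(now)
            if window.should_restart(now):
                restart_docker()
                window.reset()
        time.sleep(poll_s)
```

Calling `watch()` from an elevated process starts the loop. The sliding window is there so that a single slow probe does not bounce every container on the node; only several failures close together trigger the restart.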
Our plan is to move to Windows Server 2022 which, IIRC, gets us the latest dockerd code. If we still see symptoms, the plan is to dump dockerd prior to restarting and try to understand why it's becoming unresponsive. I think it will likely result in a bug being filed against dockerd.
If the SF team doesn't ack this issue, feel free to @ me here in a couple months for an update on what we found.
@colathro thanks for your comment, glad to know it's not just us. Could you elaborate on how you 'automatically restart dockerd'? We've tried to kill it manually from task manager and it just says 'Access denied'.
@levimatheri
The service that restarts the dockerd process runs with sufficient permissions to do so. That same service does a whole bunch of other disk/networking manipulation, so it's likely running as something close to Administrator.
@colathro What version of 2019-Datacenter-with-Containers are you on? We just had an auto OS upgrade recently which bumped us to 17763.3165.220706. The docker errors seem to have stopped showing up, but we're not sure if it has to do with the OS upgrade.
@colathro Never mind, spoke too soon 😅 It happened again.
Please file a support case if you are still facing this issue: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-support#create-an-azure-support-request
Also be aware of the following impact for customers using -with-containers images: https://github.com/Azure/Service-Fabric-Troubleshooting-Guides/blob/master/Deployment/Mirantis-Guidance.md