
[BUG] - dockerd process randomly stops working preventing Service Fabric from running container applications

Open levimatheri opened this issue 2 years ago • 5 comments

Describe the bug: We are experiencing an issue with Service Fabric containers on one of our nodes. During an application upgrade, we got the errors below. The dockerd daemon shows as running in Task Manager, but docker commands like 'docker info' fail with 'ERROR: Cannot connect to the Docker daemon at npipe:////./pipe/docker_engine. Is the docker daemon running?' The issue never resolves on its own. We tried to kill dockerd.exe but got 'Access denied'. This had been working fine until now.

FinishDeactivateContainer: Error=0x80131500, Message=System.Fabric.FabricException (-2146233088) Deactivation failed for ContainerName='sf-150-63aa1a55-8c98-415d-bd0c-b0f51ac8e306_9c1a4270-c552-4608-be56-dfa20ec572be'. System.Fabric.FabricException (-2147017729) Docker operation timed out. RequestUri=http://docker_engine/containers/sf-150-63aa1a55-8c98-415d-bd0c-b0f51ac8e306_9c1a4270-c552-4608-be56-dfa20ec572be/kill

FinishDeactivateContainer: Error=0x80131500, Message=System.Fabric.FabricException (-2146233088) Deactivation failed for ContainerName='sf-157-b546cfed-8d72-4729-bcdf-1dc0c97af198_b20a5bc3-aa63-4bad-9080-427441c3f8f4'. System.Fabric.FabricException (-2147017729) Failed to connect to DockerService at named pipe 'docker_engine'. IsDockerServiceManagedBySF=TrueOrignalException=System.IO.IOException: The semaphore timeout period has expired.

at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath) at System.IO.Pipes.NamedPipeClientStrea

Any ideas?

Area/Component: Container hosting

To Reproduce: This happened randomly; we have no way of reproducing it.

Expected behavior: Containers should run reliably, and dockerd should not stop working indefinitely.

Observed behavior: dockerd shows as running, but no docker commands work, so Service Fabric container applications don't run.
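The symptom described above (a process that is listed as running but does not answer) can be detected with a probe that runs a docker command under a timeout. The following is a minimal sketch, not something from the issue itself: the function name `daemon_responds` and the 10-second default timeout are assumptions chosen for illustration, and it only relies on the `docker info` command already mentioned in the report.

```python
import subprocess


def daemon_responds(cmd=("docker", "info"), timeout_s=10.0):
    """Return True if cmd exits with status 0 within timeout_s seconds.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hang (timeout) or a missing executable both count as "not responding".
        return False
    return result.returncode == 0
```

Calling `daemon_responds()` on an affected node would be expected to return `False` while the daemon is unresponsive, whether the command fails quickly with the "Cannot connect" error or hangs until the timeout.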

Service Fabric Runtime Version: 9.0.1028.9590

Environment:

  • Azure
  • OS: Windows 2019 Datacenter With Containers

Assignees: /cc @microsoft/service-fabric-triage @craftyhouse @sukanyamsft

levimatheri avatar Jul 22 '22 03:07 levimatheri

We see the same thing running 2019 Datacenter Core with Containers in our SF clusters. However, we are using dockerd directly and not via the Service Fabric containers integration, so I don't think this is something specific to SF (but maybe?).

Right now, we mitigate the issue by automatically restarting dockerd when we see too many semaphore timeout exceptions. This is painful because restarting also takes the containers down with it.
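The comment does not say how "too many" is decided, so the following is only a sketch of one common approach: count errors inside a sliding time window and signal a restart once a threshold is reached. The class name `RestartTrigger`, the threshold of 5 errors, and the 60-second window are all assumptions for illustration. The restart itself is deliberately left to the caller, since it needs a process with sufficient privileges.

```python
import time
from collections import deque


class RestartTrigger:
    """Signal a restart when too many errors land inside a sliding time window.

    Hypothetical helper: the name, threshold, and window are assumptions.
    """

    def __init__(self, max_errors=5, window_s=60.0, clock=time.monotonic):
        self.max_errors = max_errors
        self.window_s = window_s
        self.clock = clock  # injectable so the logic can be tested without waiting
        self._stamps = deque()

    def record_error(self):
        """Record one error; return True if a restart should be triggered now."""
        now = self.clock()
        self._stamps.append(now)
        # Drop errors that have aged out of the window.
        while self._stamps and now - self._stamps[0] > self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_errors:
            # Start counting from scratch after signalling a restart.
            self._stamps.clear()
            return True
        return False
```

A caller would invoke `record_error()` each time a semaphore timeout is observed and restart dockerd only when it returns `True`, which avoids taking containers down for a single transient failure.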

Our plan is to move to Windows Server 2022 which, IIRC, gets us the latest dockerd code. If we still see symptoms, the plan is to capture a dump of dockerd prior to restarting and try to understand why it's becoming unresponsive. I think it will likely result in a bug filed against dockerd.

If the SF team doesn't ack this issue, feel free to @ me here in a couple months for an update on what we found.

colathro avatar Jul 28 '22 16:07 colathro

@colathro thanks for your comment, glad to know it's not just us. Could you elaborate on how you 'automatically restart dockerd'? We've tried to kill it manually from task manager and it just says 'Access denied'.

levimatheri avatar Jul 28 '22 17:07 levimatheri

@levimatheri

The service that restarts the dockerd process runs with permissions that allow it to do so. That same service does a whole bunch of other disk/networking manipulation, so it's likely somewhat close to Administrator.

colathro avatar Jul 28 '22 17:07 colathro

@colathro What version of 2019-Datacenter-with-Containers are you on? We just had an auto OS upgrade recently which bumped us to 17763.3165.220706. The docker errors seem to have stopped showing up, but we're not sure if it has to do with the OS upgrade.

levimatheri avatar Aug 09 '22 16:08 levimatheri

@colathro Never mind, spoke too soon 😅 It happened again.

levimatheri avatar Aug 12 '22 03:08 levimatheri

Please file a support case if you are still facing this issue: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-support#create-an-azure-support-request

Also be aware of the following impact for customers using -with-containers images: https://github.com/Azure/Service-Fabric-Troubleshooting-Guides/blob/master/Deployment/Mirantis-Guidance.md

craftyhouse avatar Dec 07 '22 18:12 craftyhouse