[Workflow] gRPC connection to workflow runtime doesn't self-heal when app restarts

Open olitomlinson opened this issue 1 year ago • 15 comments

cc @philliphoff

Runtime 1.13.2 (I have not tried any other versions)

Expected Behavior

The gRPC connection to the workflow runtime should be reestablished after the app process (not the Dapr process) crashes and is restarted.

Actual Behavior

The gRPC connection to the workflow runtime is not reestablished after the app process (not the Dapr process) crashes and is restarted.

Steps to Reproduce the Problem

Pull down my repro from https://github.com/olitomlinson/dapr-workflow-examples

  1. run docker compose -f compose-1-instance-3-schedulers.yml build
  2. run docker compose -f compose-1-instance-3-schedulers.yml up
  3. stop the app container in compose - it will be named something like workflow-app-a-1
  4. start the app container in compose
  5. observe the logs in workflow-app-a-1 and you will see the following error repeating forever:

The gRPC server for Durable Task gRPC worker is unavailable. Will continue retrying.

Release Note

RELEASE NOTE:

olitomlinson avatar May 22 '24 18:05 olitomlinson

This may have been fixed already in 1.14 as part of pulling in some fixes in durabletask-go. @olitomlinson are you able to verify?

cgillum avatar Sep 12 '24 00:09 cgillum

> This may have been fixed already in 1.14 as part of pulling in some fixes in durabletask-go. @olitomlinson are you able to verify?

Still an issue in 1.14.4

olitomlinson avatar Sep 17 '24 22:09 olitomlinson

I find this confusing. For the go-sdk I made the client retry the worker connection to Dapr indefinitely, and I think every SDK should behave that way. I believe Python already does.

famarting avatar Oct 10 '24 11:10 famarting

@olitomlinson Do you know if this is still an issue with the 1.15 RC?

WhitWaldo avatar Jan 28 '25 17:01 WhitWaldo

> @olitomlinson Do you know if this is still an issue with the 1.15 RC?

@WhitWaldo yes, just tested on 1.15.0-rc.7, and it still exhibits the same behavior :(

olitomlinson avatar Jan 29 '25 22:01 olitomlinson

@WhitWaldo is this still in progress & if so, can you provide an update?

cicoyle avatar Jun 17 '25 16:06 cicoyle

@cicoyle I've been investigating options for this locally. While I've made some progress in building out richer debugging tooling and logs to tackle the similar reported issue, I have not yet identified a solid path forward. WIP.

WhitWaldo avatar Jun 17 '25 16:06 WhitWaldo

This is still a problem on 1.15.6-rc.5 / dotnet sdk 1.16.0-rc03

The gRPC server for Durable Task gRPC worker is unavailable. Will continue retrying.

olitomlinson avatar Jun 23 '25 18:06 olitomlinson

Still not self-healing in runtime 1.16.0-rc.2 / dotnet sdk 1.16.0-rc05

This could be a real problem in the wild IMO. What happens if, in a Kubernetes deployment, the app container crashes and is subsequently restarted by the kubelet? The connection would not self-heal until something triggers a restart of the Dapr container, leaving a period of time in which the pod is not advancing any workflows.

olitomlinson avatar Aug 04 '25 22:08 olitomlinson

Still not self-healing in runtime 1.16.0-rc.25 / dotnet sdk 1.16.0-rc15

olitomlinson avatar Aug 29 '25 20:08 olitomlinson