[Workflow] GRPC connection to workflow runtime doesn't self-heal when app restarts
cc @philliphoff
runtime 1.13.2 (not tried any other versions)
Expected Behavior
The grpc connection to the workflow runtime will reestablish after the app process (not dapr process) crashes and is restarted.
Actual Behavior
The grpc connection to the workflow runtime does not reestablish after the app process (not dapr process) crashes and is restarted.
Steps to Reproduce the Problem
Pull down my repro here https://github.com/olitomlinson/dapr-workflow-examples
- run `docker compose -f compose-1-instance-3-schedulers.yml build`
- run `docker compose -f compose-1-instance-3-schedulers.yml up`
- stop the `app` container in compose - it will be named something like `workflow-app-a-1`
- start the `app` container in compose
- observe the logs in `workflow-app-a-1` and you will see the following error repeating forever:
The gRPC server for Durable Task gRPC worker is unavailable. Will continue retrying.
Release Note
RELEASE NOTE:
This may have been fixed already in 1.14 as part of pulling in some fixes in durabletask-go. @olitomlinson are you able to verify?
Still an issue in 1.14.4
I find this confusing. For the go-sdk I made the client retry the worker connection to dapr indefinitely, and I think we should have that behavior in every SDK. I believe Python already has it.
@olitomlinson Do you know if this is still an issue with the 1.15 RC?
@WhitWaldo yes, just tested on 1.15.0-rc.7, and it still exhibits the same behavior :(
@WhitWaldo is this still in progress & if so, can you provide an update?
@cicoyle I've been investigating options for this locally, and while I've made some progress building out richer debugging tooling and logs to tackle the similar reported issue, I have not yet identified a solid path forward. WIP.
This is still a problem on 1.15.6-rc.5 / dotnet sdk 1.16.0-rc03
The gRPC server for Durable Task gRPC worker is unavailable. Will continue retrying.
Still not self-healing in runtime 1.16.0-rc.2 / dotnet sdk 1.16.0-rc05
This could be a real problem in the wild IMO -- What happens if in a kubernetes deployment the app container crashes, and is subsequently restarted (as per the kubelet)? This would not self-heal until something triggers a restart of the dapr container, leading to a period of time where the pod is just not advancing any workflows forward.
Still not self-healing in runtime 1.16.0-rc.25 / dotnet sdk 1.16.0-rc15