Documentation Needed -- Guidance for Business Continuity and Disaster Recovery for Dapr Workflows

Open georgestevens99 opened this issue 1 year ago • 1 comments

In what area(s)?

/area runtime

/area operator

/area placement

/area docs

/area test-and-release

Describe the feature

This was extracted from Issue 8590 on 3/23/25.

Essentially, manually resurrecting crashed workflows is just one possible activity in a disaster recovery scenario involving Dapr Workflows.

A disaster is an event that causes a system and/or its state stores to stop working and become inoperable. The remedy is to carry out disaster recovery activities, including failover (possibly automatic) to another geographic location or to other systems.

It would be most helpful to have some Dapr Docs guidelines for BCDR (Business Continuity and Disaster Recovery). Specifically, the guidelines would cover scenarios for Dapr Workflow Disaster Recovery and give an overview of approaches, including the following topics:

o Approaches for Failover of Workflow State, Failover of Application State, and other high-level concepts.

o Designing Dapr Workflows to enable speedy Disaster Recovery: things to do, and things not to do.

o How to edit the Workflow State Store to enable Disaster Recovery.

Finally, note that a number of documents and videos already exist on how to use Dapr's resiliency-related features. Those features help prevent crashes, and they are very rich and well tested.

But sometimes things happen that overwhelm or neutralize even the best resiliency features, such as data center outages and communications outages that prevent failover to other data centers. Such outages are typically caused by weather emergencies or human error. Take a look at the outage logs of Azure, AWS, or Google to see how common major outages are. These are the situations in which robust designs and good documentation (as outlined above) have the potential to significantly reduce the time it takes to recover the affected systems and get them back in operation. That is the basic intent of this issue and of the related Disaster Recovery topics above.

Googling "Dapr Disaster Recovery" currently yields few, if any, useful results specifically concerning Dapr. It would be great if an overview article in Dapr Docs, named something like "Business Continuity and Disaster Recovery for Dapr", were to show up at the top of this search.

Release Note

RELEASE NOTE:

georgestevens99, Mar 24 '25 00:03

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

dapr-bot, Jun 22 '25 01:06