Failure Scenarios

Open · lautarodragan opened this issue 7 years ago · 1 comment

List and make tickets for the different failure scenarios.

Examples:

  • The application process is force-killed (e.g. SIGKILL), so the application doesn't receive a kill event that would let it finish what it's doing.
  • RabbitMQ goes down
  • MongoDB goes down
  • IPFS goes down
  • Po.et Node comes back up, but the overall state is unknown, since messages could have been lost. Review discrepancies in the information and run a self-healing process?
  • Po.et Node should stay up and running if a service it depends on goes down - how?
  • The Bitcoin blockchain is unavailable, so reading from and writing to it is impossible. The node should report this but stay alive, and keep retrying until it can sync.
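For the scenarios where a dependency (RabbitMQ, MongoDB, IPFS, or the Bitcoin node) is unavailable, one way to "report on it but stay alive, and keep retrying" is to wrap each connection attempt in a retry loop with capped exponential backoff. This is a minimal sketch, not Po.et code: `retryWithBackoff`, `backoffDelay`, and `RetryOptions` are hypothetical names, and `console.error` stands in for whatever reporting the node actually uses.

```typescript
// Hypothetical helper, not part of Po.et Node: retry an async operation with
// capped exponential backoff, reporting each failure instead of crashing.

interface RetryOptions {
  baseDelayMs: number;  // delay after the first failure
  maxDelayMs: number;   // upper bound for any single delay
  maxAttempts?: number; // omit to retry forever ("keep retrying until it can sync")
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at max.
function backoffDelay(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  options: RetryOptions,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  const maxAttempts = options.maxAttempts ?? Infinity;
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Report the failure but keep the process alive.
      console.error(`attempt ${attempt + 1} failed:`, error);
      if (attempt + 1 >= maxAttempts) throw error; // give up only if a limit was set
      await sleep(backoffDelay(attempt, options.baseDelayMs, options.maxDelayMs));
    }
  }
}
```

For example, `retryWithBackoff(() => connectToBitcoinNode(), { baseDelayMs: 1000, maxDelayMs: 60000 })` would retry indefinitely, waiting at most a minute between attempts; `connectToBitcoinNode` is a placeholder for the real client call. A production version would usually also add random jitter so that many instances don't retry in lockstep.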

lautarodragan, Nov 24 '17 15:11

If everything is in a queue, I recommend just force-crashing on any error. When the process (or a worker, if using cluster) restarts, the message can be reprocessed. Joyent talks a lot about crash-oriented design on their site.
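Crashing is safe here only because a message is acknowledged after it has been fully processed: with manual acknowledgements, RabbitMQ requeues any unacknowledged deliveries when the consumer's channel or connection closes, so a restarted worker picks the message up again. The sketch below simulates that with a hypothetical in-memory queue so it runs standalone; `InMemoryQueue`, `requeueUnacked`, and `processNext` are illustrative names, not amqplib or servicebus APIs. Because a message can be handled more than once, handlers need to be idempotent.

```typescript
// Hypothetical in-memory stand-in for a broker with manual acknowledgements.
// It mimics only the behavior that matters here: unacked messages are
// requeued when the consumer goes away.

interface Message {
  id: number;
  body: string;
}

class InMemoryQueue {
  private ready: Message[] = [];
  private unacked = new Map<number, Message>();
  private nextId = 1;

  publish(body: string): void {
    this.ready.push({ id: this.nextId++, body });
  }

  // Hand a message to a consumer; it stays "in flight" until acked.
  deliver(): Message | undefined {
    const message = this.ready.shift();
    if (message) this.unacked.set(message.id, message);
    return message;
  }

  ack(id: number): void {
    this.unacked.delete(id);
  }

  // What the broker does when a consumer dies: put in-flight messages back in front.
  requeueUnacked(): void {
    const inFlight: Message[] = [];
    this.unacked.forEach((message) => inFlight.push(message));
    this.ready = inFlight.concat(this.ready);
    this.unacked.clear();
  }

  get size(): number {
    return this.ready.length + this.unacked.size;
  }
}

// Process one message and ack only after the handler succeeds.
// Any error propagates (the "crash"), leaving the message unacked.
function processNext(queue: InMemoryQueue, handler: (body: string) => void): boolean {
  const message = queue.deliver();
  if (!message) return false;
  handler(message.body); // throws => no ack
  queue.ack(message.id);
  return true;
}
```

In a real worker there is no `try`/`catch` around `processNext`: the process simply exits, the broker notices the dropped connection and requeues, and the cluster master or process supervisor starts a fresh worker.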

servicebus handles this abstraction on top of RabbitMQ, including setting up error queues, and also integrates with Redis to cache the messages to be retried.
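The general pattern behind retries plus error queues can be sketched as follows. This is only an illustration of the idea, not servicebus's API or internals: attempt counts are kept per message in a shared store (Redis in the setup described above, a plain `Map` here), a failing message is retried up to a limit, and after that it is moved to an error queue for inspection instead of being redelivered forever. `RetryTracker`, `handle`, and `maxRetries` are hypothetical names.

```typescript
// Hypothetical sketch of "retry N times, then dead-letter"; not the servicebus API.
// The Map stands in for a shared store such as Redis, so that attempt counts
// would survive redeliveries and worker restarts in a real deployment.

type Outcome = "processed" | "retry" | "dead-lettered";

class RetryTracker {
  private attempts = new Map<string, number>();
  readonly errorQueue: string[] = [];
  private readonly maxRetries: number;

  constructor(maxRetries: number) {
    this.maxRetries = maxRetries; // retries allowed after the first failure
  }

  // Run the handler for one delivery of `messageId` and decide what happens next.
  handle(messageId: string, handler: () => void): Outcome {
    try {
      handler();
      this.attempts.delete(messageId); // success: forget the count
      return "processed";
    } catch {
      const count = (this.attempts.get(messageId) ?? 0) + 1;
      if (count > this.maxRetries) {
        this.attempts.delete(messageId);
        this.errorQueue.push(messageId); // park it for manual inspection
        return "dead-lettered";
      }
      this.attempts.set(messageId, count);
      return "retry"; // caller rejects/requeues the message for redelivery
    }
  }
}
```

With `maxRetries = 2`, a message that keeps failing is attempted three times in total (the first delivery plus two retries) before it lands in the error queue.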

patrickleet, Dec 31 '17 10:12