dfuse-eosio

When the connection to etcd is broken, dfuse search fails to update its connection

Open matthewdarwin opened this issue 4 years ago • 0 comments

When the connection to etcd is broken and etcd is replaced by a different instance, dfuse search fails to update its connection and stays broken. Also, the health check still reports "healthy", so detecting this situation through monitoring is difficult.

One possible solution:

Add a mechanism that detects that the gRPC connection to etcd is broken, then simply exits so the process gets restarted by k8s, systemd, or whatever supervises it.

Scenario is probably something like this:

  1. archive A tells etcd that it serves blocks 1000->2000 (BUT THAT ETCD IS GONE, REPLACED BY NEW REBUILT CLUSTER !!!)
  2. router checks etcd, reads this and sends a query to archive A down to block 1000 (BUT THAT ETCD IS GONE, SO NO UPDATES !!!)
  3. archive A says: hey I don't have block 1000, my lowest block is 1100 ("I TRIED TO TELL YOU VIA ETCD BUT MY UPDATE IS STALLED")
  4. We manually restart the router and the archives
  5. They connect to the new etcd and everything works again

matthewdarwin, Oct 06 '20 16:10