Akka.Cluster.Discovery

It takes Cluster Singleton 1 minute to move to another node

Open vasily-kirichenko opened this issue 6 years ago • 1 comment

  1. Consul discovery with the following settings:
akka.cluster {
  discovery {
    provider = akka.cluster.discovery.consul
    consul {
      listener-url = "http://127.0.0.1:8500"
      class = "Akka.Cluster.Discovery.Consul.ConsulDiscoveryService, Akka.Cluster.Discovery.Consul"
      dispatcher = "consul-dispatcher"
      alive-interval = 10s
      alive-timeout = 1m
      refresh-interval = 1m
      join-retries = 3
      lock-retry-interval = 250ms
      datacenter = "dc"
      token = ""
      wait-time = 30s
    }
  }
}
  2. A three-node cluster, with a singleton running on one of the nodes.
  3. Kill the node on which the singleton is running.
  4. A new singleton is launched only after a ~1 minute delay, which is unacceptable; the docs promise that it should take a few seconds at most.

vasily-kirichenko, Mar 27 '18 09:03

Cluster singleton migration depends on how quickly a down node is detected. If a node is merely unreachable, we cannot assume it is dead: it may be a temporary network issue, and we don't want to end up with two singletons. Therefore we need to determine whether the node is actually down:

  • In the graceful scenario it's fast, since the node being shut down can announce this to the others.
  • In a hard failure it's slow, since the rest of the cluster must work out whether the node is actually dead or has just disconnected for some reason and will come back shortly. This takes time.

The docs probably refer to the time required to migrate the singleton once a down node has been detected. With Consul cluster discovery, you can experiment with the alive-timeout and refresh-interval settings to try to shorten that time frame. However, if I'm right, Consul itself needs at least 30-60s to detect an unhealthy node.
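
As a rough sketch of that tuning, the fragment below overrides only the timing keys from the configuration posted in the issue. The values are illustrative assumptions, not tested recommendations. It also assumes that alive-interval has to stay below alive-timeout so that keep-alive signals arrive before the timeout expires. All other keys keep their original values.

```hocon
akka.cluster.discovery.consul {
  # Illustrative values only (assumption): shorter than the 10s / 1m / 1m
  # used in the original report.
  alive-interval = 5s      # assumed to need to stay well below alive-timeout
  alive-timeout = 15s      # was 1m
  refresh-interval = 10s   # was 1m
}
```

Even with values like these, the total failover time may still be dominated by how long Consul itself takes to mark the node as unhealthy, and overly aggressive timeouts could cause a slow but healthy node to be treated as dead.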

Horusiath, Mar 27 '18 23:03