
Classic Akka Remoting limitations (things to consider)

Open ismaelhamed opened this issue 4 years ago • 3 comments

So far, we've identified some worrisome limitations in the current Akka Remoting implementation. Thankfully, they don't show up often, but when they do, they are difficult to deal with:

  • The race condition described in #24654, which was never fixed. As soon as it happens, nodes get marked as TERMINATED.

  • While there are quarantined nodes present in the cluster, if you need to do some cluster management (e.g., cleaning up unreachable nodes or making a node leave for an upgrade), "something" somehow stops any LEAVE or DOWN from propagating through the cluster. You must deal with the quarantined nodes first.

    The only way we've found so far to make those quarantined nodes leave the cluster gracefully is to shut down the .NET processes and let Coordinated Shutdown do its magic (a LEAVE from within).

  • In classic remoting, when two nodes quarantine each other, no messages will go through between them anymore, which makes it impossible to deal with the situation in an automated way (leveraging the SBR, for instance).

    There's a new feature in the SBR that allows quarantined nodes to shut themselves down automatically, but it is Artery-only.
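For reference, the "LEAVE from within" workaround in the second point depends on Coordinated Shutdown being hooked into process exit. The fragment below is only a sketch of the settings that, to my knowledge, control this in Akka.NET's reference configuration. The values shown are assumptions for illustration, so check them against the reference config of the version you run.

```hocon
akka {
  coordinated-shutdown {
    # Assumed setting: run Coordinated Shutdown (including the
    # cluster-leave phases) when the .NET process is asked to exit.
    run-by-clr-shutdown-hook = on

    # Assumed setting: terminate the ActorSystem as the last phase.
    terminate-actor-system = on
  }

  cluster {
    # Assumed setting: also run Coordinated Shutdown when this node
    # learns that it has been downed by the rest of the cluster.
    run-coordinated-shutdown-when-down = on
  }
}
```

With something like this in place, stopping the process of a quarantined node should let that node issue its own LEAVE rather than relying on another node to propagate one.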

There are a few more issues related to Akka Remoting but fixing them without a major re-write and without introducing other regressions does not seem like an option, especially now that classic Akka Remoting is deprecated in favor of Artery anyway.

ismaelhamed avatar Feb 03 '21 10:02 ismaelhamed

> While there are quarantined nodes present in the cluster, if you happen to have to do some managing (i.e., cleaning some unreachable nodes or making a node leave for an upgrade) "something" somehow stops any LEAVE or DOWN from propagating through the cluster. You must deal with the quarantined nodes first.

> In classic remoting, when two nodes quarantine each other no messages will go through between them anymore. Which makes it impossible to deal with the situation in an automated way (leveraging the SBR, for instance).

Yes, this is a nuisance. I strongly agree.

The plan for v1.5 is to get an Artery implementation that should be production-ready, even though it won't be the default yet (we need to give users some time to cut over from classic Akka.Remote). However, we might be able to backport that SBR feature to cover classic remoting. Maybe that's worth exploring?

Aaronontheweb avatar Feb 03 '21 15:02 Aaronontheweb

AFAIK, the SBR downing of quarantined nodes cannot be done in classic remoting because of the third point I made above. Take a look at this comment:

https://github.com/johanandren/akka/blob/0dc3a0357e79c3910e2f38c4579177316bada2d6/akka-cluster/src/multi-jvm/scala/akka/cluster/DowningWhenOtherHasQuarantinedThisActorSystemSpec.scala#L59

ismaelhamed avatar Feb 04 '21 07:02 ismaelhamed

@ismaelhamed there's an old issue we never got around to implementing: https://github.com/akkadotnet/akka.net/issues/3440

What do you think about adding an option to disable quarantines altogether?

Aaronontheweb avatar Feb 05 '21 22:02 Aaronontheweb