vertx-ignite
__vertx.subs has stale entries for downed nodes after ungraceful shutdowns
This is a continuation of https://github.com/vert-x3/vertx-ignite/issues/94, as it's hitting us in production. Looking into it, the culprit seems to be the handling of a node-leave event: the subscriptions are cleared by the cluster member that gets to delete the node info from the cache. That can take a little while, during which the member handling the removal (essentially the leader for this event) could crash or be shut down, leaving the subs map in a bad state.
I think the solution here is to add a status field to `IgniteNodeInfo` with values `{STARTED, STOPPING}` and, on a node-leave event, mark the entry for that node as stopping instead of deleting it right away. Whoever gets to update that entry gets the first chance to clear the subs. There would also be a background executor service that polls for stopping members and finishes any cleanup that was left incomplete, maybe synchronized by a semaphore keyed to the ID of the removed member.
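To make the proposal concrete, here is a rough sketch of the flow I have in mind. It is not vertx-ignite code: `StaleSubsCleanupSketch`, `NodeEntry`, `markStopping`, `cleanUp` and `sweep` are hypothetical names, and plain `ConcurrentHashMap`s stand in for the Ignite node-info cache and the `__vertx.subs` map. The per-member semaphore is not modeled; the sketch relies on the cleanup being idempotent instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed STARTED/STOPPING flow; not the real vertx-ignite classes.
final class StaleSubsCleanupSketch {

  enum Status { STARTED, STOPPING }

  // Stand-in for IgniteNodeInfo carrying the proposed status field.
  static final class NodeEntry {
    final String nodeId;
    final Status status;

    NodeEntry(String nodeId, Status status) {
      this.nodeId = nodeId;
      this.status = status;
    }
  }

  // Stand-ins for the cluster-wide caches (assumption: the real ones are Ignite caches).
  final Map<String, NodeEntry> nodeInfo = new ConcurrentHashMap<>();
  final Map<String, Set<String>> subs = new ConcurrentHashMap<>(); // address -> subscribed node ids

  // Registers a node and its subscriptions, as a healthy member would on startup.
  void join(String nodeId, String... addresses) {
    nodeInfo.put(nodeId, new NodeEntry(nodeId, Status.STARTED));
    for (String address : addresses) {
      subs.computeIfAbsent(address, a -> ConcurrentHashMap.<String>newKeySet()).add(nodeId);
    }
  }

  // Called by every surviving member on a node-leave event.
  // Only the caller whose atomic replace succeeds gets the first chance to clean up.
  boolean markStopping(String nodeId) {
    NodeEntry current = nodeInfo.get(nodeId);
    if (current == null || current.status != Status.STARTED) {
      return false;
    }
    return nodeInfo.replace(nodeId, current, new NodeEntry(nodeId, Status.STOPPING));
  }

  // Idempotent cleanup: clear the subs first and delete the node entry last,
  // so a crash in the middle still leaves a STOPPING marker for the sweeper.
  void cleanUp(String nodeId) {
    for (Set<String> nodes : subs.values()) {
      nodes.remove(nodeId);
    }
    nodeInfo.remove(nodeId);
  }

  // One pass of the background poller: finishes cleanup for every STOPPING entry
  // and returns the node ids it handled.
  List<String> sweep() {
    List<String> cleaned = new ArrayList<>();
    for (NodeEntry entry : nodeInfo.values()) {
      if (entry.status == Status.STOPPING) {
        cleanUp(entry.nodeId);
        cleaned.add(entry.nodeId);
      }
    }
    return cleaned;
  }
}
```

The member that wins `markStopping` would call `cleanUp` immediately. If it dies before finishing, the entry is still there with status `STOPPING`, so a later `sweep` on any other member picks it up. In the real implementation `sweep` would run from the background executor, and the semaphore would keep several members from sweeping the same node at once.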
Does this seem like a reasonable solution to the problem?
Version
4.5.7