__vertx.subs has stale entries for downed nodes after ungraceful shutdowns

Open dspangen opened this issue 6 months ago • 11 comments

This is a continuation of https://github.com/vert-x3/vertx-ignite/issues/94, as it's hitting us in production. Looking into it, the culprit seems to be the handling of a node leave event: the subscriptions are cleared by the cluster member that gets to delete the node info from the cache. But that can take a little while, during which the member handling the removal (essentially the leader for this event) could crash or be shut down, leaving the subs map in a bad state.

I think the solution here is to add a status field to IgniteNodeInfo with the values {STARTED, STOPPING} and mark the entry for a departed node as STOPPING. Then, whoever gets to update that entry gets the first chance to clear the subs. There would also be a background executor service that polls for stopping members, maybe synchronized by a semaphore keyed by the removed member's ID.
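Here is a rough, in-memory sketch of the idea, not actual vertx-ignite code. It assumes plain `ConcurrentHashMap`s standing in for the node-info cache and `__vertx.subs`; `NodeEntry`, `markStopping`, `cleanUp`, and `sweep` are hypothetical names made up for illustration. A real implementation would need the equivalent atomic replace on the cluster cache, and the sweep would run on a scheduled executor, possibly behind a per-node lock.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// In-memory model only: the maps stand in for the cluster caches.
final class SubsCleanupSketch {
  enum Status { STARTED, STOPPING }

  // Hypothetical stand-in for IgniteNodeInfo with the proposed status field.
  static final class NodeEntry {
    final String host;
    final Status status;
    NodeEntry(String host, Status status) { this.host = host; this.status = status; }
  }

  final Map<String, NodeEntry> nodeInfo = new ConcurrentHashMap<>();
  // address -> ids of nodes subscribed to it (stand-in for __vertx.subs)
  final Map<String, Set<String>> subs = new ConcurrentHashMap<>();

  void join(String nodeId, String host) {
    nodeInfo.put(nodeId, new NodeEntry(host, Status.STARTED));
  }

  void subscribe(String address, String nodeId) {
    subs.computeIfAbsent(address, a -> ConcurrentHashMap.newKeySet()).add(nodeId);
  }

  // Atomically flips STARTED -> STOPPING. Only one caller can win, because
  // replace(key, expected, updated) succeeds only while the entry is unchanged.
  boolean markStopping(String nodeId) {
    NodeEntry current = nodeInfo.get(nodeId);
    if (current == null || current.status != Status.STARTED) {
      return false;
    }
    return nodeInfo.replace(nodeId, current, new NodeEntry(current.host, Status.STOPPING));
  }

  // Idempotent: safe to repeat if an earlier attempt died halfway through.
  // The node entry is removed last, so a partial cleanup stays visible to sweep().
  void cleanUp(String nodeId) {
    subs.values().forEach(nodes -> nodes.remove(nodeId));
    nodeInfo.remove(nodeId);
  }

  // Called by every member on a node-left event; the winner clears the subs.
  boolean onNodeLeft(String nodeId) {
    boolean won = markStopping(nodeId);
    if (won) {
      cleanUp(nodeId);
    }
    return won;
  }

  // Background pass: finish cleanup for entries left in STOPPING by a member
  // that crashed after marking. Returns the number of nodes cleaned up.
  int sweep() {
    int cleaned = 0;
    for (Map.Entry<String, NodeEntry> e : nodeInfo.entrySet()) {
      if (e.getValue().status == Status.STOPPING) {
        cleanUp(e.getKey());
        cleaned++;
      }
    }
    return cleaned;
  }
}
```

The point of removing the node entry last is that a crash at any step leaves the entry in STOPPING, so the next `sweep()` on any surviving member can pick it up and finish the job.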

Does this seem like a reasonable solution to the problem?

Version

4.5.7

dspangen, Jul 30 '24 21:07