Race condition with node joins

Open LetThereBeDwight opened this issue 5 months ago • 1 comments

I've only looked at this for the ReplicatedCache adapter (it might be the only place where it really matters). We're trying to use it, and during the sync process we saw that syncs weren't occurring and new nodes weren't seeing the other nodes. Putting a 500ms sleep in the init process of the ReplicatedCache bootstrap showed us that the new nodes did (eventually) become visible.

We moved the caches further down the supervisor child list, after the libcluster config/child, but without the sleep this had no effect.

We also tried explicitly starting the children after the initial supervisor start; again, without the sleep this had no effect. Following the same pattern, we moved the sleep out of the bootstrap and into our application, before adding the children, and the bootstrap was then able to see all the nodes and sync properly.
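For illustration, the fixed sleep in the application could be replaced by polling until other nodes are visible. This is only a minimal sketch: `ClusterWait` and `await_nodes/2` are hypothetical names, not part of Nebulex or libcluster, and it assumes that seeing at least `min_nodes` entries in `Node.list/0` is a good enough signal that the join has happened.

```elixir
defmodule ClusterWait do
  @moduledoc false
  # Hypothetical helper for illustration; not part of Nebulex or libcluster.

  # Milliseconds to wait between checks of the connected node list.
  @poll_interval 50

  # Blocks until at least `min_nodes` other nodes are connected,
  # or returns {:error, :timeout} once `timeout` milliseconds have elapsed.
  def await_nodes(min_nodes, timeout \\ 5_000) do
    deadline = System.monotonic_time(:millisecond) + timeout
    do_await(min_nodes, deadline)
  end

  defp do_await(min_nodes, deadline) do
    cond do
      length(Node.list()) >= min_nodes ->
        :ok

      System.monotonic_time(:millisecond) >= deadline ->
        {:error, :timeout}

      true ->
        Process.sleep(@poll_interval)
        do_await(min_nodes, deadline)
    end
  end
end
```

The application would call `ClusterWait.await_nodes(1)` after the supervisor (including the libcluster child) has started and before adding the cache children with `Supervisor.start_child/2`. It still only bounds the wait rather than confirming that the join process has fully completed.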

Any thoughts on how to handle this better, or on whether something can be done while bootstrapping the ReplicatedCache adapter to ensure that libcluster's join process has completed? I see that https://github.com/elixir-nebulex/nebulex/pull/232 tackles a possibly similar issue, but not in a way that is ideal for allowing data syncs.

Aug 12 '25 18:08 LetThereBeDwight

Hey 👋 !! First of all, thanks for spotting this issue. And yes, I've been aware that the adapter is sometimes unstable, and it shouldn't be. I'll see if there is something I can do in the short term to fix this. Still, the plan for Nebulex v3 is to rewrite this replicated adapter using Mnesia, precisely to avoid replication and sync issues (it is currently under development). In any case, I'll spend some time on this and see what I can do on the adapter side. Thanks 🙏 !!

Aug 16 '25 07:08 cabol