prevent_overlapping_partitions in larger clusters
Describe the bug
We have a cluster of 60+ nodes, and upon upgrading to OTP 25 we hit loads of global disconnect messages. We attempted to resolve this by increasing net_setuptime, which helped to delay the issues, but ultimately they recurred.
We were able to get things running again by disabling prevent_overlapping_partitions; however, this ultimately failed as well and we had to roll back to OTP 24.
To Reproduce
This is difficult to reproduce because it's a large cluster issue, meaning it's expensive to reproduce.
Expected behavior
No disconnect messages when in a large cluster.
Affected versions
25 only; downgrading to 24 solved the issue.
Additional context
This is our post mortem from the outage, if it helps: https://theblitzapp.notion.site/Cluster-Health-Issue-02829d3cd5304baa94c4a7c4540fbc20
If you disable prevent_overlapping_partitions, you should not get any disconnect messages, and global will behave as it did by default before OTP 25. If you still have disconnect messages in the system, you have failed to disable it properly. How have you disabled it?
However, note that the prevent_overlapping_partitions feature is a bug fix. global itself depends on it in order to ensure that its internal state remains consistent. Other applications that depend on a fully connected mesh of nodes also depend on it, for example mnesia. That is, if you disable prevent_overlapping_partitions you'll have to be prepared for potential issues both with global and other applications.
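For reference, a minimal sketch of disabling the fix through the kernel application's configuration. It assumes the node is started with a sys.config file, which is an assumption about the deployment and not something stated in this thread. The same parameter can also be passed on the command line as -kernel prevent_overlapping_partitions false. Either way it has to be applied on every node in the cluster.

```erlang
%% sys.config (assumed file name; adapt to however your release passes config)
[
 {kernel, [
   %% false restores the pre-OTP 25 default behaviour of global.
   %% Omitting the parameter on OTP 25 leaves the fix enabled.
   {prevent_overlapping_partitions, false}
 ]}
].
```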
Things that you might want to try in order to keep prevent_overlapping_partitions enabled:
- Increase the distribution buffer busy size. Since you allow a larger distribution buffer, the erlang:send() call in gns_volatile_send() will less frequently return nosuspend, which in turn will cause fewer processes to be spawned. The default distribution buffer busy size is also quite small, so you might want to increase it anyway (actually quite a lot) in order to improve throughput. Increasing it may, however, increase memory usage. Note that the distribution buffer busy size does not cause a buffer of that size to be allocated. It only sets a maximum size of allocated distribution buffers per connection.
- Increase the maximum number of processes allowed.
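The two suggestions above correspond to the +zdbbl and +P emulator flags. Here is a hedged sketch of setting them and then confirming what the running node actually uses. The values are illustrative assumptions, not recommendations from this thread.

```erlang
%% In vm.args or on the erl command line (illustrative values only):
%%   +zdbbl 32768    distribution buffer busy limit, given in kilobytes
%%   +P 2000000      maximum number of simultaneously existing processes
%%
%% After restarting the node, check the effective settings from a shell:
erlang:system_info(dist_buf_busy_limit).  % effective limit, reported in bytes
erlang:system_info(process_limit).        % may be larger than the +P value given
erlang:system_info(process_count).        % processes currently alive
```

Comparing process_count against process_limit while the disconnects are happening gives a rough idea of how close the node is to the process cap.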
Thank you for all this information!
We disabled it with a kernel flag but it's possible we missed a VM during the deploy and there may have been a stray node in the cluster. We'll attempt this again.
What type of issues would potentially come up with global and other applications that we should be aware of when disabling prevent_overlapping_partitions?
We'll also try increasing the cap with these flags and see if we can manage to run our cluster.
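One way to catch a stray node that missed the setting is to ask every connected node for its value. This is only a sketch: it assumes it is run from a shell on one cluster member, that erpc is available (OTP 23 or later), and that a 5 second timeout is acceptable. Hidden nodes are not returned by nodes() and would need to be checked separately.

```erlang
%% Ask every visible node (plus the local one) for its kernel setting.
Nodes = [node() | nodes()],
Replies = erpc:multicall(Nodes, application, get_env,
                         [kernel, prevent_overlapping_partitions], 5000),
%% Keep only nodes that did NOT explicitly report false. A reply of
%% {ok, undefined} means the parameter is unset on that node, so the
%% release default applies; error tuples mean the call itself failed.
[{N, R} || {N, R} <- lists:zip(Nodes, Replies), R =/= {ok, {ok, false}}].
```

An empty list means every reachable node has the parameter explicitly set to false.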
> What type of issues would potentially come up with global and other applications we should be aware of when disabling prevent_overlapping_partitions
Regarding global: Name registration may become inconsistent, so that you get different processes registered under the same name on different nodes, and locks won't be set on nodes that are not directly reachable from the node that is locking.
Regarding mnesia, perhaps @dgud can give information?
Off the top of my head I cannot point to any other applications in OTP that depend on a fully connected mesh, but there may be other such applications.
Since global has provided the "fully connected mesh of nodes" feature for ages (even though it didn't in the presence of connection failures), it is not unlikely that there are non-OTP applications that depend on the fully connected mesh as well, and what the consequences will be for those I cannot say.
Hey @rickard-green, I brought this back to the team and there were some concerns raised.
Upon looking at our metrics, it actually shows us hitting memory caps on the machines after our locking fix, so we're concerned that doing an upgrade again will only cause us to run out of memory by increasing VM settings that will affect our memory. Even though distribution buffer busy size does increase our memory, if it's decreasing the process spawn, will that be sufficient to avoid the OOM errors?
We're also curious, in just a general sense, why upgrading OTP versions would cause us to have to change VM parameters and if that's a concern at all, given that pre OTP 25 everything works out of the box without any VM tuning on our end.
Mnesia may hang or data may be inconsistent between nodes if not all nodes can communicate. Mnesia expects that if one node can talk to another node, all nodes can send messages to that node, and that if one node goes down (or loses the network), all nodes can discover that it is lost.
> Hey @rickard-green, I brought this back to the team and there were some concerns raised.
> Upon looking at our metrics, it actually shows us hitting memory caps on the machines after our locking fix, so we're concerned that doing an upgrade again will only cause us to run out of memory by increasing VM settings that will affect our memory. Even though distribution buffer busy size does increase our memory, if it's decreasing the process spawn, will that be sufficient to avoid the OOM errors?
I cannot give you a straight answer on that. The outgoing messages that will be buffered in the distribution buffer will consume a lot less memory than buffering these messages in their own separate processes, but if that is enough is very hard to say.
> We're also curious in just a general sense why upgrading OTP versions would cause us to have to change VM parameters and if that's a concern at all, given pre OTP 25 everything works out of the box without any VM tuning on our end
In general we do not introduce incompatibilities in patches, but only in new releases (i.e. when the major OTP version changes). However, a bugfix might cause an incompatibility even in a patch if the only way to fix the bug is to introduce an incompatibility. This bugfix qualifies as such a bugfix. We did, however, decide not to enable this fix by default in patches on already existing releases. The fix is available in patches on OTP 22, 23, and 24, and can be enabled on those releases if you desire. As of OTP 25 we enabled this fix by default and flagged it as a potential incompatibility in the release notes, but it can still be disabled if the user desires that. It is hard to introduce incompatibilities any more smoothly than that.
If you are ok with the potential issues that might arise when disabling prevent_overlapping_partitions, all you have to do is to disable it on all nodes; then global will behave as before.
Closing this. Please reopen if you have more questions.
Hey @rickard-green, we've attempted an upgrade again and this time we're met with the same errors. net_setuptime has been set to 30 seconds and we've doubled the distribution buffer busy size, but these messages still occur every few minutes