prevent_overlapping_partitions in larger clusters

Open MikaAK opened this issue 1 year ago • 3 comments

Describe the bug: We have a cluster of 60+ nodes, and upon upgrading to OTP 25 we hit loads of global disconnect messages. We attempted to resolve this by increasing net_setuptime, which delayed the issues, but ultimately they recurred.

We were able to get things running again by disabling prevent_overlapping_partitions; however, this ultimately failed as well and we had to roll back to OTP 24.

To Reproduce: This is difficult to reproduce because it only shows up in a large cluster, which makes it expensive to reproduce.

Expected behavior: No disconnect messages when in a large cluster.

Affected versions: OTP 25 only; downgrading to 24 solved the issue.

Additional context: This is our post mortem from the outage, if it helps: https://theblitzapp.notion.site/Cluster-Health-Issue-02829d3cd5304baa94c4a7c4540fbc20

MikaAK avatar Aug 11 '22 00:08 MikaAK

If you disable prevent_overlapping_partitions, you should not get any disconnect messages, and global will behave as it did by default before OTP 25. If you still have disconnect messages in the system, you have failed to disable it properly. How have you disabled it?
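One way to check whether the parameter really is disabled everywhere is sketched below. It assumes the parameter was set as a kernel application parameter, either in sys.config (as in the comment) or with `-kernel prevent_overlapping_partitions false` on the command line. The module name `pop_check` is made up for illustration. The check reports the configured value on each node, which is not necessarily what global picked up when it started, and it does not cover hidden nodes because `nodes()` does not list them.

```erlang
-module(pop_check).
-export([not_disabled/0]).

%% Expected configuration on every node, for example in sys.config:
%%   [{kernel, [{prevent_overlapping_partitions, false}]}].

%% Returns the nodes that do not report the parameter as false,
%% together with what they reported (or the error from the remote call).
not_disabled() ->
    Nodes = [node() | nodes()],
    %% erpc:multicall/5 returns one result per node, in the same order.
    Results = erpc:multicall(Nodes, application, get_env,
                             [kernel, prevent_overlapping_partitions], 5000),
    [{N, R} || {N, R} <- lists:zip(Nodes, Results),
               R =/= {ok, {ok, false}}].
```

An empty list from `pop_check:not_disabled()` means every visible node reports the parameter as `false`; any entry points at a node worth inspecting.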

However, note that the prevent_overlapping_partitions feature is a bug fix. global itself depends on it in order to ensure that its internal state remains consistent. Other applications that depend on a fully connected mesh of nodes, for example mnesia, also depend on it. That is, if you disable prevent_overlapping_partitions, you'll have to be prepared for potential issues both with global and with other applications.

Things that you might want to try in order to keep prevent_overlapping_partitions enabled:

  • Increase the distribution buffer busy size. With a larger distribution buffer, the erlang:send() call in gns_volatile_send() will return nosuspend less frequently, which in turn will cause fewer processes to be spawned. The default distribution buffer busy size is also quite small, so you might want to increase it anyway (actually quite a lot) in order to improve throughput. Increasing it may, however, increase memory usage. Note that setting the distribution buffer busy size does not cause a buffer of that size to be allocated; it only sets a maximum size of allocated distribution buffers per connection.
  • Increase the maximum number of processes allowed.
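The two limits above correspond to the `+zdbbl` emulator flag (distribution buffer busy limit, given in kilobytes) and the `+P` flag (maximum number of processes). A minimal sketch for checking what a running node actually uses follows; the module name `limits` and the map keys are hypothetical.

```erlang
-module(limits).
-export([report/0]).

%% Collects the limits discussed above from the running emulator.
report() ->
    #{%% Set with +zdbbl (kilobytes); reported here in bytes.
      dist_buf_busy_limit_bytes => erlang:system_info(dist_buf_busy_limit),
      %% Set with +P.
      process_limit => erlang:system_info(process_limit),
      %% Number of processes currently alive on this node.
      process_count => erlang:system_info(process_count)}.
```

Comparing `process_count` with `process_limit` while the disconnect messages are occurring gives a rough idea of how much headroom is left.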

rickard-green avatar Aug 11 '22 13:08 rickard-green

Thank you for all this information!

We disabled it with a kernel flag but it's possible we missed a VM during the deploy and there may have been a stray node in the cluster. We'll attempt this again.

What types of issues with global and other applications should we be aware of when disabling prevent_overlapping_partitions?

We'll also try increasing the cap with these flags and see if we can manage to run our cluster

MikaAK avatar Aug 12 '22 03:08 MikaAK

What type of issues would potentially come up with global and other applications we should be aware of when disabling prevent_overlapping_partitions?

Regarding global: Name registration may become inconsistent, so that you get different processes registered under the same name on different nodes, and locks won't be set on nodes that are not directly reachable from the node that is locking.
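As a rough illustration of how such an inconsistency could be spotted, the sketch below asks every visible node what its own global name server has registered under a given name. The module name `global_check` is hypothetical, and `Name` is whatever globally registered name your system uses.

```erlang
-module(global_check).
-export([views/1]).

%% Returns, per node, what that node's global name server has registered
%% under Name: {ok, Pid}, {ok, undefined}, or an error from the remote call.
views(Name) ->
    Nodes = [node() | nodes()],
    Results = erpc:multicall(Nodes, global, whereis_name, [Name], 5000),
    lists:zip(Nodes, Results).
```

If the results contain more than one distinct pid for the same name, the nodes disagree about who owns it.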

Regarding mnesia perhaps @dgud can give information?

Off the top of my head I cannot point to any other applications in OTP that depend on a fully connected mesh, but there may be other such applications.

Since global has provided the "fully connected mesh of nodes" feature (even though it didn't in the presence of connection failures) for ages, it is not unlikely that there are other non-OTP applications that depend on the fully connected mesh as well, and what the consequences will be for those I cannot say.

rickard-green avatar Aug 12 '22 18:08 rickard-green

Hey @rickard-green, I brought this back to the team and there were some concerns raised.

Looking at our metrics, we were actually hitting memory caps on the machines after our locking fix, so we're concerned that upgrading again and raising VM settings that affect memory will only cause us to run out of memory. Increasing the distribution buffer busy size does increase memory usage, but if it also reduces the number of spawned processes, will that be sufficient to avoid the OOM errors?

We're also curious in just a general sense why upgrading OTP versions would cause us to have to change VM parameters and if that's a concern at all, given pre OTP 25 everything works out of the box without any VM tuning on our end

MikaAK avatar Aug 24 '22 21:08 MikaAK

Mnesia may hang, or data may become inconsistent between nodes, if not all nodes can communicate over the network. Mnesia expects that if one node can talk to another node, all nodes can send messages to that node, and that if one node goes down (or loses the network), all nodes can discover that it is lost.
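The expectation described above amounts to every node being connected to every other node. A minimal sketch of a check for that follows; the module name `mesh_check` is made up for illustration, and the check only covers nodes that the local node can currently see, so a node that has dropped out of the local view entirely will not appear.

```erlang
-module(mesh_check).
-export([missing_links/0]).

%% Returns {Node, Missing} for every node whose view of the cluster lacks
%% some of the nodes visible from here. Missing is a list of nodes, or the
%% error tuple if the remote call itself failed.
missing_links() ->
    Nodes = [node() | nodes()],
    Results = erpc:multicall(Nodes, erlang, nodes, [], 5000),
    [{N, M} || {N, R} <- lists:zip(Nodes, Results),
               M <- [missing(N, Nodes, R)],
               M =/= []].

%% Nodes that N should be connected to but does not list.
missing(N, All, {ok, Seen}) -> (All -- [N]) -- Seen;
missing(_N, _All, Error) -> Error.
```

An empty list means that, as seen from the local node, the mesh is fully connected at the moment of the call.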

dgud avatar Aug 29 '22 10:08 dgud

Hey @rickard-green, I brought this back to the team and there were some concerns raised.

Looking at our metrics, we were actually hitting memory caps on the machines after our locking fix, so we're concerned that upgrading again and raising VM settings that affect memory will only cause us to run out of memory. Increasing the distribution buffer busy size does increase memory usage, but if it also reduces the number of spawned processes, will that be sufficient to avoid the OOM errors?

I cannot give you a straight answer on that. The outgoing messages that are buffered in the distribution buffer will consume a lot less memory than buffering these messages in their own separate processes, but whether that is enough is very hard to say.

We're also curious in just a general sense why upgrading OTP versions would cause us to have to change VM parameters and if that's a concern at all, given pre OTP 25 everything works out of the box without any VM tuning on our end

In general we do not introduce incompatibilities in patches, but only in new releases (i.e. when the major OTP version changes). However, a bugfix might cause an incompatibility even in a patch if the only way to fix the bug is to introduce an incompatibility. This bugfix qualifies as such a bugfix. We did, however, decide not to enable this fix by default in patches on already existing releases. The fix is available in patches on OTP 22, 23, and 24, and can be enabled on those releases if you desire. As of OTP 25 we enabled this fix by default and flagged it as a potential incompatibility in the release notes, but it can still be disabled if the user desires that. It is hard to introduce incompatibilities any more smoothly than that.

If you are ok with the potential issues that might arise when disabling prevent_overlapping_partitions, all you have to do is disable it on all nodes; global will then behave as before.

rickard-green avatar Aug 29 '22 11:08 rickard-green

Closing this. Please reopen if you have more questions.

rickard-green avatar Oct 10 '22 08:10 rickard-green

Hey @rickard-green, we've attempted an upgrade again and we're met with the same errors. net_setuptime has been set to 30 seconds and we've doubled the distribution buffer busy size, but these messages still occur every few minutes.

MikaAK avatar Nov 03 '22 19:11 MikaAK