
Cluster HA improvements

Open MarcVanOevelen opened this issue 5 years ago • 13 comments

Proposed Changes

These changes fix missing QoS 1 messages in an HA VerneMQ cluster, as reported in this issue: Message loss on node fault with QoS1 (#1700)

The following problems were discovered and addressed:

  1. Make enqueue a synchronous call : When the broker received a message from a publisher, it was written into the subscriber queue(s) using an asynchronous function call. This caused the ack to be sent before the message was written to persistent storage, leaving a small time window in which a crash of the broker loses the acknowledged message.

Solved by using a synchronous call.
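The effect of the change can be illustrated with a small, self-contained sketch (a toy model in Python, not the actual VerneMQ Erlang code; all names here are hypothetical): the ack for a QoS 1 publish is only produced after the synchronous enqueue has persisted the message, so there is no window in which an acknowledged message exists only in memory.

```python
import queue

class SubscriberQueue:
    """Toy model of a subscriber queue whose enqueue persists before returning."""

    def __init__(self):
        self.persisted = []          # stands in for persistent storage
        self.inbox = queue.Queue()   # stands in for the in-memory queue

    def enqueue_sync(self, msg):
        # Synchronous: the caller blocks until the message has been written
        # to "persistent storage". A broker crash after this point can no
        # longer lose a message that the publisher saw acknowledged.
        self.persisted.append(msg)
        self.inbox.put(msg)
        return "ok"

def publish_qos1(q, msg):
    # The PUBACK is sent only after the synchronous enqueue has returned.
    return "PUBACK" if q.enqueue_sync(msg) == "ok" else None
```

With the previous asynchronous call, the equivalent of `publish_qos1` could return "PUBACK" before `self.persisted.append(msg)` had run.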

  2. Add handshake on inter-node TCP connection : When the broker receives messages for a subscriber connected to another node, the messages are sent via a dedicated inter-node TCP connection. A crash of the receiving node while a message was present in its TCP receive buffer, acknowledged to the sender but not yet processed by the broker, caused that message to be lost: the publisher received an ack as soon as the message was delivered to the sender's TCP stack.

Solved by introducing an application-level handshake on the inter-node TCP connection:

  • The message block "in transit" is buffered until acknowledged by the receiving node.
  • In case the receiving node crashed and hence no ack was received, the buffer is resent when the connection is re-established.
  • In case the subscriber reconnected to another node, the resent messages are forwarded to that new node.
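The three bullets above can be sketched as follows (again a toy Python model with hypothetical names, not the actual vmq_cluster_node code): a message block stays buffered on the sending side until the peer node acknowledges it at the application level, and an unacknowledged block is resent on whatever connection replaces the crashed one.

```python
class InterNodeLink:
    """Toy model of the application-level inter-node handshake."""

    def __init__(self, send_fn):
        self.send_fn = send_fn   # transmits a message block to the peer node
        self.in_transit = None   # block buffered until the peer acks it

    def send_block(self, block):
        # Keep a copy: delivery to the peer's TCP stack alone is not enough,
        # since the peer broker may crash before processing the block.
        self.in_transit = block
        self.send_fn(block)

    def on_ack(self):
        # The remote broker has actually processed the block.
        self.in_transit = None

    def on_reconnect(self, send_fn):
        # The peer crashed before acking: resend the buffered block over the
        # new connection, which may lead to the node the subscriber moved to.
        self.send_fn = send_fn
        if self.in_transit is not None:
            self.send_fn(self.in_transit)
```

The key invariant is that `in_transit` is cleared only by an application-level ack, never by a successful TCP write.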
  3. Catch crashes during queue migration : During HA testing, some crashes were observed when queue migration was triggered, i.e. in the RPC call to get the queue pid from the crashed node and in ets:lookup.

Solved by handling the exceptions.

  4. Only release messages when acknowledged : When the broker delivers messages to a connected subscriber, the messages from the queue are saved in a backup queue and "mailed" to the FSM process that performs the interaction with the client (retries, acks, ...). But the backup messages were released (i.e. deleted from persistent storage) before they were actually acknowledged by the subscriber. Again, when the broker crashes, those messages are lost.

Solved by:

  • releasing each message individually when the FSM receives the ack,
  • releasing QoS 0 messages immediately (unless potentially upgraded by "maybe_upgrade_qos"),
  • appending the next message batch to the backup queue instead of replacing it.
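The backup-queue behavior described above can be modeled in a few lines (a hedged Python sketch with hypothetical names, not the real implementation): delivered-but-unacked messages remain in both the backup queue and persistent storage, each PUBACK releases exactly one message, and a new batch is merged into the backup queue rather than overwriting it.

```python
class BackupQueue:
    """Toy model: release messages from storage only once acknowledged."""

    def __init__(self, storage):
        self.storage = storage   # msg_id -> message, stands in for disk
        self.backup = {}         # delivered but not yet acked

    def deliver_batch(self, batch):
        # Add the next batch to the backup queue instead of replacing it,
        # so earlier unacked messages stay covered by the backup.
        self.backup.update(batch)
        return list(batch)

    def on_puback(self, msg_id):
        # Release exactly this message once the client has acked it.
        self.backup.pop(msg_id, None)
        self.storage.pop(msg_id, None)
```

The pre-fix behavior would correspond to `self.backup = dict(batch)` in `deliver_batch` plus an unconditional bulk release, which deletes messages from storage before any ack arrives.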

Types of Changes

  • [x] Bugfix (non-breaking change which fixes issue #1700)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Documentation (correction or otherwise)
  • [ ] Cosmetics (whitespace, styles...)
  • [ ] DevOps (Build scripts, pipelines...)

Checklist

  • [x] I have read the CODE_OF_CONDUCT.md document
  • [x] All tests pass locally with my changes
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added necessary documentation (if needed)
  • [ ] Any dependent changes have been merged and published in related repositories
  • [ ] I have updated changelog (At the bottom of the release version)
  • [ ] I have squashed all my commits into one before merging

Further Comments

These fixes do not yet cover all cases of message loss, but they already drastically reduce its occurrence:

Without the fixes, about 1 in 3 HA test runs showed message loss. The above fixes reduced this to about 1 lost message in 1000 test runs.

MarcVanOevelen avatar Apr 07 '21 13:04 MarcVanOevelen

@MarcVanOevelen thanks... what a great effort! :) As it's a rather big change it'll take me a while to review. We also need to make sure that any "change of strategy" (from async to sync processing, for instance) has no downstream effects.

If you feel I need any more specific information for an effective review, let me know. (I don't think so, for now at least.)

ioolkos avatar Apr 07 '21 13:04 ioolkos

@ioolkos Thanks!

It took me quite some time to get familiar with the codebase but I must say it is very well structured, and this is such a great project that it was really worth the effort!

Indeed such changes are very tricky given the many features and use-cases. Hopefully they don't cause any side-effects that are not covered by the tests.

Our use case is focused on a small number of publishers/subscribers but we really need reliable message delivery (QoS1) in a HA cluster.

I still have to assess the inevitable performance cost of those changes.

The release smoketest fails now, but I suppose this is caused by the rebar3_cuttlefish dependency, which points to master. I had to pin the previous version to get this working again.

I also see a test failing here that did not fail when I ran the tests locally.

If you need more info please let me know. Thanks.

MarcVanOevelen avatar Apr 07 '21 15:04 MarcVanOevelen

Common Test suites and Dialyzer pass for me locally.

ioolkos avatar Apr 12 '21 08:04 ioolkos

@MarcVanOevelen no in-depth review yet, but I like your code/commenting style; it's very much aligned with the existing code.

It'd be great to have an idea of the performance impact, as you already mentioned. We also need to protect any RAM-grabbing component (buffer/queue/backup), or at least ensure that it will hit an upstream config limit (I haven't yet looked into the details of that point, though).

ioolkos avatar Apr 12 '21 09:04 ioolkos

@MarcVanOevelen I'll try to run some basic performance comparison over the weekend. Anything specific you'd want me to test?

ioolkos avatar Apr 16 '21 14:04 ioolkos

> @MarcVanOevelen I'll try to run some basic performance comparison over the weekend. Anything specific you'd want me to test?

I have successfully completed the HA behavior tests in our production system. Now I am also running performance tests relative to a vernemq:1.11.0-alpine deployment using the Helm chart in k8s. These tests assess maximum sustainable message rates and latency using 1 publisher and 1 subscriber with the following parameters:

  • QoS: [0, 1, 2]
  • message sizes: [100 bytes, 64K bytes]
  • setup: [single node, 2-node cluster]

Clients connect via the k8s service, so in the 2-node cluster the publisher and subscriber connect to different nodes. I had almost completed testing, but unfortunately we just suffered a power outage. Results will be ready in the course of next week.

Since performance tests depend heavily on the test environment (resources, ...), we can learn most from relative tests. But it would be nice if you could test using the same parameters so we can compare our results.

MarcVanOevelen avatar Apr 16 '21 14:04 MarcVanOevelen

@MarcVanOevelen great, thanks. Hopefully no power outages any more :)

ioolkos avatar Apr 16 '21 16:04 ioolkos

Did a basic test to get a feeling for latencies. This is a quick test with 50 publishers on node 1 publishing to 1 subscriber on node 2, sending 1500 100-byte messages per second in total (QoS 1). The highest line is max latency, then the 95th percentile. The max line is at 90 ms vs. 30 ms, so this adds 60 ms.

[Latency graphs: 1769 branch vs. master branch]

EDIT: I also noticed that with the 1769 branch, CPU on both nodes stays very high after the test finishes. This needs to be investigated too. EDIT 2: The above happens as soon as you cluster nodes; vmq_cluster_node is needlessly looping somehow.

ioolkos avatar Apr 18 '21 08:04 ioolkos

@MarcVanOevelen the loop in vmq_cluster_node is racing through loop(internal_flush) because you've disabled the check for Pending in the first clause.

ioolkos avatar Apr 18 '21 13:04 ioolkos

> @MarcVanOevelen the loop in vmq_cluster_node is racing through loop(internal_flush) because you've disabled the check for Pending in the first clause.

Nice catch! I included the original "Pending" check again. Thanks!

MarcVanOevelen avatar Apr 18 '21 18:04 MarcVanOevelen

@ioolkos performance test comparison results in attached pdf VerneMQ_performance_test.pdf

MarcVanOevelen avatar Apr 21 '21 11:04 MarcVanOevelen

@MarcVanOevelen thanks... could you leave me your e-mail address via the contact email here: https://vernemq.com/community.html

ioolkos avatar Apr 21 '21 11:04 ioolkos

@ioolkos and here the HA test results VerneMQ_cluster_HA_test.pdf

MarcVanOevelen avatar Apr 21 '21 12:04 MarcVanOevelen