
Subscription propagation delay

Open · tinto82 opened this issue on Dec 20, 2019 · 24 comments

Hello, our production environment is made up of nodes running on Kubernetes on Google Cloud Platform. On top of that runs the official Docker image "erlio/docker-vernemq:1.9.2", in a cluster of 3 pods. Each pod has HAProxy in front of the VerneMQ container and resource limits of 1 CPU and 3 GB RAM.

We noticed that occasionally some devices lose messages. The situation in detail is as follows: two devices communicate via MQTT; the first subscribes to a specific topic and the second publishes messages on this topic. The first device occasionally loses messages. Digging into this problem, we noticed that every so often a VerneMQ node has problems with the propagation of subscriptions through the cluster, or rather there is a delay (which can even exceed 30 seconds) before a subscription originally made on another node is propagated to that node. This implies that if a message is published while the subscription has not yet been propagated, the message is lost.

We have developed a script to roughly measure this delay and to graph it. Below is the graph obtained from our surveys in a preproduction environment sized like the production one.

[Figure: measured subscription propagation delay between the cluster nodes over time]
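Conceptually, such a probe can be as simple as the following sketch (host names are placeholders, and it assumes the mosquitto command-line clients and a date(1) with millisecond resolution are available): subscribe through one node, publish repeatedly through another, and take the elapsed time until the first delivery as the propagation delay.

```sh
#!/bin/sh
# Rough probe for subscription propagation delay. Sketch only:
# host names are placeholders; assumes mosquitto_sub/mosquitto_pub
# and GNU date (millisecond %N) are available.
NODE0=node-0.example.com   # node that takes the SUBSCRIBE
NODE2=node-2.example.com   # node that takes the PUBLISHes
TOPIC="probe/$$"

# Background publisher: republish every 100 ms via node 2 until killed.
( while :; do mosquitto_pub -h "$NODE2" -t "$TOPIC" -m ping; sleep 0.1; done ) &
pub_pid=$!

start=$(date +%s%3N)
mosquitto_sub -h "$NODE0" -t "$TOPIC" -C 1 >/dev/null   # block until 1st delivery
end=$(date +%s%3N)
kill "$pub_pid"

# Time from SUBSCRIBE on node 0 until a publish via node 2 is delivered,
# i.e. roughly the cross-node subscription propagation delay.
echo "propagation delay: ~$((end - start)) ms"
```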

tinto82 avatar Dec 20 '19 11:12 tinto82

Hi, since subscriptions are part of the eventually consistent metadata, it is expected that they are not atomic/instant throughout the cluster. Usually the propagation is almost instant, or at least very fast (milliseconds); if it isn't, that usually points to something being overloaded. Perhaps there are many connects/disconnects at that time, or many clients subscribing or unsubscribing at the same moment, which makes things more eventually consistent...

Hope this helps.

larshesel avatar Dec 20 '19 12:12 larshesel

Hi larshesel,

thank you for your answer. I understand, but the graph I attached is from the pre-production environment, where we have the following load: about 12 permanently subscribed clients, an average of 2-3 messages per second, plus 9 extra subscriptions every 30 seconds that unsubscribe after the propagation test. The subscription propagation delay from nodes 0 and 1 to node 2 is around 5-10 seconds, and after 24 hours it has not decreased. All other propagation paths are under 500 milliseconds.

tinto82 avatar Dec 20 '19 13:12 tinto82

Ok, thanks, that's strange. Is the delay only visible in one direction? For instance 0 -> 2, but not 2 -> 0? Do you see any other noticeable differences in the metrics for node 2?

ioolkos avatar Dec 20 '19 13:12 ioolkos

This is an example of metrics:

from node 0 to node 1: 97
from node 1 to node 1: 1
from node 2 to node 1: 2
from node 0 to node 0: 93
from node 1 to node 0: 101
from node 2 to node 0: 2
from node 0 to node 2: 9006
from node 1 to node 2: 4098
from node 2 to node 2: 1

Other metrics, like memory or CPU usage, don't show any particular condition on that node; they are similar to the other nodes.

tinto82 avatar Dec 20 '19 13:12 tinto82

Is this with the Plumtree metadata protocol, or the SWC protocol?

@anyone could this be some lazy-set/eager-set protocol thing? Like, some paths are directly broadcast while others rely on anti-entropy?

ioolkos avatar Dec 20 '19 13:12 ioolkos

Hi ioolkos.

How can we tell which protocol is being used? We can't find anything related to it in the config.

Thank you

tinto82 avatar Jan 07 '20 13:01 tinto82

Hi @tinto82, it's the metadata_plugin setting in your vernemq.conf. If you haven't touched it, it's most probably Plumtree.
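For reference, the relevant line in vernemq.conf looks like this (vmq_plumtree is the default; vmq_swc selects the SWC backend, as used later in this thread):

```
## Metadata exchange backend: vmq_plumtree (default) or vmq_swc
metadata_plugin = vmq_plumtree
```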

ioolkos avatar Jan 07 '20 13:01 ioolkos

We don't define that value in our config file.

tinto82 avatar Jan 07 '20 13:01 tinto82

It may be helpful; we'll try using the SWC protocol.

tinto82 avatar Jan 24 '20 14:01 tinto82

Yes, what I was wondering is whether your original testing looks the same using the SWC protocol.

ioolkos avatar Jan 24 '20 14:01 ioolkos

Right now we are using version 1.9.2. Is that ok, or should we try 1.10.0?

tinto82 avatar Jan 24 '20 15:01 tinto82

1.9.2 is ok (and 1.10.0 has a EULA). One thing though: you're on a testing cluster, right? Because the change to SWC will create new data directories that are not compatible with the old ones. Keep that in mind, and don't change this in production on a Friday afternoon :)

ioolkos avatar Jan 24 '20 15:01 ioolkos

Hi ioolkos.

We are trying the SWC protocol in a test environment (putting "metadata_plugin = vmq_swc" in the VerneMQ config file), but we can't find any SWC-specific line in the startup logs. Is there a way to tell which protocol is being used at runtime?
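(One indirect check, as a sketch: since switching to SWC creates new data directories, their presence under the broker's data dir hints that the SWC backend is active. The pod name and path below are assumptions; adjust them to your deployment.)

```sh
# List the metadata stores inside one broker pod; after the switch to
# SWC, newly created SWC store directories should appear there instead
# of the old Plumtree ones. Pod name and data-dir path are hypothetical.
kubectl exec -it vernemq-0 -- ls -l /vernemq/data
```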

Thank you

tinto82 avatar Jan 30 '20 11:01 tinto82

Hi,

we tried the SWC protocol in our test environment, keeping version 1.9.2. Delays seem significantly reduced. So we would like to ask you:

  • Which of Plumtree and SWC should we use?
  • Do you recommend one of the two protocols over the other?
  • Do you think one of the two protocols will be discontinued in the future?
  • Since we occasionally have application crashes, could using the SWC protocol give us more stability?

Thank you

tinto82 avatar Feb 05 '20 11:02 tinto82

SWC is the future! :)

ioolkos avatar Feb 05 '20 12:02 ioolkos

Good! Then, after a period of testing, we will promote the SWC protocol to production.

In the test environment we had problems with the deployment, as you had warned. In the end we had to delete the VerneMQ cluster and its disks, and then deploy the new VerneMQ cluster with the SWC protocol. Do you know a smoother way to change protocol?

Thank you!

tinto82 avatar Feb 05 '20 13:02 tinto82

No smooth way, I fear, especially for a running Plumtree cluster. A smooth upgrade would mean both protocols could run in the same cluster, and that's not worth the effort to implement.
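For a Kubernetes setup like the one described above, the disruptive switch boils down to something like this sketch (resource names and labels are hypothetical, and it wipes all on-disk metadata, so plan a maintenance window):

```sh
# Disruptive Plumtree -> SWC switch (sketch only; names are hypothetical).
kubectl scale statefulset vernemq --replicas=0    # stop all brokers
kubectl delete pvc -l app=vernemq                 # drop the old Plumtree stores

# Select SWC, e.g. through docker-vernemq's env-var mapping to vernemq.conf:
#   DOCKER_VERNEMQ_METADATA_PLUGIN=vmq_swc  ->  metadata_plugin = vmq_swc

kubectl scale statefulset vernemq --replicas=3    # redeploy on fresh disks
```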

ioolkos avatar Feb 05 '20 14:02 ioolkos

Thank you ioolkos.

Now, after an update, one node is having trouble. The other two nodes are ok. The logs are these:

```
14:10:52.332 [error] gen_server vmq_swc_store_meta9 terminated with reason: no case clause matching {ok,{6,1393796574908163946345982392040522594123776}} in swc_node:event/2 line 186
14:10:52.332 [error] CRASH REPORT Process vmq_swc_store_meta9 with 0 neighbours crashed with reason: no case clause matching {ok,{6,1393796574908163946345982392040522594123776}} in swc_node:event/2 line 186
14:10:52.333 [error] Supervisor vmq_swc_store_sup_meta9 had child {vmq_swc_store,meta9} started with vmq_swc_store:start_link({swc_config,'VerneMQ@{pod}.{service}.default.svc.cluster.local',meta9,vmq_swc_db_meta9,vmq_swc_db_leveldb,...}) at <0.5202.0> exit with reason no case clause matching {ok,{6,1393796574908163946345982392040522594123776}} in swc_node:event/2 line 186 in context child_terminated
14:10:52.333 [error] CRASH REPORT Process vmq_swc_store_batcher_meta9 with 0 neighbours exited with reason: {{{case_clause,{ok,{6,1393796574908163946345982392040522594123776}}},[{swc_node,event,2,[{file,"/vernemq-build/_build/default/lib/swc/src/swc_node.erl"},{line,186}]},{vmq_swc_store,process_write_op,2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store.erl"},{line,635}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{vmq_swc_store,'-handle_call/3-fun-1-',2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store.erl"},{line,268}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{vmq_swc_store,...},...]},...} in gen_server:call/3 line 223
14:10:52.334 [error] Supervisor vmq_swc_store_sup_meta9 had child {vmq_swc_store_batcher,meta9} started with vmq_swc_store_batcher:start_link({swc_config,'VerneMQ@{pod}.{service}.default.svc.cluster.local',meta9,vmq_swc_db_meta9,vmq_swc_db_leveldb,...}) at <0.5205.0> exit with reason {{{case_clause,{ok,{6,1393796574908163946345982392040522594123776}}},[{swc_node,event,2,[{file,"/vernemq-build/_build/default/lib/swc/src/swc_node.erl"},{line,186}]},{vmq_swc_store,process_write_op,2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store.erl"},{line,635}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{vmq_swc_store,'-handle_call/3-fun-1-',2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store.erl"},{line,268}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{vmq_swc_store,...},...]},...} in context child_terminated
```

I'm not able to solve it. Can you help me understand the problem? Thank you

tinto82 avatar Feb 12 '20 14:02 tinto82

@tinto82 what release version are you using? (Please try this with the latest release or master.) What procedure did you follow? What exactly do you mean by "now after an update"?

ioolkos avatar Feb 12 '20 14:02 ioolkos

Hi,

we use version 1.9.2. We added a line to "vmq.acl", so we needed to restart the pods.
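(As an aside, an ACL edit alone may not require a restart: vmq_acl can re-read the ACL file on an interval. The exact keys below are an assumption to verify against the docs for your version:)

```
## vmq_acl re-reads the ACL file periodically (interval in seconds)
vmq_acl.acl_file = ./etc/vmq.acl
vmq_acl.acl_reload_interval = 10
```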

tinto82 avatar Feb 12 '20 14:02 tinto82

Ok, this encounters a wrong version vector. Hopefully this is among the bugs fixed in 1.10. cc @dergraf

ioolkos avatar Feb 12 '20 14:02 ioolkos

> Thank you ioolkos.
>
> Now, after an update, one node is having trouble. The other two nodes are ok. The logs are these: [crash report quoted in full above]
>
> I'm not able to solve it. Can you help me understand the problem? Thank you

Hi, I have the same problem with that log when switching from Plumtree to SWC for metadata. Do you have any solution for this issue?

Thanks!

nguyenduybinh avatar Jul 28 '20 10:07 nguyenduybinh

@nguyenduybinh In case you just downloaded release 1.10.4, do not use it; the builds contained a regression and I deleted them. If not, what is your VerneMQ version?

ioolkos avatar Jul 28 '20 10:07 ioolkos

> @nguyenduybinh In case you just downloaded release 1.10.4, do not use it; the builds contained a regression and I deleted them. If not, what is your VerneMQ version?

Thanks for your response,

The VerneMQ version I use is 1576069074:1.10.1 on Ubuntu 18.04.

Thanks!

nguyenduybinh avatar Jul 28 '20 10:07 nguyenduybinh