docker-vernemq
Subscription propagation delay
Hello, our production environment consists of nodes running on Kubernetes on Google Cloud Platform. On top of that we run the official docker image "erlio/docker-vernemq:1.9.2" as a cluster of 3 pods. Every pod has HAProxy in front of the VerneMQ container and has resource limits of 1 CPU and 3 GB RAM.

We noticed that occasionally some devices lose messages. In detail: two devices communicate via MQTT, the first subscribes to a specific topic and the second publishes messages on this topic. The first device occasionally loses messages. Digging into this, we noticed that every so often a VerneMQ node has problems with propagation of subscriptions through the cluster, or rather there is a delay (which can exceed 30 seconds) before a subscription made on another node reaches that node. This means that if a message is published while the subscription has not yet been propagated, the message is lost.

We have developed a script to roughly measure this delay and graph it. Below is the graph obtained from our measurements in a pre-production environment sized like the production one.
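To give an idea of how we measure this, here is a minimal sketch along the lines of our script (not the exact code): it subscribes through one node, publishes through another, and records the time from the subscribe call until the first message arrives. It assumes the paho-mqtt 1.x Python client; the host names, port and topic below are hypothetical placeholders for our per-node service addresses.

```python
# Minimal sketch of the propagation-delay measurement (paho-mqtt 1.x API assumed).
# Host names are placeholders for per-node addresses that bypass the load balancer.
import time
import uuid
import paho.mqtt.client as mqtt

SUB_NODE = "vernemq-0.vernemq.default.svc.cluster.local"  # node where we subscribe (hypothetical)
PUB_NODE = "vernemq-2.vernemq.default.svc.cluster.local"  # node where we publish (hypothetical)
TOPIC = "propagation-test/" + uuid.uuid4().hex

received_at = None

def on_message(client, userdata, msg):
    # Remember when the first message makes it through to the subscriber.
    global received_at
    if received_at is None:
        received_at = time.monotonic()

subscriber = mqtt.Client(client_id="prop-test-sub")
subscriber.on_message = on_message
subscriber.connect(SUB_NODE, 1883)
subscriber.subscribe(TOPIC, qos=1)
subscriber.loop_start()

publisher = mqtt.Client(client_id="prop-test-pub")
publisher.connect(PUB_NODE, 1883)
publisher.loop_start()

subscribed_at = time.monotonic()
# Keep publishing until the subscriber connected to the other node receives something;
# the elapsed time approximates how long the subscription took to propagate.
while received_at is None and time.monotonic() - subscribed_at < 60:
    publisher.publish(TOPIC, payload="ping", qos=1)
    time.sleep(0.1)

if received_at is not None:
    print("approximate propagation delay: %.2f s" % (received_at - subscribed_at))
else:
    print("no message received within 60 s")

subscriber.loop_stop()
publisher.loop_stop()
```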
Hi, since subscriptions are part of the eventually consistent metadata, it is expected that subscriptions are not atomic/instant throughout the cluster. Usually the propagation is almost instant or at least very fast (ms); if not, this usually points to something being overloaded. Perhaps there are many connects/disconnects at that time, or many clients subscribing or unsubscribing at the same time, which makes things more eventually consistent...
Hope this helps.
Hi larshesel,
thank you for your answer. I understand, but the graph I attached refers to the pre-production environment, where we have the following load: about 12 permanently subscribed clients, an average of 2-3 messages per second, plus 9 extra subscriptions every 30 seconds that unsubscribe after the propagation test. The subscription propagation delay from nodes 0 and 1 to node 2 is around 5-10 seconds, and after 24 hours it does not decrease. The other propagation paths are under 500 milliseconds.
Ok, thanks, that's strange. Is the delay only visible in one direction? For instance 0 -> 2, but not 2 -> 0? Do you see any other noticeable differences in the metrics for node 2?
This is an example of metrics:
from node 0 to node 1: 97
from node 1 to node 1: 1
from node 2 to node 1: 2
from node 0 to node 0: 93
from node 1 to node 0: 101
from node 2 to node 0: 2
from node 0 to node 2: 9006
from node 1 to node 2: 4098
from node 2 to node 2: 1
Other metrics like memory or CPU usage don't show anything unusual for that node; they are similar to the other nodes.
Is this with the Plumtree metadata protocol, or the SWC protocol?
@anyone could this be some lazy set/eager set protocol thing? Like, some paths are broadcast directly and the others rely on anti-entropy?
Hi ioolkos.
How can we tell which protocol is in use? We can't find anything related in the config.
Thank you
Hi @tinto82
it's the metadata_plugin config in your vernemq.conf.
If you haven't touched it, it's most probably plumtree.
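For reference, the setting looks roughly like this in vernemq.conf (plugin names as I recall them, double-check against your own file):

```
## Selects the cluster metadata protocol (value names as I recall them).
## Default: Plumtree
metadata_plugin = vmq_plumtree

## To switch to the newer SWC protocol instead:
## metadata_plugin = vmq_swc
```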
We don't define that value in our config file. It may be helpful; we'll try using the SWC protocol.
Yes, what I was wondering is whether your original testing looks the same using the SWC protocol.
Now we are using version 1.9.2. Is that ok or should we try 1.10.0?
1.9.2 is ok (and 1.10.0 has EULA). One thing though: you're on a testing cluster, right? Because the change to SWC will create new data directories and not be compatible with the old ones. Keep that in mind, so don't change this on production on a Friday afternoon :)
Hi ioolkos.
We are trying the SWC protocol in a test environment (putting "metadata_plugin = vmq_swc" in the vernemq config file), but in the logs at startup we can't find a specific log line about SWC. Is there a way to tell which protocol is in use at runtime?
Thank you
Hi,
we tried the SWC protocol in our test environment, keeping version 1.9.2. Delays seem significantly reduced. So we would like to ask you:
- Which should we use, Plumtree or SWC?
- Do you recommend one of the two protocols over the other?
- Do you think one of the two protocols will be discontinued in the future?
- Since we occasionally have application crash problems, could using the SWC protocol give us more stability?
Thank you
SWC is the future! :)
Good! Then after a period of testing we will promote the SWC protocol in production.
In the test environment we had problems with the deployment, as you had warned. In the end we had to delete the vernemq cluster and its disks and then deploy a new vernemq cluster with the SWC protocol. Do you know a smooth way to change protocol?
Thank you!
No smooth way, I fear. Especially for a running plumtree cluster. A smooth upgrade would mean both protocols could run in the same cluster. Not worth the effort implementing this.
Thank you ioolkos.
Now, after an update, one node is having trouble. The other 2 nodes are ok. These are the logs:
14:10:52.332 [error] gen_server vmq_swc_store_meta9 terminated with reason: no case clause matching {ok,{6,1393796574908163946345982392040522594123776}} in swc_node:event/2 line 186
14:10:52.332 [error] CRASH REPORT Process vmq_swc_store_meta9 with 0 neighbours crashed with reason: no case clause matching {ok,{6,1393796574908163946345982392040522594123776}} in swc_node:event/2 line 186
14:10:52.333 [error] Supervisor vmq_swc_store_sup_meta9 had child {vmq_swc_store,meta9} started with vmq_swc_store:start_link({swc_config,'VerneMQ@{pod}.{service}.default.svc.cluster.local',meta9,vmq_swc_db_meta9,vmq_swc_db_leveldb,...}) at <0.5202.0> exit with reason no case clause matching {ok,{6,1393796574908163946345982392040522594123776}} in swc_node:event/2 line 186 in context child_terminated
14:10:52.333 [error] CRASH REPORT Process vmq_swc_store_batcher_meta9 with 0 neighbours exited with reason: {{{case_clause,{ok,{6,1393796574908163946345982392040522594123776}}},[{swc_node,event,2,[{file,"/vernemq-build/_build/default/lib/swc/src/swc_node.erl"},{line,186}]},{vmq_swc_store,process_write_op,2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store.erl"},{line,635}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{vmq_swc_store,'-handle_call/3-fun-1-',2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store.erl"},{line,268}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{vmq_swc_store,...},...]},...} in gen_server:call/3 line 223
14:10:52.334 [error] Supervisor vmq_swc_store_sup_meta9 had child {vmq_swc_store_batcher,meta9} started with vmq_swc_store_batcher:start_link({swc_config,'VerneMQ@{pod}.{service}.default.svc.cluster.local',meta9,vmq_swc_db_meta9,vmq_swc_db_leveldb,...}) at <0.5205.0> exit with reason {{{case_clause,{ok,{6,1393796574908163946345982392040522594123776}}},[{swc_node,event,2,[{file,"/vernemq-build/_build/default/lib/swc/src/swc_node.erl"},{line,186}]},{vmq_swc_store,process_write_op,2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store.erl"},{line,635}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{vmq_swc_store,'-handle_call/3-fun-1-',2,[{file,"/vernemq-build/apps/vmq_swc/src/vmq_swc_store.erl"},{line,268}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{vmq_swc_store,...},...]},...} in context child_terminated
I'm not able to solve it. Can you help me understand the problem? Thank you
@tinto82 what release version do you use? (please try this with the latest release or master). What procedure did you follow? What exactly is "now after an update"?
Hi,
we use version 1.9.2. We added a line to "vmq.acl", so we needed to restart the pods.
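For context, it was just one additional ACL entry; something like the following (the user and topic names here are made up, assuming the usual vmq.acl syntax):

```
# illustrative only - user and topics are placeholders
user device-gateway
topic read devices/status
topic write devices/commands
```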
Ok, this encounters a wrong version vector. Hopefully this is among the bugs fixed in 1.10. cc @dergraf
Hi, I have the same problem with that log when switching from Plumtree to SWC for metadata. Do you have any solution for this issue?
Thanks!
@nguyenduybinh In case you just downloaded release 1.10.4, do not use, builds contained a regression and I deleted them. If not, what is your VerneMQ version?
Thanks for your response,
The VerneMQ version I use is 1576069074:1.10.1 on Ubuntu 18.04.
Thanks!