runq_overload alert when using MongoDB for authn/authz; the alert also sometimes stays stuck for days
What happened?
I have around 33,000 clients connected to my EMQX 5.5 cluster with 2 nodes.
Average messages received per second: ~200-300; average messages sent per second: ~100-200.
Node usage: RAM ~1.75/4 GB (~45%), CPU ~20% (2 vCPUs combined).
I get this alert at least 5-7 times every day:
runq_overload:
VM is overloaded on node: '
I also followed everything mentioned in the EMQX Tuning Guide (https://docs.emqx.com/en/emqx/v5.0/performance/tune.html).
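For context, the OS-level part of that guide amounts to raising connection-related kernel and file-descriptor limits. A representative sketch (the values here are illustrative placeholders, not the guide's exact numbers — take the real values from the official docs for your version):

```shell
# Illustrative kernel/ulimit tuning in the spirit of the EMQX tuning guide.
sysctl -w fs.file-max=2097152                        # system-wide open file limit
sysctl -w net.core.somaxconn=32768                   # TCP accept backlog
sysctl -w net.ipv4.ip_local_port_range="1024 65535"  # ephemeral ports for many connections
ulimit -n 1048576                                    # per-process fd limit for the emqx user
```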
On a staging server I also tried with 4 vCPUs; the issue is the same whenever I use MongoDB auth.
I could not find any solution to this, please help me.
Sometimes the alert gets stuck and won't clear. For more details, see this thread: "When I try emqx eval "emqx_olp:is_overloaded()" it's returning false" (https://github.com/emqx/emqx/discussions/13188#discussioncomment-9827982).
What did you expect to happen?
No runq_overload, as my CPU capacity is sufficient; and even when the alert is active, it should clear as expected.
How can we reproduce it (as minimally and precisely as possible)?
Use MongoDB authz and authn and connect more than 30K clients. The required fields in MongoDB are indexed.
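As a concrete sketch of the indexing step, assuming hypothetical collection and field names (`mqtt_user`/`mqtt_acl` keyed by `username` — adjust to whatever fields your EMQX authn/authz filters actually query), the mongosh commands would be:

```javascript
// Hypothetical collection/field names; match these to the filter fields
// configured in EMQX so auth lookups can use the index.
db.mqtt_user.createIndex({ username: 1 });
db.mqtt_acl.createIndex({ username: 1 });
```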
Anything else we need to know?
No response
EMQX version
$ ./bin/emqx_ctl broker
sysdescr : EMQX
version : 5.5.0
datetime : 2024-06-20T11:40:11.303518385+00:00
uptime : 15 days, 17 hours, 38 minutes, 3 seconds
OS version
# On Linux:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
$ uname -a
Linux <nodename> 5.4.0-182-generic #202-Ubuntu SMP Fri Apr 26 12:29:36 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Log files
~~I do not believe this is a bug, hence we should continue the discussion in the original discussion thread:~~ https://github.com/emqx/emqx/discussions/13188
Just to be clear, runq_overload is for monitoring/alerting.
We do not expect a runq_overload alert from a healthy system.
We do expect the runq_overload alert to be raised and cleared on a resource-tight system; it is a sign of resource saturation.
This issue report is about the runq_overload alert not being cleared when the system is no longer overloaded; refer to https://github.com/emqx/emqx/discussions/13188#discussioncomment-9826510
Yes, but I also feel there is some problem with the MongoDB authn/authz plugins, because when I was using EMQX 4 there was no issue like this. Even after scaling up, I get the alerts only when I enable authn and authz against MongoDB. Please check whether there is an issue in the MongoDB auth plugins; this is very critical for our system.
@chaymankala Please create another issue/discussion for the performance issue regarding MongoDB, and provide your configuration to help us understand whether it is a concern.
V4 has no runq overload alarms. Is there monitoring data to support the hypothesis that performance has degraded in v5?
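For reference, a password-based MongoDB authentication entry in EMQX 5.x looks roughly like the following — a sketch based on the EMQX docs, where the server address, database, collection, and hash settings are placeholders to be replaced with the reporter's actual values:

```
authentication = [
  {
    mechanism = password_based
    backend = mongodb
    mongo_type = single
    server = "127.0.0.1:27017"           # placeholder address
    database = "mqtt"                    # placeholder database
    collection = "mqtt_user"             # placeholder collection
    filter { username = "${username}" }  # field the MongoDB index should cover
    password_hash_field = "password_hash"
    password_hash_algorithm { name = sha256, salt_position = suffix }
  }
]
```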
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Okay, you are saying that runq_overload might have happened in v4 too, but since we didn't have any alerting, we didn't see it. I get it. But why am I getting runq_overload, and why is it not clearing from the alerts? It has been in an alerting state for 45 days.
And as of today, whenever I run this command I get 0:
emqx eval 'length(lists:filter(fun(Pid) -> case process_info(Pid,current_function) of {current_function,{qlc,wait_for_request,3}} -> true; _ -> false end end, processes())).'
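A related sanity check (assuming a standard EMQX install with the `emqx eval` CLI available) is to read the BEAM's own run-queue statistic, which should be at or near zero on a genuinely idle system:

```shell
# erlang:statistics(total_run_queue_lengths) is a standard BEAM API that sums
# the run queue lengths of the normal schedulers; a near-zero value means the
# VM is not actually overloaded even if the alarm is still active.
emqx eval 'erlang:statistics(total_run_queue_lengths).'
```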
@chaymankala, I just want to add that in v4 you don't get an alarm, but you do get a printout in the log file, so you could check there whether you have the same perf issue.
As for the alert not being cleared: I cannot reproduce it locally, but I will create another ticket to track that issue.
We don't have the v4 setup right now; please create another ticket to track the not-clearing issue.
Hi @chaymankala. To investigate why the alarm is not cleared even though the system seems idle: if it is still happening, I can help take a closer look at your environment over a screen-share session. You are welcome to reach me at the email in my git commit logs.
@zmstone @chaymankala, be aware that the investigation has moved to #13501.
If you prefer to discuss here or in person, please close #13501.
Hi @zmstone, a screen-share would be great, but we recently updated to 5.7.0, so we had to restart the cluster. Once the cluster was restarted, the alarms cleared, so the next time the alarms get stuck I will reach out to you. Thank you so much.
@qzhuyan Sure we can discuss in #13501
Alright, thanks. Closing this issue now.