runq_overload alert when using MongoDB for authn/authz; the alert also sometimes stays stuck for days
What happened?
I have around 33,000 clients connected to my EMQX 5.5 cluster with 2 nodes.
Average messages received per second: ~200-300; average messages sent per second: ~100-200.
Node usage: RAM ~1.75/4 GB (~45%), CPU ~20% (2 vCPUs combined).
I get this alert at least 5-7 times every day:
runq_overload:
VM is overloaded on node: '
I also followed everything mentioned in the EMQX Tuning Guide (https://docs.emqx.com/en/emqx/v5.0/performance/tune.html).
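For context, the OS-level part of that guide amounts to raising connection-related kernel and file-descriptor limits. A representative sketch (the values here are illustrative placeholders, not the guide's exact numbers — take the real values from the official docs for your version):

```shell
# Illustrative kernel/ulimit tuning in the spirit of the EMQX tuning guide.
sysctl -w fs.file-max=2097152                        # system-wide open file limit
sysctl -w net.core.somaxconn=32768                   # TCP accept backlog
sysctl -w net.ipv4.ip_local_port_range="1024 65535"  # ephemeral ports for many connections
ulimit -n 1048576                                    # per-process fd limit for the emqx user
```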
On a staging server I also tried with 4 vCPUs; the issue is the same whenever I use MongoDB auth.
I could not find any solution to this, please help me.
Sometimes the alert gets stuck and won't clear. For more details, see this thread: "When I try emqx eval "emqx_olp:is_overloaded()" it's returning false" (https://github.com/emqx/emqx/discussions/13188#discussioncomment-9827982).
What did you expect to happen?
No runq_overload, as my CPU capacity is sufficient; and even when the alert is active, it should clear as expected.
How can we reproduce it (as minimally and precisely as possible)?
Use MongoDB authz and authn and connect more than 30K clients. The required fields in MongoDB are indexed.
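As a concrete sketch of the indexing step, assuming hypothetical collection and field names (`mqtt_user`/`mqtt_acl` keyed by `username` — adjust to whatever fields your EMQX authn/authz filters actually query), the mongosh commands would be:

```javascript
// Hypothetical collection/field names; match these to the filter fields
// configured in EMQX so auth lookups can use the index.
db.mqtt_user.createIndex({ username: 1 });
db.mqtt_acl.createIndex({ username: 1 });
```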
Anything else we need to know?
No response
EMQX version
$ ./bin/emqx_ctl broker
sysdescr : EMQX
version : 5.5.0
datetime : 2024-06-20T11:40:11.303518385+00:00
uptime : 15 days, 17 hours, 38 minutes, 3 seconds
OS version
# On Linux:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
$ uname -a
Linux <nodename> 5.4.0-182-generic #202-Ubuntu SMP Fri Apr 26 12:29:36 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Log files
~~I do not believe this is a bug, hence we should continue the discussion in the original discussion thread:~~ https://github.com/emqx/emqx/discussions/13188
Just to be clear, runq_overload is for monitoring/alerting.
We do not expect a runq_overload alert from a healthy system.
We do expect the runq_overload alert to be raised and cleared on a resource-tight system; it is a sign of resource saturation.
This issue report is about the runq_overload alert not being cleared when the system is no longer overloaded; refer to https://github.com/emqx/emqx/discussions/13188#discussioncomment-9826510
Yes, but I also feel there is some problem with the MongoDB authn/authz plugins, because when I was using EMQX 4 there was no issue like this. Even after scaling up, I get the alerts only when I enable authn and authz against MongoDB. Please check whether there is an issue in the MongoDB auth plugins; this is very critical for our system.
@chaymankala Please create another issue/discussion for the performance issue regarding MongoDB, and provide your configuration to help us understand whether it is a concern.
V4 has no runq overload alarms. Is there monitoring data to support the hypothesis that performance has degraded in v5?
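For reference, a password-based MongoDB authentication entry in EMQX 5.x looks roughly like the following — a sketch based on the EMQX docs, where the server address, database, collection, and hash settings are placeholders to be replaced with the reporter's actual values:

```
authentication = [
  {
    mechanism = password_based
    backend = mongodb
    mongo_type = single
    server = "127.0.0.1:27017"           # placeholder address
    database = "mqtt"                    # placeholder database
    collection = "mqtt_user"             # placeholder collection
    filter { username = "${username}" }  # field the MongoDB index should cover
    password_hash_field = "password_hash"
    password_hash_algorithm { name = sha256, salt_position = suffix }
  }
]
```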
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Okay, you are saying that runq_overload might have happened in v4 too, but since we didn't have any alerting, we didn't see it. I get it. But why am I getting runq_overload, and why is it not clearing from the alerts? It has been in an alerting state for 45 days.
And as of today, whenever I run this command I get 0:
emqx eval 'length(lists:filter(fun(Pid) -> case process_info(Pid,current_function) of {current_function,{qlc,wait_for_request,3}} -> true; _ -> false end end, processes())).'
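A related sanity check (assuming a standard EMQX install with the `emqx eval` CLI available) is to read the BEAM's own run-queue statistic, which should be at or near zero on a genuinely idle system:

```shell
# erlang:statistics(total_run_queue_lengths) is a standard BEAM API that sums
# the run queue lengths of the normal schedulers; a near-zero value means the
# VM is not actually overloaded even if the alarm is still active.
emqx eval 'erlang:statistics(total_run_queue_lengths).'
```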
@chaymankala, I just want to add that in v4 you don't get an alarm, but you do get a printout in the log file, so you could check there whether you have the same perf issue.
As for the alert not being cleared: I cannot reproduce it locally, but I will create another ticket to track that issue.
We don't have the v4 setup right now; please create another ticket to track the not-clearing issue.
Hi @chaymankala. To investigate why the alarm is not cleared even though the system seems idle: if it is still happening, I can help take a closer look at your environment over a screen-share session. You are welcome to reach me at the email in my git commit logs.
@zmstone @chaymankala, be aware that the investigation has moved to #13501.
If you prefer to discuss here or in person, please close #13501.
Hi @zmstone, a screen-share would be great, but we recently updated to 5.7.0, so we had to restart the cluster. Once the cluster was restarted, the alarms cleared, so the next time the alarms get stuck I will reach out to you. Thank you so much.
@qzhuyan Sure we can discuss in #13501
Alright, thanks. Closing this issue now.