
Broker Restart - aio error

Open arturv2000 opened this issue 2 years ago • 16 comments

Describe the bug

After the broker has been running for a couple of hours/days, it restarts due to errors; see the log file below.

First it starts logging the error `open /proc/stat failed` (I am using the API to get statistics data from the broker). After a little while it starts logging `send aio error Out of files`, until after a couple of minutes (typically 5) the broker restarts.

Expected behavior

Actual Behavior

This seems to happen every 2~3 days of operation.

Log File, when the errors start happening

2024-03-07 18:43:27 [1831111] INFO  /home/runner/work/nanomq/nanomq/nanomq/pub_handler.c:1172: acl allow
2024-03-07 18:43:27 [1831130] INFO  /home/runner/work/nanomq/nanomq/nanomq/pub_handler.c:1172: acl allow
2024-03-07 18:43:28 [1831118] INFO  /home/runner/work/nanomq/nanomq/nanomq/pub_handler.c:1172: acl allow
2024-03-07 18:43:28 [1831122] ERROR /home/runner/work/nanomq/nanomq/nanomq/rest_api.c:1387: open /proc/stat failed!
2024-03-07 18:43:28 [1831125] INFO  /home/runner/work/nanomq/nanomq/nanomq/pub_handler.c:1172: acl allow
2024-03-07 18:43:28 [1831127] INFO  /home/runner/work/nanomq/nanomq/nanomq/unsub_handler.c:189: UnSub topic [maquina/WHDNG23099-VNT/ACM] in client [NodeRed1]. pid [63266]
2024-03-07 18:43:28 [1831120] INFO  /home/runner/work/nanomq/nanomq/nanomq/unsub_handler.c:189: UnSub topic [maquina/WHDNG23099-VNT/ACM/cmd] in client [NodeRed1]. pid [63267]
2024-03-07 18:43:28 [1831122] INFO  /home/runner/work/nanomq/nanomq/nanomq/unsub_handler.c:189: UnSub topic [maquina/WHDNG23099-VNT/Software/Events] in client [NodeRed1]. pid [63268]
2024-03-07 18:43:28 [1831109] INFO  /home/runner/work/nanomq/nanomq/nanomq/unsub_handler.c:189: UnSub topic [maquina/WHDNG23099-VNT/VNT] in client [NodeRed1]. pid [63269]
2024-03-07 18:43:28 [1831129] INFO  /home/runner/work/nanomq/nanomq/nanomq/unsub_handler.c:189: UnSub topic [$SYS/#] in client [NodeRed1]. pid [63270]
2024-03-07 18:43:28 [1831122] INFO  /home/runner/work/nanomq/nanomq/nanomq/unsub_handler.c:189: UnSub topic [maquina/WHDNG23099-VNT/VNT/cmd] in client [NodeRed1]. pid [63271]
2024-03-07 18:43:28 [1831114] INFO  /home/runner/work/nanomq/nanomq/nanomq/pub_handler.c:1172: acl allow
2024-03-07 18:43:28 [1831116] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769:  send aio error Out of files
2024-03-07 18:43:28 [1831121] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769:  send aio error Out of files
2024-03-07 18:43:28 [1831114] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769:  send aio error Out of files
2024-03-07 18:43:28 [1831115] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769:  send aio error Out of files
2024-03-07 18:43:28 [1831130] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769:  send aio error Out of files
2024-03-07 18:43:28 [1831130] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769:  send aio error Out of files
2024-03-07 18:43:28 [1831116] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769:  send aio error Out of files
2024-03-07 18:43:28 [1831123] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769:  send aio error Out of files
2024-03-07 18:43:28 [1831113] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769:  send aio error Out of files
2024-03-07 18:43:28 [1831128] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769:  send aio error Out of files
2024-03-07 18:43:28 [1831125] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769:  send aio error Out of files
2024-03-07 18:43:28 [1831117] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769:  send aio error Out of files
2024-03-07 18:43:28 [1831125] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769:  send aio error Out of files
2024-03-07 18:43:28 [1831116] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769:  send aio error Out of files

To Reproduce

Environment Details

  • NanoMQ version v0.21.6-6
  • Operating system and version Ubuntu Server 22.04
  • Compiler and language used, installed from file

Client SDK

Additional context

arturv2000 avatar Mar 07 '24 18:03 arturv2000

OK, will look into it. Apparently the ACL is configured; it would be better if you could provide your config and log files. I highly suspect this is related to HTTP REST API requests, so I need to know which API you are using. I did a lot of testing while fixing the bug with the TLS transport (https://github.com/nanomq/NanoNNG/pull/862), but with HTTP disabled.

Also, I would appreciate it if you could verify with the 0.20/0.19 versions.

JaylinYu avatar Mar 08 '24 02:03 JaylinYu

BTW, it will only be considered a bug if the FD count doesn't decrease after clients stop disconnecting. Otherwise it is just a simple networking problem.

You can refer to the long discussion here: https://github.com/nanomq/nanomq/discussions/1546

JaylinYu avatar Mar 08 '24 04:03 JaylinYu

Hi @arturv2000, thanks for your feedback. I built a demo locally to replicate the issues described in this report:

$ ulimit -n 30 && ./nanomq/nanomq start --conf ../etc/nanomq.conf --log_level warn

$ emqtt_bench sub -p 2883 -c 18 -t topic1 -q 0

$ emqtt_bench sub -p 2883 -c 1 -t topic1 -q 0
This is where the above 'send aio error Out of files' occurs.

$ curl -i --basic -u admin:public -X GET "http://localhost:8082/api/v4/prometheus"
This is where the above 'open /proc/stat failed!' occurs.

Surprisingly my local nanomq did not quit after going through the above steps and waiting for some time.

In addition, these errors are caused by the system limit on the number of FDs. You can try using ulimit to raise the number of FDs available to the program and work around the problem.
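The same limit can also be inspected and raised programmatically, which is handy when wrapping the broker in a supervisor script. A minimal sketch (the target of 4096 is an illustrative value, not a nanomq recommendation; without privileges the soft limit can only be raised up to the hard limit):

```python
import resource

# Inspect the current limits on open file descriptors (RLIMIT_NOFILE).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")

# Raise the soft limit toward an illustrative target, capped at the hard limit.
target = 4096
new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
print(f"soft limit is now {resource.getrlimit(resource.RLIMIT_NOFILE)[0]}")
```

This is equivalent to running `ulimit -n` before launching the process, as in the repro above.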

This nanomq exit issue may require more information, such as reproduction steps, configuration files, etc.

RanMaoyi avatar Mar 08 '24 04:03 RanMaoyi

OK, will look into it. Apparently the ACL is configured; it would be better if you could provide your config and log files. I highly suspect this is related to HTTP REST API requests, so I need to know which API you are using. I did a lot of testing while fixing the bug with the TLS transport (nanomq/NanoNNG#862), but with HTTP disabled.

Also, I would appreciate it if you could verify with the 0.20/0.19 versions.

Hello

Regarding the configuration.

nanomq.conf -> It is pretty much the default configuration; at the moment I don't have any TLS configured. NOTE: the current log level is warn; it was info at the time the log file was generated.

# NanoMQ Configuration 0.18.0

# #============================================================
# # NanoMQ Broker
# #============================================================

mqtt {
    property_size = 32
    max_packet_size = 10KB
    max_mqueue_len = 2048
    retry_interval = 10s
    keepalive_multiplier = 1.25

    # Three of below, unsupported now
    max_inflight_window = 2048
    max_awaiting_rel = 10s
    await_rel_timeout = 10s
}

listeners.tcp {
    bind = "0.0.0.0:1883"
}

# listeners.ssl {
#       bind = "0.0.0.0:8883"
#       keyfile = "/etc/certs/key.pem"
#       certfile = "/etc/certs/cert.pem"
#       cacertfile = "/etc/certs/cacert.pem"
#       verify_peer = false
#       fail_if_no_peer_cert = false
# }

listeners.ws {
    bind = "0.0.0.0:8083/mqtt"
}

http_server {
    port = 8081
    limit_conn = 2
    username = admin
    password = xxxxxxxxx
    auth_type = basic
    jwt {
        public.keyfile = "/etc/certs/jwt/jwtRS256.key.pub"
    }
}

log {
    to = [file, console]
    level = warn
    dir = "/tmp"
    file = "nanomq.log"
    rotation {
        size = 10MB
        count = 5
    }
}

auth {
    allow_anonymous = false
    no_match = deny
    deny_action = disconnect

    cache = {
        max_size = 32
        ttl = 1m
    }

    password = {include "/etc/nanomq_pwd.conf"}
    acl = {include "/etc/nanomq_acl.conf"}
}

nanomq_acl -> It is the same as installed:

# #====================================================================
# # Configuration for Acl
# #====================================================================

rules = [
        # # Allow MQTT client using username "dashboard"  to subscribe to "$SYS/#" topics
        {"permit": "allow", "username": "dashboard", "action": "subscribe", "topics": ["$SYS/#"]}

        # # Deny "All Users" subscribe to "$SYS/#" "#" Topics
        # {"permit": "deny", "username": "#", "action": "subscribe", "topics": ["$SYS/#", "#"]}

        # # Allow any other publish/subscribe operation
        {"permit": "allow"}
]

The log file for the last time it occurred: nanomq.log.1.zip

Regarding the API Requests, at the moment the request is being made via a NodeRed instance using the default http request node.

  • Have calls to /api/v4/nodes every 10 seconds, and at every restart of Node-RED (happens quite a lot, since I am developing the dashboard)
  • Have calls to api/v4/brokers for every restart of NodeRed Flows
  • Have calls to api/v4/metrics every second, and at every restart of NodeRed
  • Have calls to api/v4/clients every 30 seconds, and at every restart of NodeRed
  • Have calls to api/v4/prometheus every 5 seconds, and at every restart of NodeRed

Could the problem be "exaggerated" by the fact that there are too many API calls, and that they could be made at the "same time"?

The broker at the moment has a total of 4~5 clients connected: 2 clients are sending between 2~20 messages a second, published mainly to a single topic but occasionally to other topics, and there are 2~3 clients subscribed to the messages from those 2 clients.

arturv2000 avatar Mar 08 '24 10:03 arturv2000

BTW, it will only be considered a bug if the FD count doesn't decrease after clients stop disconnecting. Otherwise it is just a simple networking problem.

You can refer to the long discussion here: #1546

Thanks for the hint.

Strange, I searched GitHub for the error and nothing showed up.

I will try to check the ulimit and the number of open FDs. At the moment it is quite low (18), but during the night there are no clients connected sending MQTT messages.

Thank You

arturv2000 avatar Mar 08 '24 10:03 arturv2000

Hi

Just to update.

It is indeed the same behaviour as in #1546.

After the broker has been working for some time, the FD handles stop being "recycled". It doesn't happen immediately, but only after around 30~40 hours of broker operation. After that, whenever a client disconnects it seems the FD handles are not released; they keep increasing with newer connections, and even if no new connections happen for a while the number does not decrease.

Here is a chart with the number of FD handles over the last 24 hours: [image]

The big jumps that appear around 12-13 and from 16-21 were when I was working on a project (using Node-RED) and it was necessary to restart the project (and the broker connection) many times.

At this point, if I make a new connection and then disconnect, the FD handle count always increases (tested with the MQTT clients MQTT Explorer and MQTTX).

If I run `sockstat nanomq`, only a couple of sockets appear associated with nanomq:

tis      nanomq               2594240  tcp4   *:1883                    *:*                       LISTEN
tis      nanomq               2594240  tcp4   *:8083                    *:*                       LISTEN
tis      nanomq               2594240  tcp4   *:8081                    *:*                       LISTEN
tis      nanomq               2594240  tcp4   86.48.3.211:8083          104.152.52.161:53537      ESTABLISHED
tis      nanomq               2594240  tcp4   127.0.0.1:8081            127.0.0.1:47408           ESTABLISHED
tis      nanomq               2594240  tcp4   127.0.0.1:1883            127.0.0.1:43120           ESTABLISHED
tis      nanomq               2594240  tcp4   127.0.0.1:8081            127.0.0.1:49402           ESTABLISHED
tis      nanomq               2594240  tcp4   86.48.3.211:1883          94.61.155.128:64840       ESTABLISHED
tis      nanomq               2594240  tcp4   86.48.3.211:8083          185.37.209.99:6784        ESTABLISHED

Running `ls -l /proc/{pid}/fd` effectively shows a lot of associated handles (most of them are red in the console): nanomq_fd_handles.txt
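Counting those `/proc/{pid}/fd` entries periodically is an easy way to graph the leak over time. A small Linux-only sketch (using our own PID here as a stand-in for the nanomq PID):

```python
import os

def open_fd_count(pid: int) -> int:
    # Each entry under /proc/<pid>/fd is one open descriptor (Linux only).
    return len(os.listdir(f"/proc/{pid}/fd"))

# Using our own process as a stand-in for the nanomq pid;
# substitute the broker's pid (e.g. from pidof nanomq) in practice.
print("open fds:", open_fd_count(os.getpid()))
```

Logging this value every minute alongside client connect/disconnect events makes it obvious whether descriptors are being recycled.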

And here is the nanomq log at the moment; nothing extraordinary in it, at least: nanomq.log

Increasing the number of file handles for the broker doesn't seem a valid solution; it would only delay the inevitable crash and restart of the broker.

Thanks

arturv2000 avatar Mar 10 '24 10:03 arturv2000

#1546 is already fixed, and that issue only happens with TLS. Make sure you are using a version after that fix. I haven't done a lot of testing on WebSocket though; I would appreciate it if you could disable WebSocket to check whether the issue is still there.

So, as I said: have you verified whether the number of used FDs decreases after you stop connecting/disconnecting clients?

And I noticed a lot of "Warning: close pipe & kick client due to KeepAlive timeout!" in the log; apparently clients didn't close the TCP socket cleanly, which will stack up FD use. Also, there are a lot of unacked QoS messages; your clients seem out of range and cannot keep up the pace, so I suggest you switch to QoS 0.

JaylinYu avatar Mar 10 '24 13:03 JaylinYu

#1546 is already fixed, and that issue only happens with TLS. Make sure you are using a version after that fix. I haven't done a lot of testing on WebSocket though; I would appreciate it if you could disable WebSocket to check whether the issue is still there.

I mentioned #1546 since the behaviour is very similar. Regarding WebSocket, I really can't disable it: the devices sending data to the broker are unfortunately only able to use WebSocket, but those devices are not "restarting" very frequently. At least I am not receiving connect/disconnect events for them, either via the API or in my subscription to the SYS topic. Node-RED is the client that restarts frequently and possibly does not terminate the connection "gracefully" when I perform a redeploy. Node-RED connects using TCP (a standard connection).

So... as I said, Have you verified whether the number of used FD decreased after you stop connect/disconnect clients?

Well, during the night (after 22:00) there were no more client reconnects (except some web bots/crawlers); the FD number didn't decrease, and in fact it increased with each bot connection.

Even if I use a more "standard" client, like MQTTX, to connect/disconnect, at the moment the FD handles increase with each connection I make, and they don't seem to disappear after a while.

In fact, I was just setting up another instance of Node-RED and forgot to configure the username and password for the broker, and the FD count is rising with each connection attempt from Node-RED.

The log entries look like this for each refused connection:

2024-03-10 14:50:47 [2594247] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/protocol/mqtt/nmq_mqtt.c:749: Invalid auth info.
2024-03-10 14:50:47 [2594259] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:641: nni aio recv error!! Object closed
2024-03-10 14:50:47 [2594259] WARN  /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:909: tcptran_pipe_recv_cb: parse error rv: 139

And I noticed a lot of "Warning: close pipe & kick client due to KeepAlive timeout!" in the log, apprantely clients didnt close TCP socket elegantly, this will stack up fd use.

When this happens the FDs indeed start to stack up and seem never to be released. This behaviour does not happen immediately after the server restarts: in the log file there are instances of this error after the restart (2024-03-07), but the FD count stayed low at least until the end of the next day.

Also, a lot of QoS unack msg, your clients seems out of rage, and cannot keep the pace up, suggest you turn to qos0.

I found that strange; I am almost sure the clients publish with QoS 0, since the data is not critical, but something may be incorrectly configured when publishing, or the QoS is not set at publish time and the library uses QoS 1 as the default. I will have a look at that tomorrow; I don't have access to the publishing devices from outside the company.

arturv2000 avatar Mar 10 '24 14:03 arturv2000

I found there is indeed a bug with session keeping. Have you set clean start = false in your client?

JaylinYu avatar Mar 11 '24 03:03 JaylinYu

Hello

All my clients are supposed to be operating with Clean Session set to true.

arturv2000 avatar Mar 11 '24 10:03 arturv2000

"supposed"? I guess there might be one client "accidentally" enable session keeping and triggers this bug. anyway you can have a try on the new fix

JaylinYu avatar Mar 11 '24 12:03 JaylinYu

Well, "Supposed" since the mqtt part is a "black box" not made by me, so i can't say for sure 😄

Just checked with tshark that the clients (Node-RED, the one that performs a lot of restarts, and the devices) are connecting with Clean Session set to true.

Regarding trying the fix, when is the next release scheduled?

I don't have enough know-how/experience to sync the git repo, force the submodule to another branch, and compile and install the software without breaking anything on the server side...

Maybe in the meantime an automatic build (GitHub Actions) will be made with the updated branch of NanoNNG, and I will try it at that time.

Thank You

arturv2000 avatar Mar 11 '24 13:03 arturv2000

Fine, one of those clients must have clean start disabled. I will publish a pre-release for you.

JaylinYu avatar Mar 12 '24 10:03 JaylinYu

Just to update

I think I have confirmed what is happening.

As soon as a single client connects with the clean session flag set to false, the FD handles stop being released and keep increasing with every new connection (regardless of the clean session flag state in those new connections).

This was tested with MQTTX: after restarting nanomq, attempt a connection with clean session set to false; as expected, while it is connected the FD count increases by one. Disconnect, and the FD number remains the same; reconnect, and it increases. Repeat, and the FD number keeps increasing even if clean session is set to true. The same behaviour occurs if the username is wrong, as long as clean session is set to false.

Whether clients connect with clean session set to false is outside my control; all it takes is a "scan bot" attempting a connection with clean session set to false to trigger this behaviour, even if the username/password is not valid.

I have checked with Wireshark that all my devices (and helper client software such as MQTTX) connecting to the MQTT broker have clean session set to true.
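For reference, the clean-session choice travels as bit 1 of the CONNECT flags byte in MQTT 3.1.1, which is what Wireshark/tshark decode in those captures. A minimal illustrative sketch of how that flags byte is composed (not a full CONNECT encoder):

```python
def connect_flags(clean_session: bool, username: bool = True,
                  password: bool = True) -> int:
    """Build the MQTT 3.1.1 CONNECT flags byte (will bit / QoS omitted)."""
    flags = 0
    if clean_session:
        flags |= 0x02      # bit 1: Clean Session
    if password:
        flags |= 0x40      # bit 6: Password flag
    if username:
        flags |= 0x80      # bit 7: User Name flag
    return flags

print(hex(connect_flags(clean_session=True)))   # -> 0xc2
print(hex(connect_flags(clean_session=False)))  # -> 0xc0
```

So a capture filter showing 0xc0-style flags (bit 1 clear) identifies exactly the connections that trigger the leak described above.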

For now, I just have a cron job restarting nanomq every day at 04:00 to minimize the issue.

arturv2000 avatar Mar 16 '24 11:03 arturv2000

You can try 0.21.7 now.

JaylinYu avatar Mar 24 '24 11:03 JaylinYu

Thanks.

I have updated and made a quick test using MQTTX; simulating connections with clean session set to false no longer seems to increase the FD handles with each new connection.

Will wait a couple of days to give extra feedback.

Thank You

arturv2000 avatar Mar 24 '24 22:03 arturv2000

fixed and tested

JaylinYu avatar Jul 22 '24 06:07 JaylinYu