Broker Restart - aio error
Describe the bug
After a couple of hours/days with the broker running, the broker restarts due to errors; check the log file.
First it starts generating the error `open /proc/stat failed` (I am using the API to get statistics data from the broker). After a little while it starts generating the error `send aio error Out of files`, until after a couple of minutes (typically 5 minutes) the broker restarts.
Expected behavior
Actual Behavior
This seems to happen every 2~3 days of operation.
Log File, when the errors start happening
2024-03-07 18:43:27 [1831111] INFO /home/runner/work/nanomq/nanomq/nanomq/pub_handler.c:1172: acl allow
2024-03-07 18:43:27 [1831130] INFO /home/runner/work/nanomq/nanomq/nanomq/pub_handler.c:1172: acl allow
2024-03-07 18:43:28 [1831118] INFO /home/runner/work/nanomq/nanomq/nanomq/pub_handler.c:1172: acl allow
2024-03-07 18:43:28 [1831122] ERROR /home/runner/work/nanomq/nanomq/nanomq/rest_api.c:1387: open /proc/stat failed!
2024-03-07 18:43:28 [1831125] INFO /home/runner/work/nanomq/nanomq/nanomq/pub_handler.c:1172: acl allow
2024-03-07 18:43:28 [1831127] INFO /home/runner/work/nanomq/nanomq/nanomq/unsub_handler.c:189: UnSub topic [maquina/WHDNG23099-VNT/ACM] in client [NodeRed1]. pid [63266]
2024-03-07 18:43:28 [1831120] INFO /home/runner/work/nanomq/nanomq/nanomq/unsub_handler.c:189: UnSub topic [maquina/WHDNG23099-VNT/ACM/cmd] in client [NodeRed1]. pid [63267]
2024-03-07 18:43:28 [1831122] INFO /home/runner/work/nanomq/nanomq/nanomq/unsub_handler.c:189: UnSub topic [maquina/WHDNG23099-VNT/Software/Events] in client [NodeRed1]. pid [63268]
2024-03-07 18:43:28 [1831109] INFO /home/runner/work/nanomq/nanomq/nanomq/unsub_handler.c:189: UnSub topic [maquina/WHDNG23099-VNT/VNT] in client [NodeRed1]. pid [63269]
2024-03-07 18:43:28 [1831129] INFO /home/runner/work/nanomq/nanomq/nanomq/unsub_handler.c:189: UnSub topic [$SYS/#] in client [NodeRed1]. pid [63270]
2024-03-07 18:43:28 [1831122] INFO /home/runner/work/nanomq/nanomq/nanomq/unsub_handler.c:189: UnSub topic [maquina/WHDNG23099-VNT/VNT/cmd] in client [NodeRed1]. pid [63271]
2024-03-07 18:43:28 [1831114] INFO /home/runner/work/nanomq/nanomq/nanomq/pub_handler.c:1172: acl allow
2024-03-07 18:43:28 [1831116] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769: send aio error Out of files
2024-03-07 18:43:28 [1831121] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769: send aio error Out of files
2024-03-07 18:43:28 [1831114] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769: send aio error Out of files
2024-03-07 18:43:28 [1831115] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769: send aio error Out of files
2024-03-07 18:43:28 [1831130] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769: send aio error Out of files
2024-03-07 18:43:28 [1831130] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769: send aio error Out of files
2024-03-07 18:43:28 [1831116] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769: send aio error Out of files
2024-03-07 18:43:28 [1831123] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769: send aio error Out of files
2024-03-07 18:43:28 [1831113] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769: send aio error Out of files
2024-03-07 18:43:28 [1831128] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769: send aio error Out of files
2024-03-07 18:43:28 [1831125] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769: send aio error Out of files
2024-03-07 18:43:28 [1831117] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769: send aio error Out of files
2024-03-07 18:43:28 [1831125] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769: send aio error Out of files
2024-03-07 18:43:28 [1831116] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:1769: send aio error Out of files
To Reproduce
Environment Details
- NanoMQ version: v0.21.6-6
- Operating system and version: Ubuntu Server 22.04
- Compiler and language used: installed from file
Client SDK
Additional context
Ok, will look into it. Apparently the ACL is configured. It would be better if you could provide your config and log files. I highly suspect this is related to HTTP REST API requests; I need to know which API you are using. I did a lot of testing while fixing the bug with the TLS transport (https://github.com/nanomq/NanoNNG/pull/862), but with HTTP disabled.
Also, I would appreciate it if you could verify with the 0.20/0.19 versions.
BTW, it will only be considered a bug if the FD count doesn't decrease after clients stop disconnecting. Otherwise it is just a simple networking problem.
You can refer to the long discussion here: https://github.com/nanomq/nanomq/discussions/1546
Hi @arturv2000, thanks for your feedback. I built a demo locally to replicate the issues described in this issue:
$ ulimit -n 30 && ./nanomq/nanomq start --conf ../etc/nanomq.conf --log_level warn
$ emqtt_bench sub -p 2883 -c 18 -t topic1 -q 0
$ emqtt_bench sub -p 2883 -c 1 -t topic1 -q 0
This is where the above `send aio error Out of files` occurs.
$ curl -i --basic -u admin:public -X GET "http://localhost:8082/api/v4/prometheus"
This is where the above `open /proc/stat failed!` occurs.
Surprisingly, my local nanomq did not quit after going through the above steps and waiting for some time.
In addition, these errors are caused by the system limit on the number of FDs. You can try using ulimit to raise the number of FDs available to the program to work around this problem.
The nanomq exit issue may require more information, such as reproduction steps, configuration files, etc.
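The ulimit workaround mentioned above can be sketched like this (the 65535 value and the config path are assumptions for illustration; the soft limit cannot be raised beyond the hard limit):

```shell
# Show the current soft and hard per-process FD limits
ulimit -Sn
ulimit -Hn
# Raise the soft limit for this shell before launching the broker
# (65535 and the config path are assumptions; adjust for your setup):
#   ulimit -n 65535 && ./nanomq start --conf /etc/nanomq.conf
```

Note that a limit raised this way applies only to processes started from that shell; for a broker managed as a systemd service, `LimitNOFILE=` in the unit file serves the same purpose.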
> Ok, will look into it. Apparently the ACL is configured. It would be better if you could provide your config and log files. I highly suspect this is related to HTTP REST API requests; I need to know which API you are using. I did a lot of testing while fixing the bug with the TLS transport (nanomq/NanoNNG#862), but with HTTP disabled.
> Also, I would appreciate it if you could verify with the 0.20/0.19 versions.
Hello,
Regarding the configuration:
nanomq.conf -> It is pretty much the default configuration; at the moment I don't have any TLS configured.
NOTE: The current log level is warn; it was at info at the time the log file was generated.
# NanoMQ Configuration 0.18.0
# #============================================================
# # NanoMQ Broker
# #============================================================
mqtt {
property_size = 32
max_packet_size = 10KB
max_mqueue_len = 2048
retry_interval = 10s
keepalive_multiplier = 1.25
# Three of below, unsupported now
max_inflight_window = 2048
max_awaiting_rel = 10s
await_rel_timeout = 10s
}
listeners.tcp {
bind = "0.0.0.0:1883"
}
# listeners.ssl {
# bind = "0.0.0.0:8883"
# keyfile = "/etc/certs/key.pem"
# certfile = "/etc/certs/cert.pem"
# cacertfile = "/etc/certs/cacert.pem"
# verify_peer = false
# fail_if_no_peer_cert = false
# }
listeners.ws {
bind = "0.0.0.0:8083/mqtt"
}
http_server {
port = 8081
limit_conn = 2
username = admin
password = xxxxxxxxx
auth_type = basic
jwt {
public.keyfile = "/etc/certs/jwt/jwtRS256.key.pub"
}
}
log {
to = [file, console]
level = warn
dir = "/tmp"
file = "nanomq.log"
rotation {
size = 10MB
count = 5
}
}
auth {
allow_anonymous = false
no_match = deny
deny_action = disconnect
cache = {
max_size = 32
ttl = 1m
}
password = {include "/etc/nanomq_pwd.conf"}
acl = {include "/etc/nanomq_acl.conf"}
}
nanomq_acl -> It is the same as installed.
# #====================================================================
# # Configuration for Acl
# #====================================================================
rules = [
# # Allow MQTT client using username "dashboard" to subscribe to "$SYS/#" topics
{"permit": "allow", "username": "dashboard", "action": "subscribe", "topics": ["$SYS/#"]}
# # Deny "All Users" subscribe to "$SYS/#" "#" Topics
# {"permit": "deny", "username": "#", "action": "subscribe", "topics": ["$SYS/#", "#"]}
# # Allow any other publish/subscribe operation
{"permit": "allow"}
]
The log file for the last time it occurred: nanomq.log.1.zip
Regarding the API requests, at the moment the requests are being made from a Node-RED instance using the default http request node.
- Have calls to `/api/v4/nodes` every 10 seconds, and at every restart of Node-RED (happens quite a lot, since I am developing the dashboard)
- Have calls to `/api/v4/brokers` for every restart of the Node-RED flows
- Have calls to `/api/v4/metrics` every second, and at every restart of Node-RED
- Have calls to `/api/v4/clients` every 30 seconds, and at every restart of Node-RED
- Have calls to `/api/v4/prometheus` every 5 seconds, and at every restart of Node-RED
Could the problem be "exaggerated" by the fact that there are too many API calls, and that they could be made at the "same time"?
The broker at the moment has a total of 4~5 clients connected. 2 clients are sending between 2~20 messages a second, mainly published to a single topic (with occasional messages on other topics), and there are 2~3 clients connected subscribing to the messages from those 2 clients.
> BTW, it will only be considered a bug if the FD count doesn't decrease after clients stop disconnecting. Otherwise it is just a simple networking problem.
> You can refer to the long discussion here: #1546
Thanks for the hint.
Strange, I searched GitHub for the error and nothing showed up.
I will try to check the ulimit and the number of open FDs. At the moment it is quite low (it is at 18), but during the night there are no clients connected sending MQTT messages.
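Checking the open-FD count can be sketched as follows; `/proc/self` is used here so the snippet runs anywhere on Linux, and swapping in the broker PID (e.g. from `pidof nanomq`) is the assumed usage:

```shell
# Count open file descriptors for a process via /proc.
# Replace "self" with the broker PID, e.g. PID=$(pidof nanomq).
PID=self
ls /proc/$PID/fd | wc -l
```

Running this periodically (e.g. under `watch -n 60`) shows whether the count keeps climbing after clients disconnect.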
Thank You
Hi
Just to update.
It is indeed the same behavior as in #1546.
After some time of the broker working, the FD handles stop being "recycled". It doesn't happen immediately, but only starts after around 30~40 hours of broker operation. After this, whenever a client disconnects it seems the FD handles are not released; they keep increasing with new connections, and even if no new connections happen for a while the number does not decrease.
This is a chart with the number of FD handles for the last 24 hours.
The big jumps that appear around 12-13 and from 16-21 were when I was working on a project (using Node-RED) and it was necessary to restart the project (and the broker connection) many times.
At this point, if I make a new connection and then disconnect, the FD handles always increase (tested with the MQTT clients MQTT Explorer and MQTTX).
If I run `sockstat nanomq`, only a couple of sockets appear associated with nanomq:
tis nanomq 2594240 tcp4 *:1883 *:* LISTEN
tis nanomq 2594240 tcp4 *:8083 *:* LISTEN
tis nanomq 2594240 tcp4 *:8081 *:* LISTEN
tis nanomq 2594240 tcp4 86.48.3.211:8083 104.152.52.161:53537 ESTABLISHED
tis nanomq 2594240 tcp4 127.0.0.1:8081 127.0.0.1:47408 ESTABLISHED
tis nanomq 2594240 tcp4 127.0.0.1:1883 127.0.0.1:43120 ESTABLISHED
tis nanomq 2594240 tcp4 127.0.0.1:8081 127.0.0.1:49402 ESTABLISHED
tis nanomq 2594240 tcp4 86.48.3.211:1883 94.61.155.128:64840 ESTABLISHED
tis nanomq 2594240 tcp4 86.48.3.211:8083 185.37.209.99:6784 ESTABLISHED
Running `ls -l /proc/{pid}/fd` effectively shows a lot of associated handles (most of them are red in the console).
nanomq_fd_handles.txt
And here is the nanomq log at the moment; nothing extraordinary in the logs, at least: nanomq.log
Increasing the number of file handles for the broker doesn't seem a valid solution; it will only delay the inevitable crash and restart of the broker.
Thanks
#1546 is already fixed, and that issue only happens with TLS. Make sure you are using a version after that fix. I haven't done a lot of testing on WebSocket though; I would appreciate it if you could disable WebSocket to check if the issue is still there.
So, as I said, have you verified whether the number of used FDs decreases after you stop connecting/disconnecting clients?
And I noticed a lot of "Warning: close pipe & kick client due to KeepAlive timeout!" in the log; apparently clients didn't close the TCP socket cleanly, and this will stack up FD use. Also, there are a lot of QoS unacked messages; your clients seem unable to keep up the pace, so I suggest you switch to QoS 0.
> #1546 is already fixed, and that issue only happens with TLS. Make sure you are using a version after that fix. I haven't done a lot of testing on WebSocket though; I would appreciate it if you could disable WebSocket to check if the issue is still there.
I mentioned #1546 since the behaviour is very similar. Regarding WebSocket, I really can't disable it; the devices that are sending data to the broker are unfortunately only able to use WebSocket, but these devices are not "restarting" very frequently. At least I am not receiving Connect/Disconnect events, either via the API or in the subscription I have to the SYS topic. Node-RED is the client that is restarting frequently and possibly does not terminate the connection "gracefully" when I perform a redeploy. Node-RED is connecting using TCP (standard connection).
> So, as I said, have you verified whether the number of used FDs decreases after you stop connecting/disconnecting clients?
Well, during the night (after 22h) there were no more client reconnects (except some web bots/crawlers) and the FD number didn't reduce; in fact it increased for each bot connection.
Even if I use a more "standard" client like MQTTX to connect/disconnect, at the moment the FD handles increase for each connection I make, and they don't seem to disappear after a while.
In fact, at the moment I was setting up another instance of Node-RED and forgot to configure the username and password for the broker, and the FD count is rising for each connection attempt from Node-RED.
The log entries look like this, for each connection refused:
2024-03-10 14:50:47 [2594247] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/protocol/mqtt/nmq_mqtt.c:749: Invalid auth info.
2024-03-10 14:50:47 [2594259] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:641: nni aio recv error!! Object closed
2024-03-10 14:50:47 [2594259] WARN /home/runner/work/nanomq/nanomq/nng/src/sp/transport/mqtt/broker_tcp.c:909: tcptran_pipe_recv_cb: parse error rv: 139
> And I noticed a lot of "Warning: close pipe & kick client due to KeepAlive timeout!" in the log; apparently clients didn't close the TCP socket cleanly, and this will stack up FD use.
When this happens the FDs indeed start to stack up and it seems they are never released. This behaviour does not happen immediately after the server restarts; in the log file there are instances of this error after the restart (2024-03-07), but the FD count kept low at least until the end of the next day.
> Also, there are a lot of QoS unacked messages; your clients seem unable to keep up the pace, so I suggest you switch to QoS 0.
I found that strange; I am almost sure that the clients are publishing with QoS 0. The data is not critical, but something may be incorrectly configured when publishing, or not defining the QoS at publish time may make the library use QoS 1 as the default. I will have a look at that tomorrow; I don't have access to the devices doing the publishing from outside the company.
I found there is indeed a bug with session keeping. Have you set clean start = false in your client?
Hello
All my clients are supposed to be operating with Clean Session set to true.
"supposed"? I guess there might be one client "accidentally" enable session keeping and triggers this bug. anyway you can have a try on the new fix
Well, "Supposed" since the mqtt part is a "black box" not made by me, so i can't say for sure 😄
I just checked with tshark that the clients (Node-RED, the one that performs a lot of restarts, and the devices) are connecting with Clean Session set to true.
Regarding trying the fix, when is the next release scheduled?
I don't have enough know-how/experience to sync the git repo and force the submodule to another branch, then compile and install the software without breaking anything on the server side...
Maybe in the meantime an automatic build (GitHub Actions) will be made with the updated branch of NanoNNG, and I will try it at that time.
Thank You
Fine, one of those clients must have clean start disabled. Will publish a pre-release for you.
Just to update
I think I have confirmed what is happening.
As long as a single client connects with the clean session flag set to false, the FD handles stop being released and keep increasing with any new connection (independently of the clean session flag state in those new connections).
This was tested with MQTTX: after restarting nanomq, attempt a connection with clean session set to false; as expected, while it is connected the FD count increases by one. Disconnect, and the FD number remains the same; reconnect, and the FD number increases. Repeat, and the FD number keeps increasing even if clean session is set to true. The same behaviour occurs if the username is wrong; as long as clean session is set to false this happens.
And the fact that clients are connecting with clean session set to false is outside of my control; all it takes is a "scan bot" attempting a connection with clean session set to false to trigger this behaviour, even if the username/password is not valid.
I have checked with Wireshark that all my devices (and helper client software such as MQTTX) connecting to the MQTT broker have clean session set to true.
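The repro described above can be sketched with the mosquitto command-line client (this is an assumption for illustration: it requires mosquitto-clients and a local broker on port 1883; `-c` disables clean session and `-E` exits once the subscription is acknowledged):

```shell
#!/bin/sh
# Reconnect repeatedly with clean session disabled and print the broker's
# FD count after each disconnect; skips cleanly if the tools are missing.
if command -v mosquitto_sub >/dev/null 2>&1 && pidof nanomq >/dev/null 2>&1; then
    PID=$(pidof nanomq)
    for i in 1 2 3; do
        mosquitto_sub -p 1883 -i repro-client -c -t test/topic -E
        echo "round $i: $(ls /proc/"$PID"/fd | wc -l) open FDs"
    done
else
    echo "mosquitto_sub or nanomq not available; skipping"
fi
```

On an affected build the printed count would be expected to climb each round; on a fixed build it should stay flat.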
For now, I just have a cron job restarting nanomq every day at 04h to minimize the issue.
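The daily-restart workaround could look like this as a crontab entry (the systemd unit name "nanomq" is an assumption; if the broker is not managed by systemd, substitute your own restart command):

```crontab
# m h dom mon dow  command  -- restart the broker every day at 04:00
0 4 * * * /usr/bin/systemctl restart nanomq
```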
you can have a try on 0.21.7 now
Thanks.
I have updated and made a quick test using MQTTX; simulating connections with clean session set to false no longer seems to increase the FD handles with each new connection.
I will wait a couple of days to give extra feedback.
Thank You
Fixed and tested.