MQTT client timeout
I am running docker images ozwdaemon:latest and eclipse-mosquitto:latest as of yesterday on a Raspberry Pi together with Home Assistant 0.113.1. I have about 90 devices and over 500 entities.
As long as I don't really do anything, it runs pretty stably. However, using OZWAdmin-0.1.74, if I push the z-wave network a bit too hard, such as healing and/or refreshing too many nodes at the same time, basically clogging the network with messages, the ozwdaemon mqtt client doesn't seem to be able to keep up. All entities in Home Assistant become unavailable, and I see the following in the mosquitto logs:
Client qt-openzwave-1 has exceeded timeout, disconnecting.
So I restart the ozwdaemon docker instance, and I see the following in the Mosquitto logs:
New connection from 172.18.0.2 on port 1883.
Socket error on client <unknown>, disconnecting.
New connection from 172.18.0.2 on port 1883.
New client connected from 172.18.0.2 as qt-openzwave-1 (p2, c1, k60).
Then it works for just under two minutes before it gets disconnected again:
Client qt-openzwave-1 has exceeded timeout, disconnecting.
From what I understand (please correct me if I'm wrong), the "k60" part in the Mosquitto logs means "keepalive = 60", i.e. when connecting, the MQTT client tells the broker that it will stay in touch with a ping message at least once every minute, and if that doesn't happen, the client will be disconnected.
I increased logging in mosquitto (by setting "log_type all" in mosquitto.conf) and also started ozwdaemon with
-e QT_LOGGING_RULES="*.debug=false;ozw.mqtt.publisher.debug=true"
and I can see the following in the mosquitto logs:
...
Received PUBLISH from qt-openzwave-1 (d0, q0, r1, m0, 'OpenZWave/1/node/1/instance/1/commandclass/32/value/562949970722835/', ... (478 bytes))
Sending PUBLISH to auto-3E2A0E60-FB05-A814-00F1-5DE1DEFD51A0 (d0, q0, r0, m0, 'OpenZWave/1/node/1/instance/1/commandclass/32/value/562949970722835/', ... (478 bytes))
Sending PUBLISH to qt-openzwave-1 (d0, q0, r0, m0, 'OpenZWave/1/node/1/instance/1/commandclass/32/value/562949970722835/', ... (478 bytes))
Received PINGREQ from auto-3E2A0E60-FB05-A814-00F1-5DE1DEFD51A0
Sending PINGRESP to auto-3E2A0E60-FB05-A814-00F1-5DE1DEFD51A0
Client qt-openzwave-1 has exceeded timeout, disconnecting.
while ozwdaemon continues to print out hundreds of rows such as
...
[ozw.mqtt.publisher] [debug]: Publishing Event valueAdded: 562952802893846
...
for several minutes until it realizes that the connection is gone:
[ozw.mqtt.publisher] [debug]: Publishing Event valueRefreshed: 562950595969074
[ozw.mqtt.publisher] [debug]: Publishing Event valueRefreshed: 72057594680475696
[ozw.mqtt.publisher] [debug]: MQTT State Change "Disconnected"
[ozw.mqtt.publisher] [warning]: Exiting on Failure
[ozw.mqtt.publisher] [warning]: MQTT Client Disconnnected
[ozw.mqtt.publisher] [warning]: MQTT Client Error "Transport Invalid"
The only way I can get it to stay up is to remove/rename the ozwcache_0xf7b52c8f.xml file and restart, but that doesn't feel like a good solution.
Any ideas on what's going on?
> [ozw.mqtt.publisher] [debug]: Publishing Event valueAdded: 562952802893846
I gather that across the hundreds of these messages, the number is changing?
If so, yes, I have an idea. We are not yielding to allow the actual network processing to happen. If you can confirm the above, the fix should be simple.
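To illustrate what I mean by yielding (a made-up sketch, not the actual fix, and not the real ozwdaemon function names): if a burst of events is processed in a tight loop on the main thread, QCoreApplication::processEvents() lets queued work, such as the keepalive timer, run in between.

```cpp
#include <QtCore/QCoreApplication>
#include <QtCore/QList>
#include <QtCore/QDebug>

// Hypothetical stand-in for the expensive serialize-and-publish step.
static void publishEvent(quint64 valueId) { qDebug() << "publish" << valueId; }

// If a burst of notifications is handled in a tight loop on the main thread,
// yielding every so often lets queued work (like the MQTT keepalive ping
// timer) run in between iterations.
static void publishAll(const QList<quint64> &valueIds)
{
    int n = 0;
    for (quint64 id : valueIds) {
        publishEvent(id);
        if (++n % 100 == 0)
            QCoreApplication::processEvents(); // yield to timers/sockets
    }
}

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);
    publishAll({562949970722835ULL, 562952802893846ULL});
    return 0;
}
```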
Yes, exactly. The numbers are changing for every row.
Also, it's not only valueAdded, it can be all sorts of different z-wave stuff, such as nodeGroupChanged, valueRefreshed etc.
Same issue here. It also happens sometimes on startup of the qt-ozw-container. I tried to adjust the keepalive_timeout in mosquitto.conf, but mosquitto did not like that and would not start.
It seems like when the mqtt disconnect happens, ozwdaemon is stuck using 100% CPU. I thought it would exit (and restart the container?)
I am also experiencing this issue; it occurs when the ozw container is starting up, resulting in a client timeout and disconnection.
Under the previous zwave 1.4 integration, the ozwlog showed I have a very chatty network, so I assume this causes too many messages and delays ozw in sending a keepalive to mqtt.
As with OP, the only way for ozwd to remain connected to mqtt (not getting the timeout) is to trash the ozwcache file.
Any progress on this one? I have a large network as well and can't get past the initialization step without this timeout and, eventually, a shutdown of ozwdaemon.
1602753420: Client qt-openzwave-1 has exceeded timeout, disconnecting.
@sirfooey I can't find any ozwcache file, where is it located?
@jlengq Check your OZW container volume location, it should be right there.
Thanks, I found it! Unfortunately it does not solve my problem.
I have 100+ nodes, and the interview process launched at startup seems to choke the MQTT network somehow, resulting in the timeout. Deleting the ozwcache only seems to start the whole process over again?
@jlengq yes, trashing the file starts the whole discovery process again (no loss of any node data); but sometimes that's what is needed to get it fully operational. Some people have reported better stability with Build 150 (docker pull openzwave/ozwdaemon:allinone-build-150), might be worth trying that as well.
I've now moved from qt-openzwave to zwave2mqtt. I noticed that during some operations while the controller is waiting for replies and they take a long time to come (and timeout, usually), z2m also shows as "disconnected", but after a while it reconnects. This is probably what is making z2m more reliable: it reconnects to mqtt.
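For illustration, a reconnect-on-disconnect loop in QtMqtt could look roughly like the sketch below (the hostname, client id, and retry delay are made up; this is not z2m's or ozwd's actual code):

```cpp
#include <QtCore/QCoreApplication>
#include <QtCore/QTimer>
#include <QtCore/QDebug>
#include <QtMqtt/QMqttClient>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    QMqttClient client;
    client.setHostname("localhost");      // assumed broker address
    client.setPort(1883);
    client.setClientId("reconnect-demo"); // made-up client id

    QObject::connect(&client, &QMqttClient::stateChanged,
                     [&client](QMqttClient::ClientState state) {
        qDebug() << "MQTT state:" << state;
        if (state == QMqttClient::Disconnected) {
            // Instead of exiting (as ozwdaemon does today), retry in 5 s.
            QTimer::singleShot(5000, &client,
                               [&client] { client.connectToHost(); });
        }
    });

    client.connectToHost();
    return app.exec();
}
```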
Has anyone tried increasing the MQTT client timeout in ozwd to see if that works around the problem? Maybe increasing the timeout would allow ozwd to finish whatever it's doing before disconnecting (assuming it's not in an infinite loop).
I think it should be as simple as adding a call to this->m_client->setKeepAlive(360); in the code below. That would set the keepalive to 6 minutes instead of the default 1 minute. Adjust as necessary, or preferably set it via an environment variable.
https://github.com/OpenZWave/qt-openzwave/blob/89cc0d86c983101aacd89c780bae18bb3dffe9b4/qt-ozwdaemon/mqttpublisher.cpp#L54-L63
Not sure if there's any downside to increasing the time besides not reacting as quickly for real timeouts.
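For reference, a minimal standalone sketch of that change (not the actual mqttpublisher.cpp code; MQTT_KEEPALIVE is a made-up variable name, just to show the environment-variable idea):

```cpp
#include <QtCore/QCoreApplication>
#include <QtMqtt/QMqttClient>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    QMqttClient client;
    client.setHostname("localhost"); // assumed broker address
    client.setPort(1883);

    // QMqttClient::setKeepAlive() takes the interval in seconds and should
    // be called before connectToHost(). MQTT_KEEPALIVE is hypothetical,
    // not an existing ozwdaemon option.
    int keepAlive = qEnvironmentVariableIntValue("MQTT_KEEPALIVE");
    client.setKeepAlive(keepAlive > 0 ? quint16(keepAlive) : 360);

    client.connectToHost();
    return app.exec();
}
```

The trade-off is exactly the one mentioned above: a longer keepalive also means real hangs take longer to detect.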
So far, with limited testing, I can confirm that with @kpine's suggestion ozw is able to start up with a pre-existing ozwcache file, whereas in the past I would get timeout disconnects 100% of the time until I deleted the ozwcache file.
Just added a PR for that. I still think there are better fixes to be made, but at least it will make the daemon start up.
I too have encountered this issue. I have rebuilt my entire HA setup around Docker so that I can test whether turning off logging and whatnot improves the situation (it did, but it still fails to talk to MQTT sometimes).
I also took a look at how the ping/pong works, and from what I can tell it's all built into qtmqtt using a timer; the only reason I can imagine the timer wouldn't fire is if the event loop was blocked.
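A toy Qt program makes the starvation easy to see: a repeating timer (standing in for qtmqtt's keepalive pinger) simply stops firing while the main thread is blocked. Minimal sketch:

```cpp
#include <QtCore/QCoreApplication>
#include <QtCore/QTimer>
#include <QtCore/QThread>
#include <QtCore/QElapsedTimer>
#include <QtCore/QDebug>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    QElapsedTimer clock;
    clock.start();

    // 1-second repeating timer, standing in for qtmqtt's keepalive pinger.
    QTimer pingTimer;
    QObject::connect(&pingTimer, &QTimer::timeout, [&clock] {
        qDebug() << "ping at" << clock.elapsed() << "ms";
    });
    pingTimer.start(1000);

    // After 3 s, block the event loop for 5 s: no pings fire in that window.
    QTimer::singleShot(3000, [&clock] {
        qDebug() << "blocking the event loop at" << clock.elapsed() << "ms";
        QThread::sleep(5); // stands in for a saturated main thread
        qDebug() << "unblocked at" << clock.elapsed() << "ms";
    });

    QTimer::singleShot(10000, &app, &QCoreApplication::quit);
    return app.exec();
}
```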
I am having the same issue: 100+ nodes, and the ozw daemon won't stay up.
I have just added additional devices to my zwave network and am now having the same issue: the ozw daemon goes offline during startup.
So I spent an ungodly amount of time figuring out how to get a local debug build of ozwdaemon running on my MBP. I can't exactly reproduce the issue on my MBP (presumably because it's too fast to trigger the issue), but what I do see is that the MQTT timer for pings is being invoked as expected (I also set up a 1s timer, which triggers precisely on time). I also confirmed that blocking the event loop for a period of time definitely causes the MQTT timer not to be invoked.

After digging into the code a bit, it looks like OZWNotification schedules its events to be processed by the main thread using Qt::QueuedConnection. I have a suspicion that what's going on is that OpenZWave is generating events so quickly that the Pi cannot keep up on the main thread. This causes the queue of queued signals to become saturated with queued events from OZW, which leads to the timer not being able to get in to fire. Something else that makes this seem likely is that I did some profiling, and a huge amount of time is spent serializing events for MQTT, which is likely what makes each event take so long to process on the main thread.

Assuming this is an accurate assessment, I see a few paths forward.
- Use Qt::BlockingQueuedConnection to put some back-pressure onto OZW so that it doesn't saturate us with events during startup. This requires that OZW itself has internal back-pressure mechanisms ensuring that the thread communicating with the dongle doesn't saturate any internal OZW queues (which could cause the dongle to time out).
- Refactor the MQTT handling to automatically reconnect after a disconnect. This should certainly happen anyway, but I think it sort of hides the underlying issue; saturating the event loop can cause all sorts of havoc.
- Move processing off the main thread. This is definitely the best option of the lot. Having the event processing for QTOZW and the network handling for MQTT on their own threads would let them flow events between each other without interfering with the ordering or maintenance work they need to perform (see the sketch after this list).
- Accept something like Pull Request #185, which adjusts the MQTT timeout so it no longer triggers while the main thread is saturated during startup (although it should probably be something like 10m). There isn't a whole lot of downside, since taking longer to discover a 'lost client' is mostly irrelevant. Like option 2, this sort of hides the issue, but it's quick and easy.
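As a rough illustration of option 3 (a made-up sketch, not the actual qt-openzwave code): moving the publisher object onto its own thread with moveToThread keeps the main event loop free to service the keepalive timer, while queued events pile up in the worker's loop instead.

```cpp
#include <QtCore/QCoreApplication>
#include <QtCore/QThread>
#include <QtCore/QTimer>
#include <QtCore/QDebug>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    // A plain QObject living on a worker thread; queued calls targeting it
    // run in that thread's event loop instead of the main one.
    QThread workerThread;
    QObject publisher;
    publisher.moveToThread(&workerThread);
    workerThread.start();

    qDebug() << "main thread:" << QThread::currentThread();

    // QTimer::singleShot with a context object runs the functor on the
    // thread that owns the context (here, the worker thread). The expensive
    // MQTT serialization would live in code like this.
    QTimer::singleShot(0, &publisher, [] {
        qDebug() << "publishing on:" << QThread::currentThread();
    });

    // Clean shutdown after the demo.
    QTimer::singleShot(100, &app, [&] {
        workerThread.quit();
        workerThread.wait();
        app.quit();
    });
    return app.exec();
}
```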
What do you think, @Fishwaldo?
I totally agree that #185 is just a workaround, and to me your option 3 seems like the best one. Unfortunately I don't have enough experience with C++ and Qt to help.
Regarding option 2, there has been some discussion in another issue (I could not find it just now): a problem is apparently that it is hard to keep track of state (and of which messages in either direction might be lost) if you just do an MQTT reconnect without restarting OZW at the same time.
I really hope fishwaldo is well, and will be back from his involuntary break soon...
> So I spent an ungodly amount of time figuring out how to get a local debug build of ozwdaemon running on my MBP. I can't exactly reproduce the issue on my MBP
This is where I am at right now :)
docker buildx build --platform linux/arm -f Docker/Dockerfile -t qt-ozw-allinone-timeout .
WARN[0000] invalid non-bool value for BUILDX_NO_DEFAULT_LOAD:
WARN[0000] No output specified for docker-container driver. Build result will only remain in the build cache. To push result image into registry use --push or to load image into docker use --load
[+] Building 854.4s (16/44)
> ...quickly that the Pi cannot keep up on the main thread. This causes the queue of queued signals to become saturated with queued events from OZW, which leads to the timer not being able to get in to fire.
So the classic CS problem of thread starvation?
Man I am rusty with C++, but having non-working lights is making me want to pick it back up.
Until there's an actual fix for this issue, is there any way to limit the CPU resources of the core-openzwave docker container, to artificially slow it down so that it doesn't overwhelm mqtt? Perhaps Portainer has something useful. I'll check this idea out tomorrow, because otherwise I'm just dead in the water.
What I did on my end as a workaround was to install mqtt locally on my Pi and then bridge it to my main mqtt server. That seems to fix the issue, at least for me. I also tried recompiling ozwd to extend the timeout, but that didn't quite solve the issue completely.
Try this docker compose:
version: '3'
services:
  mqtt:
    image: eclipse-mosquitto
    container_name: "mqtt-bridge"
    volumes:
      - ./mqtt:/mosquitto
      - ./mqtt/data:/mosquitto/data
      - ./mqtt/log:/mosquitto/log
    ports:
      - "1883:1883"
      - "9001:9001"
    restart: always
  ozwd:
    image: openzwave/ozwdaemon:latest
    container_name: "ozwd"
    depends_on:
      - "mqtt"
    security_opt:
      - seccomp:unconfined
    devices:
      - "/dev/serial/by-id/usb-xxx"
    volumes:
      - ./ozw:/opt/ozw/config
    ports:
      - "1983:1983"
      - "5901:5901"
      - "7800:7800"
    environment:
      MQTT_SERVER: "pi.local.net"
      MQTT_USERNAME: "[redacted]"
      MQTT_PASSWORD: "[redacted]"
      USB_PATH: "/dev/serial/by-id/usb-xxx"
      OZW_INSTANCE: "1"
      OZW_NETWORK_KEY: "[redacted]"
    restart: always
Here is the mqtt config:
persistence true
persistence_location /mosquitto/data/
log_dest file /mosquitto/log/mosquitto.log
password_file /mosquitto/config/passwd
allow_anonymous false
# External MQTT Broker
connection zpie01
address hassio.local.net
# 1 is the id of one of many of my instances; update as needed
topic OpenZWave/1/# both
remote_username [redacted]
remote_password [redacted]
Depending on the MQTT server you have, the performance of your device and the size of the network, you could improve things with those kinds of changes. With my network of 58 nodes, even with a local MQTT and Pi4, it couldn't complete quickly enough.
I'll try to remember to push my docker image with Olen's workaround (which is perfectly reasonable to use in "production") tomorrow morning.
Cheers, Brett
It looks like adding "--cpu-shares 512" might be a good start. Would any of you folks here know how I can get that into the command line that HA is using to start up the addon?
For me, I have 4 Pis with 50-100 nodes or so, and nothing would work. Even on a test ozwd instance with no nodes (what gives!?), OZWD would just disconnect, with or without the keepalive modification in the code, if the mqtt server was external to hassio (eclipse-mosquitto in docker on a beefy VM).
I have a feeling now that there is something going on with the network connection between the Pi and the mqtt server. Things worked better when the Pi connected to the internal mqtt addon on the hass server.
If you need to make sure startup completes, you can wipe out the ozwd cache on your Pi; that might help. Just don't reset your zwave stick, and all nodes will come back.
My network has been rock solid as of this morning with the setup I mentioned in my previous comment: Pi (running the ozwd and mqtt containers) bridged to the hassio mqtt.
By network connection, I meant there is something going on with how ozwd handles traffic and/or congestion. Even with the keepalive increased, the network wouldn't be stable, at least in my case.
I'm running a supervised install on Ubuntu 20.04 on an RPi4, with the OpenZWave addon and the MQTT addon. I've been working to get everything back to normal after switching from the old all-in-one Z-Wave integration, and I thought I had finally got things right when this issue cropped up.
A friend did this build yesterday when we were trying to troubleshoot this issue; you can try his docker image: https://hub.docker.com/r/firstof9/qt-ozwdaemon. It has an increased timeout.
Keep in mind it only kind of worked for me; I would still experience intermittent disconnects every hour or so, or when the network got busy.
I am now back to the original docker image though, with mqtt running on the same host and bridged to the hassio mqtt on a separate VM. So far so good.
You can also wipe out the ozwcache file of your ozwdaemon and see if you can get your network back up.
Adding several more devices to my network caused this problem for me last weekend. I'm around 60 devices now, with more to add. I'm running in VirtualBox on an Intel NUC, so it doesn't appear easy to make any temporary changes to work around this issue. Last night I switched to Zwave2MQTT and had all my devices connected this morning. I'll have to spend some time renaming devices/entities, but at least this will get me going again.
I really need a fix for this. The system fails too frequently as I am trying to add some door sensors and deadbolts. Deleting the cache file will let it restart, but many nodes lose their names, and some disappear. And it takes forever, anyway.
I tried Zwave2MQTT last week, but switched back when I ran into some issue (can't remember what it was now, lol). I may switch back and try it again. One concern was that, from what I had read, it seemed to be an orphaned project, and that the OpenZWave add-on with the OpenZWave integration was the road forward.
Is there any way to get the developers' attention quickly, or should I assume that this state of affairs will persist for a while?
Hey @psgcooldog, I am not a developer of qt-openzwave, but I am a C++ developer. I spent a bunch of time looking into this issue, and I am reasonably confident that the fixed timeout should be sufficient for most networks. I have deployed that fix to my own network and have not seen any issues since. If you're still running into issues even after building and deploying the pull request's fix, it's likely there is another issue at play. Cheers, Brett