dirigera icon indicating copy to clipboard operation
dirigera copied to clipboard

Unexplained 'disconnect due to inactivity' Redux

Open bj-github opened this issue 4 months ago • 4 comments

Re #27 , #95 : I seem to get the Status=1000: 'disconnected due to inactivity.' when I send an ON/OFF state change request to my "TRETAKT Smart plug" (V "1", Firmware "2.4.4"). In addition to seeing an on_pong event every minute, I also tried adding an explicit

dirigera_hub.get( route = "/route/<gatewayId>" )

query of the "DIRIGERA Hub for smart products" (V "P2.5", Firmware "2.815.2") at regular intervals inside the on_pong handler.. That doesn't seem to stop this, regardless of how close to the ON/OFF trigger it occurs.

The problem for me is that having the Hub drop out at that time means I don't see the outlet's state change event. The "close" happens before it, and it's gone by the time the SystemD restarts my gateway monitoring service. (For now I resort to sending a "current state" notification during that restart.)

Hoping that this extra data point might help you.

bj-github avatar Aug 10 '25 08:08 bj-github

Another data point: I now have the hub query triggering every hour. ~3h after restart, shortly after the 3rd Hub query, I tried (short-)pressing the ON button my "RODRET Dimmer" controller (V "1", Firmware "1.0.57"). This (presumably?) triggered an immediate Status=1000 "on_close" event as described above. I did not see the "remotePressEvent" nor the "deviceStatusChanged" event from the TRETAKT outlet. SystemD quickly restarted the listener, and pressing the RODRET's OFF button a few seconds later did trigger both of those. I also have it logging the "on_pong" events and can see those happening every minute as expected.

bj-github avatar Aug 11 '25 04:08 bj-github

I wondered if my "lazy" use of the same key for the systemd-managed "event listener" and the cron-managed event triggers might have been in conflict. I generated another key for the former. But it didn't make any difference.

I instrumented all the on_... procedure parameters of create_event_listener to log each time they are called, along with their parameters. That showed the on_pong event is triggered at 1 minute intervals.

I tried adding an explicit get query of the Dirigera device inside that on_pong procedure to occur hourly. That didn't stop the Status = 1000 exits when an event subsequently occurs.

I tried replacing the get against the Dirigera with a a set_on call against the TRETAKT plug, albeit the request was to set it to its current setting (to avoid an actual state change, which are triggered by the cron-based tasks). That also didn't stop the Status = 1000 exits.

I set up a copy of the systemd-based listener on a separate device, including generating a new key for that. In general that showed the same behaviour: a state change on the TRETAKT triggered by the cron task on the first device would also cause this second listener to terminate with a Status = 1000 exit. Sort of ...

I did a systemctl restart ... on the second listener after both had recycled almost two hours earlier due to some event. I then clicked the "On" button on my RODRET and here's what happened in the logs.

Firstly on the S04 device, which hosts both the first listener and the cron-based trigger tasks):

Aug 15 12:43:56 S04 ZigBee[93311]: > Pong: Msg [b'']
Aug 15 12:44:56 S04 ZigBee[93311]: > Pong: Msg [b'']
Aug 15 12:44:59 S04 ZigBee[93311]: > Close: Sts [1000], Msg ['disconnected due to inactivity.']
Aug 15 12:44:59 S04 ZigBee[93311]: Exited "create_event_listener" ...
Aug 15 12:45:02 S04 ZigBee[93798]: Connected to Hub [Dirigera.Kramer502]
Aug 15 12:45:02 S04 ZigBee[93798]: Detected Devices ...
Aug 15 12:45:02 S04 ZigBee[93798]: . [ZigHub]: [gateway]
Aug 15 12:45:02 S04 ZigBee[93798]: . [Remote1]: [controller] - Battery @ [100]%
Aug 15 12:45:02 S04 ZigBee[93798]: . [Plug1]: [outlet] - currently [ON]
Aug 15 12:45:02 S04 ZigBee[93798]: Subscribing to events...
Aug 15 12:45:02 S04 ZigBee[93798]: > Open
Aug 15 12:46:05 S04 ZigBee[93798]: > Msg: Remote1: Button [singlePress]
Aug 15 12:46:05 S04 ZigBee[93798]: > Msg: Plug1: ON -> OFF
Aug 15 12:47:02 S04 ZigBee[93798]: > Pong: Msg [b'']
Aug 15 12:48:02 S04 ZigBee[93798]: > Pong: Msg [b'']

On the second, S05 device hosting the second listener, also showing the restart shortly before:

Aug 15 12:38:55 S05 ZigBee[21844]: > Pong: Msg [b'']
Aug 15 12:39:55 S05 ZigBee[21844]: > Pong: Msg [b'']
Aug 15 12:40:11 S05 ZigBee[22074]: Connected to Hub [Dirigera.Kramer502]
Aug 15 12:40:11 S05 ZigBee[22074]: Detected Devices ...
Aug 15 12:40:12 S05 ZigBee[22074]: . [ZigHub]: [gateway]
Aug 15 12:40:12 S05 ZigBee[22074]: . [Remote1]: [controller] - Battery @ [100]%
Aug 15 12:40:12 S05 ZigBee[22074]: . [Plug1]: [outlet] - currently [OFF]
Aug 15 12:40:12 S05 ZigBee[22074]: Subscribing to events...
Aug 15 12:40:12 S05 ZigBee[22074]: > Open
Aug 15 12:42:12 S05 ZigBee[22074]: > Pong: Msg [b'']
Aug 15 12:43:12 S05 ZigBee[22074]: > Pong: Msg [b'']
Aug 15 12:44:12 S05 ZigBee[22074]: > Pong: Msg [b'']
Aug 15 12:44:59 S05 ZigBee[22074]: > Msg: Remote1: Button [singlePress]
Aug 15 12:44:59 S05 ZigBee[22074]: > Msg: Plug1:  OFF -> ON
Aug 15 12:45:12 S05 ZigBee[22074]: > Pong: Msg [b'']
Aug 15 12:46:05 S05 ZigBee[22074]: > Msg: Remote1: Button [singlePress]
Aug 15 12:46:05 S05 ZigBee[22074]: > Msg: Plug1:  ON -> OFF
Aug 15 12:46:12 S05 ZigBee[22074]: > Pong: Msg [b'']
Aug 15 12:47:12 S05 ZigBee[22074]: > Pong: Msg [b'']

So the latter picked up both the ON and the OFF event triggered from the RODRET remote soon after the recycle.

On the former, the 'ON' triggered a Status = 1000 exit, which caused systemd to restart it. The restart reported the now-'ON' state, It subsequently picked up the 'OFF' event and processed it as expected.

So it appears there is some period of time where the Dirigera triggers the Status = 1000 "inactivity' threshold, but that is not surfaced as part of the ping/pong mechanism. Which makes one wonder what the value of that ping/pong mechanism might be? (Pings are clearly being sent, because the Dirigera is reporting the pong responses.)

From other things I've seen, I can also report that apparently random disconnects are probably caused by the devices themselves doing periodic checks for firmware updates. That triggers changes to the otaState attribute whenever it occurs. Although I've seen the state change reported as readyToCheck, the same value as last seen, without an intervening checkInProgress: perhaps because it can be so quick that it's finished by the time the notification is sent to the listener?

So I guess this leaves the question of how long after starting a listener before the 'inactivity' threshold is reached, notwithstanding the ping/pong mechanism that I assume was meant to avoid that?

bj-github avatar Aug 15 '25 03:08 bj-github

Thanks a lot for the detailed report and for sharing all the experiments you tried, that’s really helpful context. From what you’ve shown, the 1000: disconnected due to inactivity seems to come from the hub itself and isn’t tied to the ping/pong mechanism in the websocket layer. My guess is it’s a quirk in the Dirigera firmware, so the best we can do on the client side is to make sure the library reconnects gracefully whenever this happens.

I’ve been considering adding an example of “auto-reconnect on 1000 close” into the docs. Something along the lines (I did not test this):


def on_close(ws: Any, status_code: int, message: str):
    print(f"Closed with status={status_code}, msg={message}")
    if status_code == 1000:  # hub 'inactivity' close
        # backoff before reconnect
        time.sleep(2)
        print("Reconnecting to Dirigera hub...")
        start_listener()

def start_listener():
    dirigera_hub.create_event_listener(
        on_message=on_message,
        on_error=on_error,
        on_close=on_close,
    )

would something like this work in your case?

Leggin avatar Aug 19 '25 11:08 Leggin

That was my best guess after looking through your code, etc. But my career was data requirements and modelling, not programming, protocols, etc. so I didn't want to bias things with my assumption.

If there's a place to report such issues to Ikea, I couldn't find it. It also "feels" that their efforts these days are around the Matter protocol. Which looks a bigger, more complex "playing field" to my untrained eye.

During my experiments I had tried a simple while true: ... loop around the create_event_listener call. A bit sloppy, and not for production but it sorta worked. Ultimately though I decided to revert it and rely on the SystemD configuration to have it auto-restarted each time it stopped normally. That obviously also yields a delay during the recycle such as you suggest in your code.

Also, in the on_pong handler I got down to a hub query every 4 minutes to no avail. So I ripped all that out as well.

The down side with that is missing events, such as a light being turned OFF/ON. The restart sends me a message of the current state, but I don' get the context of the event happening. Then there are those firmware update checks that also trigger a connection-stopping state change. I think I saw something about the Matter spec requiring that to occur every 12 hours? Multiply that by the number of devices and the whole thing recycles a number of times per day.

Next step is to perhaps implement a persistent store (SQLite probably, versus a JSON/TOML/whatever-formatted file), so that the restart can highlight anything that changed across restart. Except for Firmware checks that didn't actually find anything needing an update ( i.e., no Firmware version# change).

It wasn't clear to me if the OpenHAB and HomeAssistant (HA) folk have had to face the issue. But discussions in each case perhaps were favouring use of the Matter protocol now that it's available? Alas, I couldn't find Python-based code libraries for Matter out of the HA work that I could understand enough to use.

Thank you for your work, and response.

bj-github avatar Aug 21 '25 06:08 bj-github