zigbee2mqtt icon indicating copy to clipboard operation
zigbee2mqtt copied to clipboard

Z2M crash when removing the device to the group (and the device does not belong to the group)

Open grafalex82 opened this issue 1 year ago • 2 comments

What happened?

I am writing a set of automated tests for my device. These tests communicate with z2m by sending mqtt messages, and verifying z2m response.

A subset of tests suppose to check groups functionality of my device. In order to do this a test group is created, and device endpoints are added to this group. In order to ensure group and group membership is correct, the test framework resets the group prior running the tests - previous groups and group membership are cleared, and then immediately created again. This is needed to mitigate the case when a test adds the device to the group and stops. Next tests have to cleanup everything, and reset the setup just to ensure the test environment is correct.

So it is expected that the test tries to remove the device that was not a part of the group.

Problem: group/members/remove request crashes and restarts z2m when trying to remove the device from the group, if the device was not a part of the group.

Zigbee2MQTT:debug 2024-01-01 16:36:52: Received MQTT message on 'zigbee2mqtt/bridge/request/group/add' with data '{"friendly_name": "test_group", "id": 1234}'
Zigbee2MQTT:info  2024-01-01 16:36:52: MQTT publish: topic 'zigbee2mqtt/bridge/response/group/add', payload '{"data":{"friendly_name":"test_group","id":1234},"status":"ok"}'

Zigbee2MQTT:debug 2024-01-01 16:37:16: Received MQTT message on 'zigbee2mqtt/bridge/request/group/members/remove' with data '{"device": "my_test_switch/2", "group": "test_group"}'
Zigbee2MQTT:info  2024-01-01 16:37:16: Removing 'my_test_switch' from 'test_group'
Zigbee2MQTT:info  2024-01-01 16:37:16: MQTT publish: topic 'zigbee2mqtt/bridge/response/group/members/remove', payload '{"data":{"device":"my_test_switch/2","group":"test_group"},"status":"ok"}'
Zigbee2MQTT:info  2024-01-01 16:37:16: Configuring 'my_test_switch'
Zigbee2MQTT:info  2024-01-01 16:37:16: Successfully configured 'my_test_switch'
Error: Read 0x00158d0002b501a7/1 genOnOff(["onOff"], {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Timeout - 36944 - 1 - 6 - 6 - 1 after 10000ms)
    at Timeout._onTimeout (/app/node_modules/zigbee-herdsman/src/utils/waitress.ts:64:35)
    at listOnTimeout (node:internal/timers:569:17)
    at processTimers (node:internal/timers:512:7)

Important note: my device has several endpoints, and the endpoint 1 does NOT implement OnOff cluster - these clusters live on endpoints 2 and 3. At the same time the error above shows that Z2M tries to communicate with OnOff cluster on endpoint 1 (see '0x00158d0002b501a7/1')

Important note 2: adding "skip_disable_reporting": true to the remove request solves the issue, which may signal the problem is in reporting function.

What did you expect to happen?

Problem: Remove device from the group executes normally, z2m does not crash, despite the fact the device is not a part of the group.

Note: Z2M sends 'Remove a Group' message to the device, and the device responds with Status="Group not found". At the same time the Z2M respond with status="ok", which may be not very accurate.

How to reproduce it (minimal and precise)

  1. create a group test_group using the mqtt request zigbee2mqtt/bridge/request/group/add {"friendly_name": "test_group", "id": 1234}

  2. Remove test device from the group (even though it may not belong to the group) zigbee2mqtt/bridge/request/group/members/remove {"device": "my_test_switch/2", "group": "test_group"}

Zigbee2MQTT version

1.35

Adapter firmware version

20200211

Adapter

CC2652

Debug log

No response

grafalex82 avatar Jan 01 '24 15:01 grafalex82

Could you provide the herdsman debug log of this issue?

See this on how to enable the herdsman debug logging. Note that this is only logged to STDOUT and not to log files.

Koenkk avatar Jan 02 '24 08:01 Koenkk

Sure. Here it is. crash.log

grafalex82 avatar Jan 02 '24 16:01 grafalex82

Pushed a fix, can you check if it works?

Changes will be available in the dev branch in a few hours from now.

Koenkk avatar Jan 03 '24 08:01 Koenkk

@Koenkk I tried using latest-dev image from dockerhub, and unfortunately it still repro. Do these changes published to dockerhub?

grafalex82 avatar Jan 04 '24 20:01 grafalex82

Yes, to double check, could you provide the herdsman debug log of starting z2m until this error?

Koenkk avatar Jan 05 '24 08:01 Koenkk

Here is the log on the current latest-dev image log2.txt

grafalex82 avatar Jan 05 '24 20:01 grafalex82

Can you provide the data/database.db entry of my_test_switch?

Koenkk avatar Jan 06 '24 09:01 Koenkk

@Koenkk The database entry attached. db.txt

Here is the converter (just in case): myswitch.js

As I mentioned in the issue description, my switch does not have OnOff cluster on endpoint 1. This endpoint has just service and device-wide clusters (basic, ota). The OnOff stuff is implemented on endpoints 2 and 3.

grafalex82 avatar Jan 06 '24 09:01 grafalex82

The issue is because of the ext converter, try with:

for (const endpoint of device.endpoints) {
            await endpoint.read('genOnOff', ['onOff']);
            await endpoint.read('genOnOffSwitchCfg', ['switchActions']);
            await endpoint.read('genOnOffSwitchCfg', [65280, 65281, 65282, 65283, 65284], manufacturerOptions.jennic);
}

Koenkk avatar Jan 06 '24 10:01 Koenkk

@Koenkk Thank you for pointing me to this issue. I did not know this for vs forEach nuance. I fixed the converter and it no longer crashes.

The only issue that remains unclear is that it sets a new binding from my device to coordinator when I call remove from group request. It does not look correct. Does it have any reason behind?

grafalex82 avatar Jan 06 '24 15:01 grafalex82

When devices are removed from groups, their reporting is reset (as it is setup when added to a group). Because of this a reconfigure is done.

Koenkk avatar Jan 07 '24 07:01 Koenkk

Do you know why it sometimes sets up a binding when reset reporting, and sometimes does not? Could you point me to the piece of z2m code so that I could understand it better?

This uncertainty severly impacts automated testing, as it expects the device and z2m behave in the same way every time.

grafalex82 avatar Jan 07 '24 14:01 grafalex82

What do you mean with reset reporting?

Koenkk avatar Jan 08 '24 18:01 Koenkk

The issue I am seeing is spontaneous bind requests after my device is added or removed to/from the group. These bind requests do not happen always, and honestly I have not found an exact scenario when it happens.

Perhaps I misunderstood one of your posts above, treating it as these bind requests somehow related to reconfiguration process called after group remove operation. I would be grateful if you could provide any information on this. If not - this is not a problem. We can close the issue, as the initial problem is solved.

grafalex82 avatar Jan 08 '24 20:01 grafalex82

It's expected that it happens after removing/adding to group. If you want to dive further on:

These bind requests do not happen always, and honestly I have not found an exact scenario when it happens.

The debug log might give some clues (try to capture one where it does and doesn't)

Koenkk avatar Jan 09 '24 19:01 Koenkk

Thank you, @Koenkk I think the original problem is fixed, and this bug can be closed. I moved my binding issue in a separate report (https://github.com/Koenkk/zigbee2mqtt/issues/20736)

grafalex82 avatar Jan 11 '24 11:01 grafalex82