[BUG] Fail-safe expiry continues to advertise operational instance name for reverted fabric
Reproduction steps
- Recommission an m5stack all-clusters-app ~800+ times in ping-pong fashion:
  - [old fabric] ArmFailSafe
  - [old fabric] CSRRequest
  - [old fabric] AddTrustedRoot
  - [old fabric] AddNOC
  - [new fabric] CASE
  - [new fabric] CommissioningComplete
  - [new fabric] RemoveFabric(old-fabric-index)
Eventually, I am seeing that mDNS advertisements are breaking with this signature:
I (42006655) chip[DIS]: Advertise operational node ACF66E6C553BB6D4-4831E7A0CAFFCFAC
I (42006705) chip[DIS]: CHIP minimal mDNS configured as 'Operational device'; instance name: ACF66E6C553BB6D4-4831E7A0CAFFCFAC.
E (42006755) chip[DL]: Failed to post event to CHIP Platform event queue
I (42006755) chip[DIS]: mDNS service published: _matter._tcp
E (42006755) chip[CSL]: Failed to queue CHIP System Layer lambda event: 1000000
I (42006755) chip[DIS]: Advertise operational node B488808FBF8FAADD-EF4CC0D532095C69
E (42006775) chip[DL]: Failed to post event to CHIP Platform event queue
E (42006785) chip[CSL]: Failed to queue CHIP System Layer lambda event: 1000000
E (42006815) chip[DIS]: Failed to register query responder: CHIP_ERROR_NO_MEMORY
E (42006825) chip[DIS]: Failed to find an open operational allocator
This condition persists until the device is restarted. It exhibits when the device:
- has 4 fabrics
- a 5th add is attempted
- that add has a fail-safe timeout
- the fail-safe timeout does not appear to restart the mDNS server, so the advertisement slots remain full (see the illustrative sketch after this list)
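For context, the "Failed to find an open operational allocator" error above is consistent with a fixed-size pool of operational advertisement slots being exhausted. A minimal illustration of that failure mode, using hypothetical names rather than the actual SDK types:

```cpp
#include <array>
#include <cstddef>

// Hypothetical illustration (not the actual SDK types): a fixed pool of
// operational advertisement slots. If a slot is never released when a fabric
// is reverted, new fabrics eventually cannot be advertised until reboot.
struct OperationalSlot
{
    bool inUse = false;
};

constexpr std::size_t kMaxOperationalSlots = 5; // illustrative capacity
std::array<OperationalSlot, kMaxOperationalSlots> gSlots;

OperationalSlot * FindEmptySlot()
{
    for (auto & slot : gSlots)
    {
        if (!slot.inUse)
        {
            return &slot;
        }
    }
    // Corresponds to "Failed to find an open operational allocator" in the log.
    return nullptr;
}
```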
Bug prevalence
Will occur when addition of the Nth fabric to the fabric table in an N-fabric device fails due to fail-safe timeout.
GitHub hash of the SDK that was being used
26dab6c296d5b7d3de03a94320b12a412ff4aa9e
Platform
esp32
Platform Version(s)
No response
Anything else?
No response
So it seems there is a path that should call RemoveQueryResponder but doesn't.
In investigating, I am trying to figure out how / if / when RemoveQueryResponder is called. We are failing to call it under some circumstances. I am also observing that, since some refactors where registered query responders seem to come from Platform::New, we can no longer rely on hitting the case where AddQueryResponder matches an existing entry here:
https://github.com/project-chip/connectedhomeip/blob/2c9cd9210b9858efaa3d87d0d538bd00b53757d6/src/lib/dnssd/minimal_mdns/ResponseSender.cpp#L55
That is, *it == queryResponder will be true rarely or never, if I am reading the code correctly. I think this may be the commit where that path becomes non-viable:
commit 59c6bae37dda2f011e853c36b728a18ece5c07f1
Date: Fri Mar 18 15:00:11 2022 -0400
Use platform alloc & intrusive lists for managing MinMdns responders (#16181)
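To illustrate the concern (a simplified sketch, not the actual ResponseSender code): once every Advertise() registers a freshly heap-allocated responder, a pointer-equality check against already-registered entries can essentially never match, so that reuse path is dead.

```cpp
#include <vector>

// Simplified sketch of the concern, not the real ResponseSender implementation.
struct QueryResponder
{
};

std::vector<QueryResponder *> gRegistered;

void AddQueryResponder(QueryResponder * responder)
{
    for (QueryResponder * existing : gRegistered)
    {
        if (existing == responder)
        {
            // Only reachable if the very same object is registered twice.
            // With responders freshly allocated via Platform::New on each
            // Advertise(), this branch is rarely (or never) taken.
            return;
        }
    }
    gRegistered.push_back(responder); // consumes another registration slot
}
```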
I think the leaks are probably happening somewhere at or after this point in the Advertise method:
https://github.com/project-chip/connectedhomeip/blob/2c9cd9210b9858efaa3d87d0d538bd00b53757d6/src/lib/dnssd/Advertiser_ImplMinimalMdns.cpp#L462
As far as I can tell, I think we are leaking the objects allocated here when we hit failure paths in Advertise:
https://github.com/project-chip/connectedhomeip/blob/2c9cd9210b9858efaa3d87d0d538bd00b53757d6/src/lib/dnssd/Advertiser_ImplMinimalMdns.cpp#L412
https://github.com/project-chip/connectedhomeip/blob/2c9cd9210b9858efaa3d87d0d538bd00b53757d6/src/lib/dnssd/Advertiser_ImplMinimalMdns.cpp#L419
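In rough terms, the suspected pattern would look like the sketch below (hypothetical and heavily simplified, not the actual Advertise() code): an object is allocated up front, and a later failure path returns early without releasing it or unregistering its responder.

```cpp
// Hypothetical, simplified sketch of the suspected leak pattern, not the real
// Advertise() implementation.
struct OperationalQueryAllocator
{
};

bool AdvertiseSketch(bool laterStepFails)
{
    // Stands in for the chip::Platform::New<> allocations at the links above.
    auto * allocator = new OperationalQueryAllocator();

    if (laterStepFails)
    {
        // Suspected bug: early return without deleting 'allocator' or calling
        // RemoveQueryResponder, so both the heap object and its registration
        // slot stay occupied until the device restarts.
        return false;
    }

    // Success path: ownership is handed off to the responder bookkeeping.
    (void) allocator;
    return true;
}
```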
@msandstedt can we somehow show what object leaks?
If we suspect a QueryResponder or QueryAllocator, we could add a global instance count to check. Since this seems to be an out-of-heap condition, it could be anything in the provisioning steps.
Is there a chip-tool or repl script that can replicate this? Some copy&paste/scriptable commands would make this easier to investigate.
Unfortunately, on our end we are driving this test from some proprietary tools, so I am unable to provide any script that would be helpful outside of our organization.
However, I've added the instrumentation you've suggested and am re-running to see if I can reproduce and catch a place where there may be an AddQueryResponder without a symmetric RemoveQueryResponder.
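For reference, the instrumentation amounts to a static counter bumped in the constructor and decremented in the destructor, logged at each add/remove. A minimal sketch (names are illustrative, not the exact patch):

```cpp
#include <atomic>
#include <cstdio>

// Illustrative instrumentation sketch: count live instances so that an
// AddQueryResponder without a matching RemoveQueryResponder shows up as a
// monotonically growing "allocated=" number in the logs.
class InstrumentedAllocator
{
public:
    InstrumentedAllocator() { std::printf("OperationalQueryAllocator::New: allocated=%d\n", ++sAllocated); }
    ~InstrumentedAllocator()
    {
        std::printf("OperationalQueryAllocator::~OperationalQueryAllocator: allocated=%d\n", --sAllocated);
    }

private:
    static std::atomic<int> sAllocated;
};

std::atomic<int> InstrumentedAllocator::sAllocated{ 0 };
```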
Here is a test script that could be used to validate things. I was able to run this on linux like this:
# in a separate terminal, start a light app
rm -f kvs1 && ./out/linux-x64-light/chip-lighting-app --discriminator 2118 --KVS kvs1
# Compile and activate python
scripts/build_python.sh -m minimal -i separate
source out/python_env/bin/activate
# actual test run (limited iteration count)
rm -f admin_storage.json && python3 src/python_testing/admin_ping_pong.py --commissioning-method on-network --discriminator 2118 --passcode 20202021 --fabric-id 100 --int-arg num_iterations:5
valgrind did not report leaks though. It may help if we instrument some things to see if they increase, even if the heap gets released at shutdown.
Testing for a ble-wifi device:
rm -f admin_storage.json && python3 src/python_testing/admin_ping_pong.py --commissioning-method ble-wifi --wifi-ssid $SSID --wifi-passphrase $PW -d 3840 --passcode 20202022
(I did not yet get this to pass though... still trying to figure out why)
The above did not work because the passcode is 20202021 for the devkitc. This seems to work:
rm -f admin_storage.json && \
python3 src/python_testing/admin_ping_pong.py \
--commissioning-method ble-wifi --wifi-ssid $SSID --wifi-passphrase $PW \
-d 3840 --passcode 20202021
Ran for a while, then got an out-of-queue-space error:
E (292809) chip[DL]: Long dispatch time: 131 ms, for event type 32781
I (292819) chip[EM]: >>> [E:48615r M:210396802 (Ack:246051861)] (S) Msg RX from 29:000000000000006F [FDEF] --- Type 0000:10 (SecureChannel:StandaloneAck)
I (292839) chip[EM]: >>> [E:48616r M:210396803] (S) Msg RX from 29:000000000000006F [FDEF] --- Type 0001:02 (IM:ReadRequest)
E (292859) chip[DL]: Failed to post event to CHIP Platform event queue
E (292859) chip[CSL]: Failed to queue CHIP System Layer lambda event: 1000000
I do not see the heap going lower:
...
(215899) light-app-callbacks: Current free heap: 152060
(216069) app-devicecallbacks: Current free heap: 148956
(216679) light-app-callbacks: Current free heap: 153068
(225459) light-app-callbacks: Current free heap: 152060
(225629) app-devicecallbacks: Current free heap: 148944
(226289) light-app-callbacks: Current free heap: 152948
(235069) light-app-callbacks: Current free heap: 151892
(235239) app-devicecallbacks: Current free heap: 148816
(235859) light-app-callbacks: Current free heap: 153096
(244629) light-app-callbacks: Current free heap: 151900
(244799) app-devicecallbacks: Current free heap: 148836
(245419) light-app-callbacks: Current free heap: 152936
(254189) light-app-callbacks: Current free heap: 151896
(254359) app-devicecallbacks: Current free heap: 148828
(255049) light-app-callbacks: Current free heap: 152972
(263819) light-app-callbacks: Current free heap: 151900
(263989) app-devicecallbacks: Current free heap: 148828
(264599) light-app-callbacks: Current free heap: 153068
(273429) light-app-callbacks: Current free heap: 150344
(273599) app-devicecallbacks: Current free heap: 148820
(274209) light-app-callbacks: Current free heap: 152952
(283009) light-app-callbacks: Current free heap: 152032
(283179) app-devicecallbacks: Current free heap: 148964
(283789) light-app-callbacks: Current free heap: 153052
(292639) light-app-callbacks: Current free heap: 119088
(292809) app-devicecallbacks: Current free heap: 114444
So my device seems to run out of event queue space, likely sigma1 taking CPU.
So my device seems to run out of event queue space, likely sigma1 taking CPU.
Yes, I've also seen this. So far, I don't think this is implicated, but I have yet to reproduce the original issue since adding the instrumentation.
I have also seen this:
I (443411) chip[ZCL]: OpCreds: successfully created fabric index 0x76 via AddNOC
E (443411) chip[DL]: Failed to post event to CHIP Platform event queue
I (443411) chip[EM]: <<< [E:41848r M:26663948 (Ack:98518830)] (S) Msg TX to 117:BD48AA13EED92FB9 [F129] --- Type 0001:09 (IM:InvokeCommandResponse)
E (443421) chip[CSL]: Failed to queue CHIP System Layer lambda event: 1000000
I (443431) chip[IN]: (S) Sending msg 26663948 on secure session with LSID: 25276
E (443451) chip[CSL]: PacketBuffer: pool EMPTY.
E (443461) chip[DL]: Failed to post event to CHIP Platform event queue
E (443461) chip[DMG]: Failed to send command response: b
E (443461) chip[CSL]: Failed to queue CHIP System Layer lambda event: 1000000
This appears to be more a case of the mDNS traffic starving everything else. In this case, the protocol was still sane: no response was sent and the recommissioning failed as expected.
Still waiting to repro...
OK, here is reproduction. This is the last successful fabric deletion after a recommissioning:
I (45644171) chip[ZCL]: OpCreds: Fabric index 0x68 was removed
I (45645161) chip[DIS]: Updating services using commissioning mode 0
I (45645161) chip[DIS]: CHIP minimal mDNS started advertising.
E (45645211) chip[CTL]: ResponseSender::RemoveQueryResponder: allocated=6
E (45645211) chip[CTL]: OperationalQueryAllocator::~OperationalQueryAllocator: allocated=4
E (45645221) chip[CTL]: ResponseSender::RemoveQueryResponder: allocated=5
E (45645231) chip[CTL]: OperationalQueryAllocator::~OperationalQueryAllocator: allocated=3
E (45645241) chip[CTL]: ResponseSender::RemoveQueryResponder: allocated=4
E (45645251) chip[CTL]: OperationalQueryAllocator::~OperationalQueryAllocator: allocated=2
E (45645251) chip[CTL]: ResponseSender::RemoveQueryResponder: allocated=3
E (45645261) chip[CTL]: OperationalQueryAllocator::~OperationalQueryAllocator: allocated=1
E (45645271) chip[CTL]: ResponseSender::RemoveQueryResponder: allocated=2
E (45645281) chip[CTL]: OperationalQueryAllocator::~OperationalQueryAllocator: allocated=0
I (45645291) chip[DIS]: Advertise operational node 8DA978CE28BF5A69-790B4436644CE621
E (45645291) chip[CTL]: OperationalQueryAllocator::New: allocated=1
E (45645301) chip[CTL]: ResponseSender::AddQueryResponder: allocated=3
I (45645301) chip[DIS]: CHIP minimal mDNS configured as 'Operational device'; instance name: 8DA978CE28BF5A69-790B4436644CE621.
I (45645331) chip[DIS]: mDNS service published: _matter._tcp
I (45645331) chip[DIS]: Advertise operational node A80B62E552A748BB-9A3ACA6EF5CDB8E9
E (45645351) chip[CTL]: OperationalQueryAllocator::New: allocated=2
E (45645351) chip[CTL]: ResponseSender::AddQueryResponder: allocated=4
I (45645361) chip[DIS]: CHIP minimal mDNS configured as 'Operational device'; instance name: A80B62E552A748BB-9A3ACA6EF5CDB8E9.
I (45645391) chip[DIS]: mDNS service published: _matter._tcp
I (45645391) chip[DIS]: Advertise operational node E84E331D7F7A902B-83CBE7CC6A0BD495
E (45645421) chip[CTL]: OperationalQueryAllocator::New: allocated=3
E (45645421) chip[CTL]: ResponseSender::AddQueryResponder: allocated=5
I (45645431) chip[DIS]: CHIP minimal mDNS configured as 'Operational device'; instance name: E84E331D7F7A902B-83CBE7CC6A0BD495.
I (45645471) chip[DIS]: mDNS service published: _matter._tcp
I (45645471) chip[DIS]: Advertise operational node 88457643C469AB4D-972D412B0D3DDA7B
E (45645511) chip[CTL]: OperationalQueryAllocator::New: allocated=4
E (45645511) chip[CTL]: ResponseSender::AddQueryResponder: allocated=6
I (45645521) chip[DIS]: CHIP minimal mDNS configured as 'Operational device'; instance name: 88457643C469AB4D-972D412B0D3DDA7B.
I (45645561) chip[DIS]: mDNS service published: _matter._tcp
I (45645561) chip[ZCL]: OpCreds: RemoveFabric successful
And then, from this point onward, we are stuck:
--
I (46580311) chip[ZCL]: OpCreds: Fabric index 0x6a was removed
I (46581831) chip[TS]: Pending Last Known Good Time: 2022-10-14T11:31:09
I (46581851) chip[TS]: Previous Last Known Good Time: 2022-10-14T11:29:44
I (46581851) chip[TS]: Reverted Last Known Good Time to previous value
E (46582041) chip[FP]: Warning: metadata not found during delete of fabric 0x6a
E (46582301) chip[ZCL]: OpCreds: failed to delete fabric at index 106: d8
I (46582301) app-devicecallbacks: PostAttributeChangeCallback - Cluster ID: '0x0030', EndPoint ID: '0x00', Attribute ID: '0x0000'
I (46582321) app-devicecallbacks: Unhandled cluster ID: 48
So it seems to be the failure to delete a fabric, and therefore the failure to emit the fabric-removed callback, that causes the mDNS advertisements not to be released. Then, when we subsequently try to advertise again without having restarted the mDNS server, no more slots are available.
Edit: not quite the problem. Metadata wasn't there because we hadn't committed this fabric yet.
I think I see it:
I (51186171) chip[FS]: Fail-safe timer expired
E (51186171) chip[ZCL]: OpCreds: Got FailSafeTimerExpired
E (51186171) chip[ZCL]: OpCreds: Proceeding to FailSafeCleanup on fail-safe expiry!
E (51186181) chip[FP]: Reverting pending fabric data for fabric 0x6a
I (51186181) chip[EVL]: Dropped 1 event from buffer with priority 1 and event number 0x00000000000F05E3 due to overflow: event priority_level: 1
E (51186431) chip[FP]: Warning: metadata not found during delete of fabric 0x6a
I (51187851) chip[FP]: Fabric (0x6a) deleted.
I (51187851) chip[ZCL]: [OnFabricRemoved] Handling a fabric removal from the door lock server [endpointId=1,fabricIndex=106]
I (51187871) chip[ZCL]: Lock App: LockEndpoint::GetUser [endpoint=1,userIndex=1]
I (51187871) chip[ZCL]: Lock App: LockEndpoint::GetUser [endpoint=1,userIndex=2]
I (51187881) chip[ZCL]: Lock App: LockEndpoint::GetUser [endpoint=1,userIndex=3]
I (51187891) chip[ZCL]: Lock App: LockEndpoint::GetUser [endpoint=1,userIndex=4]
I (51187891) chip[ZCL]: Lock App: LockEndpoint::GetUser [endpoint=1,userIndex=5]
I (51187911) chip[ZCL]: Lock App: LockEndpoint::GetUser [endpoint=1,userIndex=6]
I (51187911) chip[ZCL]: Lock App: LockEndpoint::GetUser [endpoint=1,userIndex=7]
I (51187921) chip[ZCL]: Lock App: LockEndpoint::GetUser [endpoint=1,userIndex=8]
I (51187931) chip[ZCL]: Lock App: LockEndpoint::GetUser [endpoint=1,userIndex=9]
I (51187931) chip[ZCL]: Lock App: LockEndpoint::GetUser [endpoint=1,userIndex=10]
I (51187951) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=1,credentialType=2]
I (51187961) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=2,credentialType=2]
I (51187971) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=3,credentialType=2]
I (51187971) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=4,credentialType=2]
I (51187981) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=5,credentialType=2]
I (51188001) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=6,credentialType=2]
I (51188011) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=7,credentialType=2]
I (51188021) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=8,credentialType=2]
I (51188031) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=9,credentialType=2]
I (51188041) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=10,credentialType=2]
I (51188051) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=1,credentialType=1]
I (51188061) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=2,credentialType=1]
I (51188071) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=3,credentialType=1]
I (51188081) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=4,credentialType=1]
I (51188101) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=5,credentialType=1]
I (51188111) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=6,credentialType=1]
I (51188111) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=7,credentialType=1]
I (51188121) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=8,credentialType=1]
I (51188131) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=9,credentialType=1]
I (51188151) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=10,credentialType=1]
I (51188161) chip[ZCL]: Lock App: LockEndpoint::GetCredential [endpoint=1,credentialIndex=0,credentialType=0]
I (51188171) chip[ZCL]: OpCreds: Fabric index 0x6a was removed
I (51188981) chip[TS]: Pending Last Known Good Time: 2022-10-14T12:47:22
I (51189001) chip[TS]: Previous Last Known Good Time: 2022-10-14T11:29:44
I (51189001) chip[TS]: Reverted Last Known Good Time to previous value
E (51189201) chip[FP]: Warning: metadata not found during delete of fabric 0x6a
E (51189501) chip[ZCL]: OpCreds: failed to delete fabric at index 106: d8
We don't appear to be restarting the mDNS server on fabric removal for fail-safe expiry. In this case, the new fabric was being added to the last available (5th) slot. This means that although we've removed the fabric, we haven't removed its advertisement. So if we try to commission a new fabric, mDNS advertisement will always fail because the mDNS server hasn't been restarted and its advertisement slots are already full.
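A fix along these lines would re-run DNS-SD advertising once the fail-safe cleanup has reverted the pending fabric, so the stale operational entry is dropped and its slot is freed. Roughly, as a sketch only (the exact hook, and whether StartServer() is the right call at this point, are assumptions):

```cpp
#include <app/server/Dnssd.h>

// Sketch of the kind of fix being discussed, not a verified patch: after
// fail-safe expiry reverts/removes the pending fabric, restart DNS-SD
// advertising so the stale operational instance name is withdrawn and its
// advertisement slot is released.
void OnFailSafeExpiredFabricReverted() // hypothetical hook name
{
    // StartServer() re-advertises the current set of fabrics from scratch,
    // the same refresh that a normal RemoveFabric path gets.
    chip::app::DnssdServer::Instance().StartServer();
}
```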
For our part, we need to be using the already-known IP address. This might in theory allow us to side-step this bug. But it is still a bug because a newly-added fabric can't be advertised.
Further confirmed. The device in question is continuing to advertise a stale identity after fail-safe expiry:
avahi-browse _matter._tcp
+ enp89s0 IPv6 95D5F7D9D34A6FFB-69FB7307CDAF2046 _matter._tcp local
95D5F7D9D34A6FFB-69FB7307CDAF2046 is the operational instance name for the node / fabric that failed commissioning with a fail-safe timeout.