connectedhomeip icon indicating copy to clipboard operation
connectedhomeip copied to clipboard

Multi-fabric broken (Ethernet device, minimal mDNS, TC-MF-1.21)

Open sstruchtrup opened this issue 3 years ago • 15 comments
trafficstars

Problem

Multi-Fabric tests (TC-MF-1.21) fails after adding 6..9 additional controllers. This seems to be mDNS related. Observed during TE9 with our DUT, which currently using the minimal mDNS for development.

Reproducible on TE9 tag with the example lighting-app as well as on current master (a777a801c6f0637ee54f83a63d02f662338e8b51).

chip-tool output:

[1652870021.531285][6077:6078] CHIP:DMG: Received Command Response Data, Endpoint=0 Cluster=0x0000_003E Command=0x0000_0008
[1652870021.531311][6077:6078] CHIP:CTL: Device returned status 0 on receiving the NOC
[1652870021.531332][6077:6078] CHIP:CTL: Operational credentials provisioned on device 0x7f5bc0001960
[1652870021.531352][6077:6078] CHIP:TOO: Secure Pairing Success
[1652870021.531373][6077:6078] CHIP:CTL: Successfully finished commissioning step 'SendNOC'
[1652870021.531392][6077:6078] CHIP:CTL: Commissioning stage next step: 'SendNOC' -> 'FindOperational'
[1652870021.531411][6077:6078] CHIP:CTL: Performing next commissioning step 'FindOperational'
[1652870021.531430][6077:6078] CHIP:CSM: FindOrEstablishSession: PeerId = 20014E4F04A750D8:0000000000000001
[1652870021.531451][6077:6078] CHIP:CSM: FindOrEstablishSession: No existing OperationalDeviceProxy instance found
[1652870021.531542][6077:6078] CHIP:DIS: Operational node lookup already in progress. Will NOT start a new one.
[1652870021.531577][6077:6078] CHIP:DMG: ICR moving to [AwaitingDe]
[1652870021.531601][6077:6078] CHIP:EM: Sending Standalone Ack for MessageCounter:9615839 on exchange 59506i
[1652870021.531632][6077:6078] CHIP:IN: Prepared secure message 0x7f5bc4a99180 to 0xFFFFFFFB00000000 (0)  of type 0x10 and protocolId (0, 0) on exchange 59506i with MessageCounter:2930314.
[1652870021.531657][6077:6078] CHIP:IN: Sending encrypted msg 0x7f5bc4a99180 with MessageCounter:2930314 to 0xFFFFFFFB00000000 (0) at monotonic time: 0000000000E6A9D1 msec
[1652870021.531695][6077:6078] CHIP:EM: Flushed pending ack for MessageCounter:9615839 on exchange 59506i
[1652870021.731146][6077:6078] CHIP:DIS: Checking node lookup status after 200 ms
[1652870036.531581][6077:6078] CHIP:DIS: Checking node lookup status after 15000 ms
[1652870036.531678][6077:6078] CHIP:DIS: Operational discovery failed for 0x0000000000000001: ../../examples/chip-tool/third_party/connectedhomeip/src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:174: CHIP Error 0x00000032: Timeout
[1652870036.531882][6077:6078] CHIP:CTL: Device connection failed. Error ../../examples/chip-tool/third_party/connectedhomeip/src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:174: CHIP Error 0x00000032: Timeout
[1652870036.531912][6077:6078] CHIP:CTL: Error on commissioning step 'FindOperational': '../../examples/chip-tool/third_party/connectedhomeip/src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:174: CHIP Error 0x00000032: Timeout'
[1652870036.531927][6077:6078] CHIP:CTL: Failed to perform commissioning step 18
[1652870036.531945][6077:6078] CHIP:CTL: Going from commissioning step 'FindOperational' with lastErr = '../../examples/chip-tool/third_party/connectedhomeip/src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:174: CHIP Error 0x00000032: Timeout' -> 'Cleanup'
[1652870036.531972][6077:6078] CHIP:CTL: Performing next commissioning step 'Cleanup' with completion status = '../../examples/chip-tool/third_party/connectedhomeip/src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:174: CHIP Error 0x00000032: Timeout'
[1652870036.531996][6077:6078] CHIP:CTL: Successfully finished commissioning step 'Cleanup'
[1652870036.532022][6077:6078] CHIP:TOO: Device commissioning Failure: ../../examples/chip-tool/third_party/connectedhomeip/src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:174: CHIP Error 0x00000032: Timeout

lighting-app does not show anything suspicious with the TE9 tag, but the following on master:

[1652869752.438360][426445:426445] CHIP:ZCL: OpCreds: successfully created fabric index 7 via AddNOC
[1652869752.438372][426445:426445] CHIP:DMG: Decreasing reference count for CommandHandler, remaining 0
[1652869752.438379][426445:426445] CHIP:EM: Piggybacking Ack for MessageCounter:6433823 on exchange: 64301r
[1652869752.438402][426445:426445] CHIP:IN: Prepared secure message 0x55ad4f9e6b30 to 0xFFFFFFFB00000000 (7)  of type 0x9 and protocolId (0, 1) on exchange 64301r with MessageCounter:3112835.
[1652869752.438410][426445:426445] CHIP:IN: Sending encrypted msg 0x55ad4f9e6b30 with MessageCounter:3112835 to 0xFFFFFFFB00000000 (7) at monotonic time: 0000000001220A33 msec
[1652869752.438441][426445:426445] CHIP:DMG: ICR moving to [CommandSen]
[1652869752.438445][426445:426445] CHIP:DMG: ICR moving to [AwaitingDe]
[1652869752.438517][426445:426445] CHIP:EM: Received message of type 0x10 with protocolId (0, 0) and MessageCounter:6433824 on exchange 64300r
[1652869752.438522][426445:426445] CHIP:EM: Found matching exchange: 64300r, Delegate: (nil)
[1652869752.438528][426445:426445] CHIP:EM: Rxd Ack; Removing MessageCounter:3112834 from Retrans Table on exchange 64300r
[1652869752.438530][426445:426445] CHIP:EM: Removed CHIP MessageCounter:3112834 from RetransTable on exchange 64300r
[1652869752.438548][426445:426445] CHIP:DIS: Insufficient parsers to process all SRV entries.
[1652869752.438553][426445:426445] CHIP:DIS: Insufficient parsers to process all SRV entries.
[1652869752.438567][426445:426445] CHIP:DIS: Discovered node without a pending query
[1652869752.438572][426445:426445] CHIP:DIS: Discovery does not require any more timeouts
[1652869752.438576][426445:426445] CHIP:DIS: Discovered node without a pending query
[1652869752.438579][426445:426445] CHIP:DIS: Discovery does not require any more timeouts
[1652869752.438595][426445:426445] CHIP:DIS: Insufficient parsers to process all SRV entries.
[1652869752.438599][426445:426445] CHIP:DIS: Insufficient parsers to process all SRV entries.
[1652869752.438611][426445:426445] CHIP:DIS: Discovered node without a pending query
[1652869752.438615][426445:426445] CHIP:DIS: Discovery does not require any more timeouts
[1652869752.438619][426445:426445] CHIP:DIS: Discovered node without a pending query
[1652869752.438622][426445:426445] CHIP:DIS: Discovery does not require any more timeouts
[1652869752.438638][426445:426445] CHIP:DIS: Insufficient parsers to process all SRV entries.
[1652869752.438642][426445:426445] CHIP:DIS: Insufficient parsers to process all SRV entries.
...
[1652869752.439191][426445:426445] CHIP:EM: Received message of type 0x10 with protocolId (0, 0) and MessageCounter:6433825 on exchange 64301r
[1652869752.439197][426445:426445] CHIP:EM: Found matching exchange: 64301r, Delegate: (nil)
[1652869752.439206][426445:426445] CHIP:EM: Rxd Ack; Removing MessageCounter:3112835 from Retrans Table on exchange 64301r
[1652869752.439211][426445:426445] CHIP:EM: Removed CHIP MessageCounter:3112835 from RetransTable on exchange 64301r
[1652869752.439325][426445:426445] CHIP:DIS: Discovered node without a pending query
[1652869752.439330][426445:426445] CHIP:DIS: Discovery does not require any more timeouts
[1652869752.439333][426445:426445] CHIP:DIS: Discovered node without a pending query
[1652869752.439335][426445:426445] CHIP:DIS: Discovery does not require any more timeouts
[1652869752.439354][426445:426445] CHIP:DIS: Discovered node without a pending query
[1652869752.439357][426445:426445] CHIP:DIS: Discovery does not require any more timeouts
...
[1652869752.439486][426445:426445] CHIP:DIS: Discovery does not require any more timeouts
[1652869752.439502][426445:426445] CHIP:DIS: Insufficient parsers to process all SRV entries.
[1652869752.439518][426445:426445] CHIP:DIS: DNSSD parse error: ../../third_party/connectedhomeip/src/lib/dnssd/IncrementalResolve.cpp:316: CHIP Error 0x0000000B: No memory
[1652869752.439523][426445:426445] CHIP:DIS: DNSSD parse error: ../../third_party/connectedhomeip/src/lib/dnssd/IncrementalResolve.cpp:316: CHIP Error 0x0000000B: No memory
[1652869752.439527][426445:426445] CHIP:DIS: DNSSD parse error: ../../third_party/connectedhomeip/src/lib/dnssd/IncrementalResolve.cpp:316: CHIP Error 0x0000000B: No memory
[1652869752.439530][426445:426445] CHIP:DIS: DNSSD parse error: ../../third_party/connectedhomeip/src/lib/dnssd/IncrementalResolve.cpp:316: CHIP Error 0x0000000B: No memory
[1652869752.439533][426445:426445] CHIP:DIS: Discovered node without a pending query
[1652869752.439535][426445:426445] CHIP:DIS: Discovery does not require any more timeouts
[1652869752.439538][426445:426445] CHIP:DIS: Discovered node without a pending query
[1652869752.439541][426445:426445] CHIP:DIS: Discovery does not require any more timeouts
[1652869752.439554][426445:426445] CHIP:DIS: Insufficient parsers to process all SRV entries.

Out of memory can be ruled out.

Configured via:

gn gen out/host --args="target_os=\"linux\"
        chip_inet_config_enable_ipv4=false
        chip_enable_openthread=false
        chip_enable_ota_requestor=false
        chip_config_network_layer_ble=false
        chip_mdns=\"minimal\"
        chip_enable_ble=false
        chip_enable_wifi=false
        treat_warnings_as_errors=false"

sstruchtrup avatar May 18 '22 10:05 sstruchtrup

This cannot be tested with the Avahi-based mDNS, as this seems to be completely broken at the moment, cf. #18548.

sstruchtrup avatar May 18 '22 10:05 sstruchtrup

Might be related/duplicate of #18256

sstruchtrup avatar May 18 '22 12:05 sstruchtrup

@sstruchtrup could you describe testing steps here? How can I reproduce the issue and what commands do I need to run to reproduce it.

Was this tested on latest or TE9? The latest changes to minmdns that affect device discovery were merged only recently.

andy31415 avatar May 18 '22 20:05 andy31415

@andy31415 This was discovered during TE9 with the TE9 tag, but is also present on current master (commit a777a801c6f0637ee54f83a63d02f662338e8b51).

sstruchtrup avatar May 19 '22 15:05 sstruchtrup

I apply the following patch to support building without Wifi and Thread:

diff --git a/src/platform/Linux/BUILD.gn b/src/platform/Linux/BUILD.gn
index 1095bcebb..64f7eb8cf 100644
--- a/src/platform/Linux/BUILD.gn
+++ b/src/platform/Linux/BUILD.gn
@@ -65,8 +65,6 @@ static_library("Linux") {
     "Logging.cpp",
     "NetworkCommissioningDriver.h",
     "NetworkCommissioningEthernetDriver.cpp",
-    "NetworkCommissioningThreadDriver.cpp",
-    "NetworkCommissioningWiFiDriver.cpp",
     "PlatformManagerImpl.cpp",
     "PlatformManagerImpl.h",
     "PosixConfig.cpp",

sstruchtrup avatar May 19 '22 15:05 sstruchtrup

Then I configure the example lighting app (linux):

gn gen out/host --args="target_os=\"linux\"
        chip_inet_config_enable_ipv4=false
        chip_enable_openthread=false
        chip_enable_ota_requestor=false
        chip_config_network_layer_ble=false
        chip_mdns=\"minimal\"
        chip_enable_ble=false
        chip_enable_wifi=false
        treat_warnings_as_errors=false"

sstruchtrup avatar May 19 '22 15:05 sstruchtrup

Initial pairing:

chip-tool pairing ethernet 1 20202021 3840 fc00:832::1 5540 

Then I use the following script for the additional pairings:

#!/bin/bash

# Change this to the chip-tool you are using
chip_tool=chip-tool-te9

fail()
{
	echo "$*"
	exit 1
}

pair()
{
	n=$1

	(${chip_tool} administratorcommissioning open-basic-commissioning-window 500 1 0  --timedInteractionTimeoutMs 1000 > ocw-${n}.log) || fail "Open commissioning window failed"

	let id=n+10

	(${chip_tool} pairing onnetwork 1 20202021 --commissioner-name ${id} > comm-${n}.log)|| fail "commissioning failed"
}

for s in `seq 1 20`; 
do
	echo "---> ${s}"
	pair $s || exit 1
done

sstruchtrup avatar May 19 '22 15:05 sstruchtrup

The 6th additional pairing failed. I think I observed 9 possible pairings without IPv6.

sstruchtrup avatar May 19 '22 15:05 sstruchtrup

Retested against today's master (45f623b73cbdc4a304579882aeb00163f28fd192) with verbose mDNS logging. This somehow fails on the 5th additional pairing.

Full logs for the lighting-app: log.txt

Failed commissioning starts at timestamp 1652975447.

sstruchtrup avatar May 19 '22 15:05 sstruchtrup

I can get a pcap tomorrow if needed.

sstruchtrup avatar May 19 '22 15:05 sstruchtrup

Logs seem to say:

[1652975447.224246][726359:726359] CHIP:DIS: MINMDNS: received SRV record for 0EB170FCE8A4D304-0000000000000001._matter._tcp.local
[1652975447.224248][726359:726359] CHIP:DIS: Insufficient parsers to process all SRV entries.

andy31415 avatar May 19 '22 16:05 andy31415

For some reason MDNS packets are received with several operational data answers. We limit parallel processing to 2, so operational SRV records fail beyond that. Not sure if there is an ordering issue or something else.

andy31415 avatar May 19 '22 16:05 andy31415

What is unclear to me is what queries are being made - I see MINMDNS reporting received packets, however I never see a log of "MINMDNS: Sending query" which I would expect to see what chip-tool is asking for (and to figure out why it gets so many replies ... because maybe we are asking for the wrong thing).

andy31415 avatar May 19 '22 17:05 andy31415

Unassigning this to give priority to cert blockers for now.

andy31415 avatar May 19 '22 17:05 andy31415

Not observed anymore on the pre-SVE commit 425f1fb916ca154deeb75cd82cd5196778aba152.

sstruchtrup avatar Jun 20 '22 07:06 sstruchtrup

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot] avatar Apr 25 '23 18:04 stale[bot]

This stale issue has been automatically closed. Thank you for your contributions.

stale[bot] avatar May 08 '23 21:05 stale[bot]