connectedhomeip
connectedhomeip copied to clipboard
Commission failure with chip-tool: Failed to perform commissioning step 18
I add little debug info in log. Really want to know the root cause of this issue. Seems to be related to mDNS but I can't figure it out. Thanks in advance.
Problem
[1653361528.149428][761596:761601] CHIP:DIS: kkk send ipv6 err=../examples/chip-tool/third_party/connectedhomeip/src/inet/UDPEndPointImplSockets.cpp:417: OS Error 0x02000065: Network is unreachable, port=5353 [1653361528.149530][761596:761601] CHIP:DIS: Warning: Attempt to mDNS broadcast failed on docker0: ../examples/chip-tool/third_party/connectedhomeip/src/inet/UDPEndPointImplSockets.cpp:417: OS Error 0x02000065: Network is unreachable [1653361528.149553][761596:761601] CHIP:DIS: mDNS broadcast had only partial success: 3 successes and 1 failures. [1653361536.150372][761596:761601] CHIP:DIS: Checking node lookup status after 15004 ms [1653361536.150434][761596:761601] CHIP:DIS: Operational discovery failed for 0x0000000000BC5C01: ../examples/chip-tool/third_party/connectedhomeip/src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:174: CHIP Error 0x00000032: Timeout [1653361536.150450][761596:761601] CHIP:CTL: OperationalDeviceProxy[1AA8E620B30215AF:0000000000BC5C01]: State change 2 --> 1 [1653361536.150477][761596:761601] CHIP:CTL: Device connection failed. Error ../examples/chip-tool/third_party/connectedhomeip/src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:174: CHIP Error 0x00000032: Timeout [1653361536.150536][761596:761601] CHIP:CTL: Error on commissioning step 'FindOperational': '../examples/chip-tool/third_party/connectedhomeip/src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:174: CHIP Error 0x00000032: Timeout' [1653361536.150558][761596:761601] CHIP:CTL: Failed to perform commissioning step 18
Proposed Solution
I uploaded the entire log for both chip-tool and my board log of Matter's all-cluster-app. Appreciate if anyone can help on it - i'm blocking here for 1+ wks on my new hardware platform(not any of the platforms under src/platform).
nt_board_mdnslog0524.log chip-tool-log-latestVersionModify.txt
It seems that chip-tool start wrong discovery service. Per spec. chapter 5.5 commissioning flows, in step 13 it said "Finalization of the Commissioning process begins. An Administrator configured in the ACL of the Commissionee by the Commissioner SHALL use Operational Discovery to discover the Commissionee". But why my chip-tool starts a wrong discovery with wrong format of only fabric ID (no "-" + node ID)?
[1653361520.911726][761596:761601] CHIP:DIS: MINMDNS: received SRV record for 714DC7B3A6E40F69._matterc._udp.local [1653361520.911801][761596:761601] CHIP:DIS: MINMDNS: Commission SRV for instance 714DC7B3A6E40F69: 303132336161.local
That's commissioning discovery. And that's just the packets we are getting.
The real issue is the operational discovery here:
[1653361521.146675][761596:761601] CHIP:DIS: MINMDNS: Sending query ANY/UNICAST for 1AA8E620B30215AF-0000000000BC5C01._matter._tcp.local
and then that ends up timing out.
The device log does say:
[2022/5/24 11:05:20] chip[DIS]Advertise operational node 1AA8E620B30215AF-0000000000BC5C01
so that part matches; the question is why discovery fails.
@andy31415
Needs more debugging I guess - have to figure out:
- does the device receive the ANY/UNICAST query?
- does the device respond to the ANY/UNICAST query?
- does the controller send a ANY/MULTICAST query if the initial response is not received
- does the controller receive any of the mdns packets (assuming any were sent)?
Generally some logs may help, but if those do not you will need to setup some wireshark/pcap packet captures.
Other ideas: since I see docker0 in errors, I assume you are running on a linux machine for chip-tool (controller). In that case, you may want to check firewalls: I have seen instances where firewalls are blocking 5353 for IPv6, so using IPv6-only chip-tool makes things not work.
So disable firewalls and get a wireshark packet capture (capture port 5353: udp port 5353
, likely on any
for interfaces)
Thanks, @bzbarsky-apple @andy31415 Your advice is very helpful and I'm trying to understand the issue:
- Network data path is ok. No firewall.
- Agree with bzbarsky "The real issue is the operational discovery". From both device/sniffer log, after "CHIP minimal mDNS configured as 'Operational device'" device will broadcast mDNS reply, is this boardcast mNDS right? Not unicast? see below packet analysis of [2022/5/26 16:14:48].
- From below packet analysis, even device received ANY/QU query from chip-tool, the device doesn't respond query response due to "mResponseBuilder.HasPacketBuffer=0". Why? Where should this packet be released?
Packet Analysis based attached sniffer/log. [2022/5/26 16:14:17] chip[DIS]Directly sending mDns reply to peer 192.168.31.162 on port 60359 --> pkt NO.138 [2022/5/26 16:14:17] chip[DIS]Directly sending mDns reply to peer FE80::A28B:2B3B:695A:95C0 on port 48314 --> pkt NO. 142
[2022/5/26 16:14:33] chip[DIS]Broadcasting mDns reply for query from FE80::3231:32FF:FE33:6161 To [2022/5/26 16:14:33] chip[DIS]Broadcasting mDns reply for query from 192.168.31.121 ---> pkt No: 361~366, total 6 packets.
[2022/5/26 16:14:48] chip[DIS]Broadcasting mDns reply for query from FE80::3231:32FF:FE33:6161 --> pkt No. 424 ~ 426, total 3 broadcast packets.
[2022/5/26 16:14:48] chip[DIS]kkk Received an mDNS query from 192.168.31.162 --> from pkt No. 428 The mDNS query is received, but device indicates "mResponseBuilder.HasPacketBuffer()=0". Per log "kkk FlushReply 0 1 UN=1", the first '0' is HasPacketBuffer, the 2nd '1' is HasResponseRecords. 'UN=' means mSendState.SendUnicast.
chip-tool-log-0526mdns.txt nt_board_devcie_0526.log sniffer_0526_mdns.zip
oh... can anyone tell me why in CHIP_ERROR ResponseSender::Respond(), it always go "f (*responder == nullptr)" sending "answer" replies? why mResponders is null from begin to end?
This happens at [2022/5/26 16:14:48] chip[DIS]kkk Received an mDNS query from 192.168.31.162 --> from pkt No. 428.
Seems that my issue is related to "QueryResponderRecordFilter responseFilter". Trying to understand what's the intention of this reply filter and why it doesn't work on my device.
As when I test with SetReplyFilter(nullptr), I can see mDns replies response but the commission still fails because there's no "Operational SRV for fabricID+nodeID".
This failure happens before the next "Sent Sigma1 msg with type=0x30". I can see "OperationalDeviceProxy[1AA8E620B30215AF:0000000000BC5C01]: State change 1 --> 2", but no "State change 2 --> 3".
@andy31415 @bzbarsky-apple would you please help analysis as I'm working on new hardware platform not in any of src/platform. Or where should I ask some help? Any advice is appreciated.
I'm still blocking here. in Respond() in RespondSender.cpp, I found the second "for " will never hit because (*responder)->begin() will be equal to (*responder)->end - it means no finding. But why? what condition do I MISS? The "begin()" is iterator and it finally indicates "mRemaining=0 and mCurrent = nullptr". I also found in previous mDNS response, packet 424/425/426, its additional records has such item which seems to be wrong "xx-xx.". why no specific fabricID+nodeID? XX-XX._matter._tcp.local: type SRV, class IN, priority 0, weight 0, port 5540, target 303132336161.local
@andy31415 Would you please share some comments?
for (auto responder = mResponders.begin(); responder != mResponders.end(); responder++)
{
if (*responder == nullptr)
{
continue;
}
for (auto it = (*responder)->begin(&responseFilter); it != (*responder)->end(); it++)
{ .............never get here
Finally, this issue is fixed and the root cause is some buffer issue which happens on our new hardware platform. I hate C++ as the debugging is not as expected as C language... With some workaround, the commission is completed and chip-tool on/off event can be received. Hope we can move matter on our hardware platform forward. Thanks all.
Finally, this issue is fixed and the root cause is some buffer issue which happens on our new hardware platform. I hate C++ as the debugging is not as expected as C language... With some workaround, the commission is completed and chip-tool on/off event can be received. Hope we can move matter on our hardware platform forward. Thanks all.
i have the same issue, you have fixed the issue?
@mrcao2011 yes. If you are debugging at the same point, I suggest you checking the mDNS response for operation fabricID/nodeID from device for 1st stage because at 2nd stage device will use it do comparison otherwise it would fail.
Finally, this issue is fixed and the root cause is some buffer issue which happens on our new hardware platform. I hate C++ as the debugging is not as expected as C language... With some workaround, the commission is completed and chip-tool on/off event can be received. Hope we can move matter on our hardware platform forward. Thanks all.
Can you specify the buffer issue? what does it mean?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This stale issue has been automatically closed. Thank you for your contributions.