connectedhomeip icon indicating copy to clipboard operation
connectedhomeip copied to clipboard

Commissioning timeout not reported to application

Open caipiblack opened this issue 3 years ago • 10 comments

Problem

Commit: b0af6ba42b6ce494748f18ecf3e054a400b8f45b (Tue Jan 4 02:17:19 2022 +0100)

When I start a BLE-WIFI commissioning with incorrect ssid or password, there is no timeout or callback triggered in application side and the commissioning sessions is not cleaned.

Code:

        CommissioningParameters params;
         params.SetWifiCredentials(WifiCredentials(
                                chip::ByteSpan(.....), chip::ByteSpan(...)));
        SetupPayload payload;
        QRCodeSetupPayloadParser(qrCode).populatePayload(payload);

        RendezvousParameters rendezvous;
        rendezvous.SetDiscriminator(payload.discriminator);
        rendezvous.SetSetupPINCode(payload.setUpPINCode);
        error = commissioner.PairDevice(mNodeId, rendezvous, params);

When the ssid and password are correct, everything is good. When the ssid and password are wrong the device try to connect to WiFi network and fail, on the commissioner side, OnSessionEstablishmentTimeout() is never called, so StopPairing() and mPairingDelegate->OnPairingComplete(CHIP_ERROR_TIMEOUT) are never called.

The result is:

  1. The application don't know about the timeout (as the callback void OnPairingComplete(CHIP_ERROR err) is not called)
  2. As StopPairing() is also never called FreeRendezvousSession() is never called, so we can't start another pairing until the end of the first one

Some kind of timeout is detected because BLE connexion is closed but the application is not notified and the "RendezvousSession" is not cleaned.

Here some logs:

ble_wifi_wrong_ssid.log

caipiblack avatar Feb 22 '22 13:02 caipiblack

Note that the problem of the FreeRendezvousSession() which is not called, also happens when the commissioning failed for another reason (ex: BLE failure)

So most of the times, when pairing fails we have to reboot the software/stack/commissioner.

caipiblack avatar Mar 09 '22 09:03 caipiblack

@tehampson Did #17778 fix this by any chance?

bzbarsky-apple avatar May 21 '22 04:05 bzbarsky-apple

What I know is that on TE9 commit, we still have an issue when we reproduce what I said but it's possibly not the same problem!

  • Start commissioning a device with wrong SSID or PASSWORD
  • The stack notify a TIMEOUT, during the commissioning stage 16 (kWiFiNetworkEnable) (ok)
  • Wait some seconds and the BLE is disabled
  • Then reset the device and re-try to commission the device
  • We get this kind of error:
[1654084689.145597][26178:26178] CHIP:-: ../../src/controller/CHIPDeviceController.cpp:679: CHIP Error 0x00000003: Incorrect state at ../../src/controller/SetUpCodePairer.cpp:232

caipiblack avatar Jun 01 '22 12:06 caipiblack

That error happens if someone calls DeviceCommissioner::Commission for a device id that we don't have a PASE session to and aren't trying to set up a PASE session for.

@caipiblack is your code still calling into this API:

CHIP_ERROR DeviceCommissioner::PairDevice(NodeId remoteDeviceId, RendezvousParameters & rendezvousParams,
                                          CommissioningParameters & commissioningParams)

? Are you seeing "Invalid device for commissioning" before that error? In general, attaching (not pasting, please) a log would be useful here.

bzbarsky-apple avatar Jun 02 '22 05:06 bzbarsky-apple

This is how the commissioning is made in my application:

CHIP_ERROR CommandPairing::PairWithCode(NodeId remoteId)
{
    CommissioningParameters commissioningParams = GetCommissioningParameters();
    return CurrentCommissioner().PairDevice(remoteId, mOnboardingPayload, commissioningParams);
}

Yes when I re-start the commissioning I get the error Invalid device for commissioning 0000000000000002

Note that the NodeId is incremented between the two tries and the device is resetted.

Attached you will find full logs: logs-double-commissioning.txt

And here the logs of the second comissioning:

INFO: RX: {"id":0,"jsonrpc":"2.0","method":"matter","params":{"params":{"payload":"MT:-24J042C00KA0648G00","ssid":"TP_9DEA","password":"1234567890"},"cluster":"pairing","command":"qrcode"}}
INFO: TX: {"id":0,"jsonrpc":"2.0","result":{"status":0}}
INFO: Executing command (0)...
DEBUG: CommandPairing::RunCommand(mSSID='TP_9DEA', mPassword='1234567890', mOperationalDataset='???', mOnboardingPayload='MT:-24J042C00KA0648G00')
[1654155261.573464][6987:6987] CHIP:CTL: Setting wifi credentials from parameters
[1654155261.573469][6987:6987] CHIP:CTL: Setting attestation nonce to random value
[1654155261.573480][6987:6987] CHIP:CTL: Setting CSR nonce to random value
[1654155261.573498][6987:6987] CHIP:CTL: Starting commissioning discovery over BLE
[1654155261.573502][6987:6987] CHIP:CTL: Starting commissioning discovery over DNS-SD
[1654155261.573577][6987:6987] CHIP:DIS: Warning: Attempt to mDNS broadcast failed on enp1s0:  ../../src/inet/UDPEndPointImplSockets.cpp:417: OS Error 0x02000065: Network is unreachable
[1654155261.573701][6987:6987] CHIP:DIS: mDNS broadcast had only partial success: 5 successes and 1 failures.
[1654155261.578591][6987:6994] CHIP:BLE: BLE removing known devices.
[1654155261.579031][6987:6994] CHIP:BLE: BLE initiating scan.
[1654155262.170613][6987:6994] CHIP:BLE: Device 35:A3:94:E8:64:89 does not look like a CHIP device.
[1654155262.171921][6987:6994] CHIP:BLE: Device 5C:9C:09:BF:39:7B does not look like a CHIP device.
[1654155262.175959][6987:6994] CHIP:BLE: Device E6:C9:F8:09:50:29 does not look like a CHIP device.
[1654155262.184222][6987:6994] CHIP:BLE: Device 78:7F:87:8B:8C:D9 does not look like a CHIP device.
[1654155262.186234][6987:6994] CHIP:BLE: New device scanned: 7E:1E:1E:79:4F:F7
[1654155262.186254][6987:6994] CHIP:BLE: Device discriminator match. Attempting to connect.
[1654155262.198361][6987:6994] CHIP:BLE: Scan complete notification without an active scan.
[1654155262.305227][6987:6994] CHIP:DL: ConnectDevice complete
[1654155262.573471][6987:6987] CHIP:DIS: Warning: Attempt to mDNS broadcast failed on enp1s0:  ../../src/inet/UDPEndPointImplSockets.cpp:417: OS Error 0x02000065: Network is unreachable
[1654155262.573965][6987:6987] CHIP:DIS: mDNS broadcast had only partial success: 4 successes and 1 failures.
[1654155263.125257][6987:6994] CHIP:DL: Char1 /org/bluez/hci0/dev_7E_1E_1E_79_4F_F7/service0006
[1654155263.125293][6987:6994] CHIP:DL: Char1 /org/bluez/hci0/dev_7E_1E_1E_79_4F_F7/service000a
[1654155263.125310][6987:6994] CHIP:DL: Char1 /org/bluez/hci0/dev_7E_1E_1E_79_4F_F7/service0006
[1654155263.125316][6987:6994] CHIP:DL: Char1 /org/bluez/hci0/dev_7E_1E_1E_79_4F_F7/service000a
[1654155263.125326][6987:6994] CHIP:DL: Char1 /org/bluez/hci0/dev_7E_1E_1E_79_4F_F7/service0006
[1654155263.125332][6987:6994] CHIP:DL: Char1 /org/bluez/hci0/dev_7E_1E_1E_79_4F_F7/service000a
[1654155263.125349][6987:6994] CHIP:DL: Char1 /org/bluez/hci0/dev_7E_1E_1E_79_4F_F7/service000a
[1654155263.125356][6987:6994] CHIP:DL: Char1 /org/bluez/hci0/dev_7E_1E_1E_79_4F_F7/service000a
[1654155263.125369][6987:6994] CHIP:DL: Char1 /org/bluez/hci0/dev_7E_1E_1E_79_4F_F7/service000a
[1654155263.125375][6987:6994] CHIP:DL: Char1 /org/bluez/hci0/dev_7E_1E_1E_79_4F_F7/service000a
[1654155263.125408][6987:6994] CHIP:DL: Char1 /org/bluez/hci0/dev_7E_1E_1E_79_4F_F7/service000a
[1654155263.125416][6987:6994] CHIP:DL: Char1 /org/bluez/hci0/dev_7E_1E_1E_79_4F_F7/service000a
[1654155263.125435][6987:6994] CHIP:DL: New BLE connection 0x7f52c403cf70, device 7E:1E:1E:79:4F:F7, path /org/bluez/hci0/dev_7E_1E_1E_79_4F_F7
[1654155263.125589][6987:6987] CHIP:DL: HandlePlatformSpecificBLEEvent 16385
[1654155263.125624][6987:6987] CHIP:CTL: Discovered device to be commissioned over BLE
[1654155263.125633][6987:6987] CHIP:CTL: Stopping commissioning discovery over DNS-SD
[1654155263.125660][6987:6987] CHIP:CTL: Attempting PASE connection to BLE
[1654155263.125671][6987:6987] CHIP:CTL: Stopping commissioning discovery over DNS-SD
DEBUG: CommandPairing::OnPairingComplete(error=0)
[1654155263.125683][6987:6987] CHIP:TOO: Pairing Success
[1654155263.125690][6987:6987] CHIP:CTL: Invalid device for commissioning 0000000000000002
[1654155263.125710][6987:6987] CHIP:-: ../../src/controller/CHIPDeviceController.cpp:679: CHIP Error 0x00000003: Incorrect state at ../../src/controller/SetUpCodePairer.cpp:232
[1654155264.574502][6987:6987] CHIP:DIS: Warning: Attempt to mDNS broadcast failed on enp1s0:  ../../src/inet/UDPEndPointImplSockets.cpp:417: OS Error 0x02000065: Network is unreachable
[1654155264.574998][6987:6987] CHIP:DIS: mDNS broadcast had only partial success: 4 successes and 1 failures.
[1654155268.575997][6987:6987] CHIP:DIS: Warning: Attempt to mDNS broadcast failed on enp1s0:  ../../src/inet/UDPEndPointImplSockets.cpp:417: OS Error 0x02000065: Network is unreachable
[1654155268.576552][6987:6987] CHIP:DIS: mDNS broadcast had only partial success: 4 successes and 1 failures.
[1654155276.581028][6987:6987] CHIP:DIS: Warning: Attempt to mDNS broadcast failed on enp1s0:  ../../src/inet/UDPEndPointImplSockets.cpp:417: OS Error 0x02000065: Network is unreachable
[1654155276.581578][6987:6987] CHIP:DIS: mDNS broadcast had only partial success: 4 successes and 1 failures.
[1654155292.582008][6987:6987] CHIP:DIS: Timeout waiting for mDNS resolution.
INFO: OnActionTerminated: (0) isPersistent: false, payload: 'null', result: (50) /home/mathieu/Projects/OTODO/matter-daemon/src/Matter/Commands/Common/CommandBase.cpp:11: CHIP Error 0x00000032: Timeout
INFO: TX: {"jsonrpc":"2.0","method":"on-command-terminated","params":{"chipError":{"message":"/home/mathieu/Projects/OTODO/matter-daemon/src/Matter/Commands/Common/CommandBase.cpp:11: CHIP Error 0x00000032: Timeout","status":50},"id":0,"persistent":false,"status":1}}
INFO: Command (0) terminated on error (/home/mathieu/Projects/OTODO/matter-daemon/src/Matter/Commands/Common/CommandBase.cpp:11: CHIP Error 0x00000032: Timeout).
DEBUG: Checking persistent commands, size: 0

Edit: It can be reproduced using linux chip-tool on TE9 commit with lighting-app example for ESP32:

./chip-tool interactive start
... wait for stack init ...
... Reset the device and wait for the device to be ready ...
>>> pairing qrcode-wifi 1 TP_9DEA 8234567890 MT:-24J042C00KA0648G00
... wait for the first commissioning to fail and BLE is stopped ...
... Reset the device and wait for the device to be ready ...
>>> pairing qrcode-wifi 2 TP_9DEA 8234567890 MT:-24J042C00KA0648G00

caipiblack avatar Jun 02 '22 07:06 caipiblack

Edit: It can be reproduced using linux chip-tool on TE9 commit with lighting-app example for ESP32:

Perfect. This indeed reproduces things for me.

So what happens is that we land in DeviceCommissioner::EstablishPASEConnection for the second commissioning attempt. Our peer address is "BLE", which is basically a singleton as far as I can tell. We do FindCommissioneeDevice(peerAddress) and find a non-null thing, left over from the previous commissioning attempt. The reason that's still there is that we failed after AddNOC, so we can't just create a new PASE session to the same device, since it won't let us AddNOC again. So the PASE session remains open even on failure. So far so good.

Then we check current->IsSecureConnected() and get true. Presumably we have no way to actually tell that the BLE session was closed? @tcarmelveilleux do you know who might know about those bits? Anyway, this is the first place where we could reasonably recover, if we knew that in fact this BLE connection is dead.

Since IsSecureConnected() is true, EstablishPASEConnection decides we have a PASE session.

Now we call Commission. This looks for a commissionee device for the given device id (2 in this case). It does not find one, because the one we had before was for device id 1. So it errors out as described above.

Now if on the second commissioning we passed in node id 1, not 2, then we would get past this point but then we still think we have a PASE session, but we actually do not, that would time out.

And if I don't reset the device, so the BLE session is still alive, I get https://github.com/project-chip/connectedhomeip/issues/19138 which needs to get sorted out separately.

Anyway, one obvious thing we should do is drop the existing device in EstablishPASEConnection if the device id does not match, since in that case it's not the device we are looking for. But that will not help with the case when the caller does in fact pass in the same device id again...

bzbarsky-apple avatar Jun 03 '22 06:06 bzbarsky-apple

https://github.com/project-chip/connectedhomeip/pull/19141 makes the "try again with a different device id" case work, at least.

bzbarsky-apple avatar Jun 03 '22 06:06 bzbarsky-apple

Today during the commissioning of a thread device it has failed at the step 18:

../../src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:174: CHIP Error 0x00000032: Timeout

Once we get this error, all subsequent pairing was failing:

../../src/controller/CHIPDeviceController.cpp:679: CHIP Error 0x00000003: Incorrect state at ../../src/controller/SetUpCodePairer.cpp:232

But now I think it's more related to this issue generic issue : https://github.com/project-chip/connectedhomeip/issues/16868

caipiblack avatar Jun 07 '22 18:06 caipiblack

Once we get this error, all subsequent pairing was failing:

When trying to pair with the same node id, or a different one?

Note that the comment in #16868 talks about the "TE9 commit", which does not have the fix from #19141.... Again, that fix should make things work as long as a new node id is used.

bzbarsky-apple avatar Jun 08 '22 01:06 bzbarsky-apple

I can confirm that your fix solve the problem in my case (as I increment the nodeId on each commissioning even if it fails).

caipiblack avatar Jun 09 '22 13:06 caipiblack

Issue Scrub: Looks like this is resolved, closing. Please re-open if this is still an issue.

woody-apple avatar Nov 02 '22 17:11 woody-apple