connectedhomeip
connectedhomeip copied to clipboard
Binding manager thinks it can connect to multiple things in parallel, but it can't
Problem
BindingManager::EstablishConnection
does:
mInitParams.mCASESessionManager->FindOrEstablishSession(nodeId, &mOnConnectedCallback, &mOnConnectionFailureCallback);
This will move the mOnConnectedCallback
and mOnConnectionFailureCallback
callbacks to the callback list of the new session establishment process, if there is no session already.
We call BindingManager::NotifyBoundClusterChanged
in a loop over bindings in BindingManager::NotifyBoundClusterChanged
. This means we will only get notified for the last connection we started, not any of the other ones.
In many cases this might be getting covered up by connection attempts that complete synchronously, because we have pre-warmed CASE sessions... But it can lead to messages not being sent to some bindings if we have no pre-warmed sessions around.
Proposed Solution
Move the connection callbacks to somewhere that is per-connection-attempt. Perhaps the pending notification? Or some other data structure?
Why is this labeled 'spec'?
@bzbarsky-apple ?
Because:
But it can lead to messages not being sent to some bindings
which would be a spec violation, no?
Had a quick test on linux platform. If we intitiate two concurrent CASE sessions, the second connection will get stuck at waiting for Sigma1 msg
.
I'd suggest to add a queue for the connections in the CASESessionManager.
@gjc13 can you please upload a sample of your test. I might be able to take a look sometime this week, it will be helpful to have a test in hand to reproduce the issue
Had a quick test on linux platform. If we intitiate two concurrent CASE sessions, the second connection will get stuck at
waiting for Sigma1 msg
.I'd suggest to add a queue for the connections in the CASESessionManager.
This does not line up with my understanding of this issue at all. CASESessionManager is capable of handling multiple connections at one. You can totally call FindOrEstablishSession
twice and it should resolve without issue and call both callbacks should be called when the connection is established. To me this sounds like potentially a completely different issue that would need to be addressed even before this one.
From my understanding the issue here is that when BindingManager::NotifyBoundClusterChanged
is called twice with the same mPendingNotificationMap.AddPendingNotification
we overwrite the entry since PendingNotificationMap::AddPendingNotification
first removes the entry. The issue is that we don't allow multiple entries in PendingNotificationMap
@tehampson : @gjc13 has left the Matter project. So I wouldn't expect a response from him.
Chatted with @tehampson about this. He's going to run some experiments to see if this issue does indeed exist.
Just want to quickly touch on waiting for Sigma1 msg
. This message is perfectly natural. We always have a CASEClient
that is ready to receive the first Sigma1 message sent by a peer node. This is a none issue.
Based on discussion with Jerry, arguably when someone calls PendingNotificationMap::AddPendingNotification
with a new context the right things to do is to overwrite the entry so this is not a problem.
Based on a code audit there is no issue with calling BindingManager::EstablishConnection
from loop in BindingManager::NotifyBoundClusterChanged
for the following reasons:
- If sessions is not already primed there will be multiple entries in
mPendingNotificationMap
(since the index will be different)- When
BindingManager::HandleDeviceConnected
is eventual called upon session establishing, we iterate through all the pending entries inmPendingNotificationMap
. Outside of the for loop we clear all the entries inmPendingNotificationMap
associated with the ScopedNodeId. - Even if
BindingManager::HandleDeviceConnected
is called a second time there is nothing in themPendingNotificationMap
the second time around.
- When
- If session is already primed, there was already deemed to be no issue (I have confirmed during audit but don't want to spend the effort explaining it here).
To see if there really is an issue in practice I added two hacks:
-
src/app/clusters/bindings/bindings.cpp
to add an entry twice inCreateBindingEntry
- In
BindingManager::NotifyBoundClusterChanged
created a hack to mark a session as defunct so to test out the sequence outlined by Boris.
This sequence I did (I am documenting this for my own future reference):
- Build all clusters app with command
scripts/examples/gn_build_example.sh examples/all-clusters-app/linux out/debug/standalone chip_config_network_layer_ble=false chip_build_libshell=true
-
Terminal 1
runrm -rf /tmp/chip_* && ./out/debug/standalone/chip-all-clusters-app --secured-device-port 72564
(port was picked by mashing keys) -
Terminal 2
run./out/debug/standalone/chip-tool pairing onnetwork 1 20202021
-
Terminal 3
runrm -f kvs1 && out/debug/standalone/chip-all-clusters-app --KVS kvs1
-
Terminal 2
run./out/debug/standalone/chip-tool pairing onnetwork 2 20202021
-
Terminal 2
run./out/debug/standalone/chip-tool binding write binding '[{"fabricIndex": 1, "node": 1, "endpoint": 1, "cluster": 6}]' 2 1
-
Terminal 3
runswitch on
-
Terminal 1
confirm it received notification
What I discovered is that we only ever call BindingManager::EstablishConnection
once from loop in BindingManager::NotifyBoundClusterChanged
because SuccessOrExit(error == CHIP_NO_ERROR);
should actually be SuccessOrExit(error);
While auditing code for this issue I have spotted some use after free and a buffer overrun. So I am going to add these fixes as well to the issue mentioned above. PR should be up soon
@tehampson I'd recommend trying with at least two instances of the 'bulb' that the switch is controlling, and to then have two entries in the bindings cluster pointing to two different nodes.
Also, I'd recommend trying with both the model where the switch boots up first, fails to connect to the two bulbs, THEN, launch the two bulbs (to prevent the initial CASE session establishment from succeeding so that we can observe the establishment process at the time the switch is flipped), as well as the model where the bulbs boot up first, then the switch to observe warmed CASE session usagse.
Okay so when I try talking to two different nodes that are not primed I am getting the issue. The issue has to do with mOnConnectedCallback
and mOnConnectionFailureCallback
that are provided to FindOrEstablishSession
when you call it multiple times waiting for the callback, you have BindingManager::HandleDeviceConnected
only called from the last one registered. This is what Boris was already saying, but I don't think I understood that part until now :man_facepalming:
Right, the key is that a Callback
can only be in a single list, so adding mOnConnectedCallback
and mOnConnectionFailureCallback
to the list for the new session establishment removes them from the old one, and hence they don't get notifications for that one....