connectedhomeip
connectedhomeip copied to clipboard
Binding manager thinks it can connect to multiple things in parallel, but it can't
Problem
BindingManager::EstablishConnection does:
mInitParams.mCASESessionManager->FindOrEstablishSession(nodeId, &mOnConnectedCallback, &mOnConnectionFailureCallback);
This will move the mOnConnectedCallback and mOnConnectionFailureCallback callbacks to the callback list of the new session establishment process, if there is no session already.
We call BindingManager::NotifyBoundClusterChanged in a loop over bindings in BindingManager::NotifyBoundClusterChanged. This means we will only get notified for the last connection we started, not any of the other ones.
In many cases this might be getting covered up by connection attempts that complete synchronously, because we have pre-warmed CASE sessions... But it can lead to messages not being sent to some bindings if we have no pre-warmed sessions around.
Proposed Solution
Move the connection callbacks to somewhere that is per-connection-attempt. Perhaps the pending notification? Or some other data structure?
Why is this labeled 'spec'?
@bzbarsky-apple ?
Because:
But it can lead to messages not being sent to some bindings
which would be a spec violation, no?
Had a quick test on linux platform. If we intitiate two concurrent CASE sessions, the second connection will get stuck at waiting for Sigma1 msg.
I'd suggest to add a queue for the connections in the CASESessionManager.
@gjc13 can you please upload a sample of your test. I might be able to take a look sometime this week, it will be helpful to have a test in hand to reproduce the issue
Had a quick test on linux platform. If we intitiate two concurrent CASE sessions, the second connection will get stuck at
waiting for Sigma1 msg.I'd suggest to add a queue for the connections in the CASESessionManager.
This does not line up with my understanding of this issue at all. CASESessionManager is capable of handling multiple connections at one. You can totally call FindOrEstablishSession twice and it should resolve without issue and call both callbacks should be called when the connection is established. To me this sounds like potentially a completely different issue that would need to be addressed even before this one.
From my understanding the issue here is that when BindingManager::NotifyBoundClusterChanged is called twice with the same mPendingNotificationMap.AddPendingNotification we overwrite the entry since PendingNotificationMap::AddPendingNotification first removes the entry. The issue is that we don't allow multiple entries in PendingNotificationMap
@tehampson : @gjc13 has left the Matter project. So I wouldn't expect a response from him.
Chatted with @tehampson about this. He's going to run some experiments to see if this issue does indeed exist.
Just want to quickly touch on waiting for Sigma1 msg. This message is perfectly natural. We always have a CASEClient that is ready to receive the first Sigma1 message sent by a peer node. This is a none issue.
Based on discussion with Jerry, arguably when someone calls PendingNotificationMap::AddPendingNotification with a new context the right things to do is to overwrite the entry so this is not a problem.
Based on a code audit there is no issue with calling BindingManager::EstablishConnection from loop in BindingManager::NotifyBoundClusterChanged for the following reasons:
- If sessions is not already primed there will be multiple entries in
mPendingNotificationMap(since the index will be different)- When
BindingManager::HandleDeviceConnectedis eventual called upon session establishing, we iterate through all the pending entries inmPendingNotificationMap. Outside of the for loop we clear all the entries inmPendingNotificationMapassociated with the ScopedNodeId. - Even if
BindingManager::HandleDeviceConnectedis called a second time there is nothing in themPendingNotificationMapthe second time around.
- When
- If session is already primed, there was already deemed to be no issue (I have confirmed during audit but don't want to spend the effort explaining it here).
To see if there really is an issue in practice I added two hacks:
src/app/clusters/bindings/bindings.cppto add an entry twice inCreateBindingEntry- In
BindingManager::NotifyBoundClusterChangedcreated a hack to mark a session as defunct so to test out the sequence outlined by Boris.
This sequence I did (I am documenting this for my own future reference):
- Build all clusters app with command
scripts/examples/gn_build_example.sh examples/all-clusters-app/linux out/debug/standalone chip_config_network_layer_ble=false chip_build_libshell=true Terminal 1runrm -rf /tmp/chip_* && ./out/debug/standalone/chip-all-clusters-app --secured-device-port 72564(port was picked by mashing keys)Terminal 2run./out/debug/standalone/chip-tool pairing onnetwork 1 20202021Terminal 3runrm -f kvs1 && out/debug/standalone/chip-all-clusters-app --KVS kvs1Terminal 2run./out/debug/standalone/chip-tool pairing onnetwork 2 20202021Terminal 2run./out/debug/standalone/chip-tool binding write binding '[{"fabricIndex": 1, "node": 1, "endpoint": 1, "cluster": 6}]' 2 1Terminal 3runswitch onTerminal 1confirm it received notification
What I discovered is that we only ever call BindingManager::EstablishConnection once from loop in BindingManager::NotifyBoundClusterChanged because SuccessOrExit(error == CHIP_NO_ERROR); should actually be SuccessOrExit(error);
While auditing code for this issue I have spotted some use after free and a buffer overrun. So I am going to add these fixes as well to the issue mentioned above. PR should be up soon
@tehampson I'd recommend trying with at least two instances of the 'bulb' that the switch is controlling, and to then have two entries in the bindings cluster pointing to two different nodes.
Also, I'd recommend trying with both the model where the switch boots up first, fails to connect to the two bulbs, THEN, launch the two bulbs (to prevent the initial CASE session establishment from succeeding so that we can observe the establishment process at the time the switch is flipped), as well as the model where the bulbs boot up first, then the switch to observe warmed CASE session usagse.
Okay so when I try talking to two different nodes that are not primed I am getting the issue. The issue has to do with mOnConnectedCallback and mOnConnectionFailureCallback that are provided to FindOrEstablishSession when you call it multiple times waiting for the callback, you have BindingManager::HandleDeviceConnected only called from the last one registered. This is what Boris was already saying, but I don't think I understood that part until now :man_facepalming:
Right, the key is that a Callback can only be in a single list, so adding mOnConnectedCallback and mOnConnectionFailureCallback to the list for the new session establishment removes them from the old one, and hence they don't get notifications for that one....