connectedhomeip Binding manager thinks it can connect to multiple things in parallel, but it can't

Problem

BindingManager::EstablishConnection does:

    mInitParams.mCASESessionManager->FindOrEstablishSession(nodeId, &mOnConnectedCallback, &mOnConnectionFailureCallback);

This will move the mOnConnectedCallback and mOnConnectionFailureCallback callbacks to the callback list of the new session establishment process, if there is no session already.

We call BindingManager::NotifyBoundClusterChanged in a loop over bindings in BindingManager::NotifyBoundClusterChanged. This means we will only get notified for the last connection we started, not any of the other ones.

In many cases this might be getting covered up by connection attempts that complete synchronously, because we have pre-warmed CASE sessions... But it can lead to messages not being sent to some bindings if we have no pre-warmed sessions around.

Proposed Solution

Move the connection callbacks to somewhere that is per-connection-attempt. Perhaps the pending notification? Or some other data structure?

Aug 03 '22 19:08 bzbarsky-apple

Why is this labeled 'spec'?

Aug 09 '22 17:08 mrjerryjohns

@bzbarsky-apple ?

Aug 09 '22 18:08 woody-apple

Because:

But it can lead to messages not being sent to some bindings

which would be a spec violation, no?

Aug 09 '22 19:08 bzbarsky-apple

Had a quick test on linux platform. If we intitiate two concurrent CASE sessions, the second connection will get stuck at waiting for Sigma1 msg.

I'd suggest to add a queue for the connections in the CASESessionManager.

Aug 17 '22 06:08 gjc13

@gjc13 can you please upload a sample of your test. I might be able to take a look sometime this week, it will be helpful to have a test in hand to reproduce the issue

Aug 22 '22 18:08 tehampson

Had a quick test on linux platform. If we intitiate two concurrent CASE sessions, the second connection will get stuck at waiting for Sigma1 msg.

I'd suggest to add a queue for the connections in the CASESessionManager.

This does not line up with my understanding of this issue at all. CASESessionManager is capable of handling multiple connections at one. You can totally call FindOrEstablishSession twice and it should resolve without issue and call both callbacks should be called when the connection is established. To me this sounds like potentially a completely different issue that would need to be addressed even before this one.

From my understanding the issue here is that when BindingManager::NotifyBoundClusterChanged is called twice with the same mPendingNotificationMap.AddPendingNotification we overwrite the entry since PendingNotificationMap::AddPendingNotification first removes the entry. The issue is that we don't allow multiple entries in PendingNotificationMap

Aug 23 '22 17:08 tehampson

@tehampson : @gjc13 has left the Matter project. So I wouldn't expect a response from him.

Aug 23 '22 17:08 franck-apple

Chatted with @tehampson about this. He's going to run some experiments to see if this issue does indeed exist.

Aug 23 '22 21:08 mrjerryjohns

Just want to quickly touch on waiting for Sigma1 msg. This message is perfectly natural. We always have a CASEClient that is ready to receive the first Sigma1 message sent by a peer node. This is a none issue.

Based on discussion with Jerry, arguably when someone calls PendingNotificationMap::AddPendingNotification with a new context the right things to do is to overwrite the entry so this is not a problem.

Based on a code audit there is no issue with calling BindingManager::EstablishConnection from loop in BindingManager::NotifyBoundClusterChanged for the following reasons:

If sessions is not already primed there will be multiple entries in mPendingNotificationMap (since the index will be different)
- When BindingManager::HandleDeviceConnected is eventual called upon session establishing, we iterate through all the pending entries in mPendingNotificationMap. Outside of the for loop we clear all the entries in mPendingNotificationMap associated with the ScopedNodeId.
- Even if BindingManager::HandleDeviceConnected is called a second time there is nothing in the mPendingNotificationMap the second time around.
If session is already primed, there was already deemed to be no issue (I have confirmed during audit but don't want to spend the effort explaining it here).

To see if there really is an issue in practice I added two hacks:

src/app/clusters/bindings/bindings.cpp to add an entry twice in CreateBindingEntry
In BindingManager::NotifyBoundClusterChanged created a hack to mark a session as defunct so to test out the sequence outlined by Boris.

This sequence I did (I am documenting this for my own future reference):

Build all clusters app with command scripts/examples/gn_build_example.sh examples/all-clusters-app/linux out/debug/standalone chip_config_network_layer_ble=false chip_build_libshell=true
Terminal 1 run rm -rf /tmp/chip_* && ./out/debug/standalone/chip-all-clusters-app --secured-device-port 72564 (port was picked by mashing keys)
Terminal 2 run ./out/debug/standalone/chip-tool pairing onnetwork 1 20202021
Terminal 3 run rm -f kvs1 && out/debug/standalone/chip-all-clusters-app --KVS kvs1
Terminal 2 run ./out/debug/standalone/chip-tool pairing onnetwork 2 20202021
Terminal 2 run ./out/debug/standalone/chip-tool binding write binding '[{"fabricIndex": 1, "node": 1, "endpoint": 1, "cluster": 6}]' 2 1
Terminal 3 run switch on
Terminal 1 confirm it received notification

What I discovered is that we only ever call BindingManager::EstablishConnection once from loop in BindingManager::NotifyBoundClusterChanged because SuccessOrExit(error == CHIP_NO_ERROR); should actually be SuccessOrExit(error);

While auditing code for this issue I have spotted some use after free and a buffer overrun. So I am going to add these fixes as well to the issue mentioned above. PR should be up soon

Aug 24 '22 15:08 tehampson

@tehampson I'd recommend trying with at least two instances of the 'bulb' that the switch is controlling, and to then have two entries in the bindings cluster pointing to two different nodes.

Also, I'd recommend trying with both the model where the switch boots up first, fails to connect to the two bulbs, THEN, launch the two bulbs (to prevent the initial CASE session establishment from succeeding so that we can observe the establishment process at the time the switch is flipped), as well as the model where the bulbs boot up first, then the switch to observe warmed CASE session usagse.

Aug 24 '22 17:08 mrjerryjohns

Okay so when I try talking to two different nodes that are not primed I am getting the issue. The issue has to do with mOnConnectedCallback and mOnConnectionFailureCallback that are provided to FindOrEstablishSession when you call it multiple times waiting for the callback, you have BindingManager::HandleDeviceConnected only called from the last one registered. This is what Boris was already saying, but I don't think I understood that part until now :man_facepalming:

Aug 24 '22 18:08 tehampson

Right, the key is that a Callback can only be in a single list, so adding mOnConnectedCallback and mOnConnectionFailureCallback to the list for the new session establishment removes them from the old one, and hence they don't get notifications for that one....

Aug 26 '22 19:08 bzbarsky-apple

connectedhomeip connectedhomeip copied to clipboard

Binding manager thinks it can connect to multiple things in parallel, but it can't

Problem

Proposed Solution

connectedhomeip
connectedhomeip copied to clipboard