FluidFramework icon indicating copy to clipboard operation
FluidFramework copied to clipboard

Propagate disconnected reason during 'disconnected' event

Open vivekmogalapalli opened this issue 2 years ago • 5 comments

Description

The container emits a disconnected event in event of a connection loss but it doesn't emit any additional details about why the connection was disconnected. This information is already available in the container in the form of reason and is also logged in Fluid Framework's telemetry but isn't accessible to the host that integrates with the Fluid Framework.

This change propagates reason when a disconnected event is raised by the container.

Reviewer Guidance

We initially started off with passing the reason as an extra parameter through each of the methods being called as we progressed through the stack Here is the draft PR for the alternative approach.

The changes in the initial approach above were much more minor but it contrasted with a current design pattern. Currently, the ConnectionState is encapsulated within ConnectionStateHandler and the Container queries the ConnectionStateHandler to get the latest connection state in event of a request to propagateConnectionState(). [Link to the specific location where this happens]

Pair programmed with @markfields and introduced ConnectionStateDetails to ensure we follow the same design pattern.

Does this introduce a breaking change?

No

Other information or known dependencies

  • The Microsoft Loop team is relying on this change to measure and improve connectivity reliability across Microsoft Loop experiences
  • The next steps here would be to instrument in the host experience cases where we don't restore connection within an appropriate time and keep track of the disconnected reason that contribute to the loss of connection

vivekmogalapalli avatar Jul 23 '22 03:07 vivekmogalapalli

@fluid-example/bundle-size-tests: +308 Bytes
Metric NameBaseline SizeCompare SizeSize Diff
aqueduct.js 387.49 KB 387.49 KB +4 Bytes
connectionState.js 680 Bytes 680 Bytes No change
containerRuntime.js 191.6 KB 191.61 KB +9 Bytes
loader.js 151.52 KB 151.81 KB +295 Bytes
map.js 42.67 KB 42.67 KB No change
matrix.js 127.15 KB 127.15 KB No change
odspDriver.js 149.26 KB 149.26 KB No change
odspPrefetchSnapshot.js 38.31 KB 38.31 KB No change
sharedString.js 147.79 KB 147.79 KB No change
Total Size 1.24 MB 1.24 MB +308 Bytes

Baseline commit: 38d97ef28e919b42579a3b897f70a7060b599275

Generated by :no_entry_sign: dangerJS against b8b359ca55046c9115997e03f12b2c2a389a1271

msfluid-bot avatar Jul 23 '22 03:07 msfluid-bot

@vladsud @scottn12 - Can you please take a look at this and help Vivek get this in? Thanks!

markfields avatar Jul 24 '22 14:07 markfields

I'm not sure I like this approach - I think you are overcomplicating it. I think it's Ok not to have a reason for disconnected state when container just loads (i.e. this.propagateConnectionState() calls at the end of boot sequence). If that's the case, then propagating disconnect reason for all disconnects becomes trivial and does not need this change. All you need is something I'm doing here (this is on top of Mark's PR that is already merged, i.e. only last commit is interesting) -https://github.com/microsoft/FluidFramework/pull/10958 Basically merging connectionStateChanged and logConnectionStateChangeTelemetry callbacks together, and you get a reason to propagate!

If I got it right (and it addresses you concerns), I'd rather go this way, as it's much simpler, does not add new concepts / tracking.

vladsud avatar Jul 27 '22 15:07 vladsud

@vladsud - Let me chime in and respond to your overall feedback.

I'm not sure I like this approach - I think you are overcomplicating it.

If this is overcomplicated, I'm to blame :)

Basically merging connectionStateChanged and logConnectionStateChangeTelemetry callbacks together, and you get a reason to propagate!

I agree we could merge those two, but let's set that aside since it's extra beyond what Vivek needs.

I think what you're saying is that we can easily just pass reason up through Container.propagateConnectionState via connectionStateChanged callback. This simple solution is precisely what Vivek started with - see #11261.

When he reviewed that privately with me, here was my concern that lead me to recommend this more complex solution -- If you look at Container.propagateConnectionState in that other PR, you'll notice that state and reason are decoupled. state is read off ConnectionStateHandler, but reason is passed in. In practice, it's fine, they are correlated. But I thought the approach in this PR was more rigorous/correct - it keeps state and reason coupled together in one object.

If we want to set aside this design concern, then I'm fine with the simpler PR. In fact I'm signing off on it now. @vivekmogalapalli if you don't mind waiting for Vlad to circle back and respond to my comments here, then you can decide which route to take based on his feedback.

Thanks for jumping in and making this change! Definitely good to plumb this info through!

markfields avatar Aug 12 '22 21:08 markfields

@markfields, @vivekmogalapalli, if you really want to go with keeping disconnected state around, then I'd suggest decoupling it out of ConnectionState into separate variable - same as we track clientId. Notice that we have ConnectionState.EstablishingConnection (unused, but I assume we want to use it). There might be more states added in the future. It will be pain in the butt to maintain such complex data structures / transitions. It's much easier to say - we always keep disconnect reason, and reset it on disconnect event, and communicate on these 2-3 events.

While being very loosely related, worth pointing out that I raised https://github.com/microsoft/FluidFramework/pull/11895 that merges ConnectionStateHandler callbacks and communicates disconnect reason. If I were you, I'd go with very simple solution - not keeping any extra state, adding reason to disconnect event, and not care about having right reason at the end of boot sequence (this only matters if we were able to connect and disconnect while loading, which is not applicable to Bohemia that delays connectiosn).

vladsud avatar Sep 09 '22 06:09 vladsud

Thanks for the feedback everyone. I'm closing out this PR in favor of the simpler PR since this change is primarily for telemetry purposes and to keep changes to a minimum.

vivekmogalapalli avatar Sep 26 '22 19:09 vivekmogalapalli