[Proof-of-concept only] Introduce experimental ConnectionDiagnostics object to give hosts insight into slow reconnects

Open markfields opened this issue 11 months ago • 4 comments

Description

There are many issues that can occur when attempting to connect or reconnect to the ordering service. When FF is "stuck" in the "EstablishingConnection" state for an extended period, a host app has no visibility into why. The same goes for the "CatchingUp" state (e.g. trouble fetching ops to catch up, or being stuck waiting for a join/leave op).

This is an important gap to fill for two reasons:

  • Empowering the app dev to diagnose issues without digging through undocumented/unstable FF diagnostic logs
  • Depending on the reason (e.g. auth) or state (e.g. catchingUp vs. establishingConnection), the host could present different error UX to the user.

Breaking Changes

The proposal is to expose this on an experimental/beta interface, with early adopters casting Container to that interface to get access to it. This will allow us to get feedback and stabilize before locking it down (esp with 2.0 coming).
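
As a rough illustration of the opt-in cast described above, here is a minimal sketch. Every name in it (`IContainerExperimentalSketch`, `ConnectionDiagnosticsSketch`, `connectionDiagnostics`, `tryGetConnectionDiagnostics`) is hypothetical and only stands in for whatever the experimental interface ends up being called; none of them are confirmed FF APIs.

```typescript
// Hypothetical shapes: the proposal does not fix any of these names.
interface ConnectionDiagnosticsSketch {
    // Newest-first list of connection attempts (details elided in this sketch).
    readonly attempts: readonly unknown[];
}

// Stand-in for the experimental/beta interface an early adopter would cast to.
interface IContainerExperimentalSketch {
    readonly connectionDiagnostics?: ConnectionDiagnosticsSketch;
}

// Accepts any container-like value and returns the diagnostics object if present.
function tryGetConnectionDiagnostics(
    container: unknown,
): ConnectionDiagnosticsSketch | undefined {
    if (typeof container !== "object" || container === null) {
        return undefined;
    }
    // The cast is the opt-in step; the property is optional because it may not exist.
    return (container as IContainerExperimentalSketch).connectionDiagnostics;
}
```

Treating the property as optional means a host that casts against an older or non-experimental build simply gets `undefined` rather than a runtime failure.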

Reviewer Guidance

Key files are Container.ts (definition of new API), ConnectionStateHandler.ts (updates the log during state transitions), and ConnectionManager.ts (updates the log during sub-steps during "EstablishingConnection" state).

Please see the sample log to get a picture of how this data will look. Read it from the bottom up. Mid-session, if a host grabbed the log, it would find the current connection attempt / state / step at position [0] in each nested array. It could compare the timestamps to see how long we've been there, or traverse the previous steps to find long durations, suspicious errors, etc.
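
To make the "position [0] is current" reading concrete, here is a minimal sketch of how a host might walk such a log. The entry shapes, field names, and step names below are assumptions made for illustration; the actual shape is whatever the sample log shows and may differ.

```typescript
// Hypothetical entry shapes; each array is ordered newest-first, so [0] is current.
interface StepEntry {
    readonly name: string;
    readonly timestamp: number; // ms since epoch when the step began
    readonly error?: string;
}
interface StateEntry {
    readonly state: string; // e.g. "EstablishingConnection" or "CatchingUp"
    readonly timestamp: number;
    readonly steps: readonly StepEntry[];
}
interface AttemptEntry {
    readonly timestamp: number;
    readonly states: readonly StateEntry[];
}

interface CurrentPosition {
    state: string;
    step: string | undefined;
    msInState: number;
    msInStep: number | undefined;
}

// Reads position [0] at each level and compares timestamps against `now`.
function describeCurrent(
    log: readonly AttemptEntry[],
    now: number,
): CurrentPosition | undefined {
    const attempt = log[0];
    const state = attempt === undefined ? undefined : attempt.states[0];
    if (state === undefined) {
        return undefined;
    }
    const step = state.steps[0];
    return {
        state: state.state,
        step: step === undefined ? undefined : step.name,
        msInState: now - state.timestamp,
        msInStep: step === undefined ? undefined : now - step.timestamp,
    };
}

// A completed step lasted from its own timestamp until the next-newer step began.
function findSlowSteps(steps: readonly StepEntry[], thresholdMs: number): string[] {
    const slow: string[] = [];
    steps.forEach((step, i) => {
        const newer = i > 0 ? steps[i - 1] : undefined;
        if (newer !== undefined && newer.timestamp - step.timestamp > thresholdMs) {
            slow.push(step.name);
        }
    });
    return slow;
}
```

The step at index 0 is still in progress, so `findSlowSteps` skips it; its elapsed time comes from `describeCurrent` instead.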

No plans to merge this as-is

This PR will be closed and can act as a starting point / reference when we get to actually implementing this (once we reach consensus on the design). I expect the high-level design would stick (e.g. shape of the diagnostic log, ConnectionStateHandler's ownership of updating it) but other details will change (exactly what info is plumbed through when/how, what info can be added), and there are plenty of gaps / to-do's.

markfields avatar Mar 27 '24 13:03 markfields

@fluid-example/bundle-size-tests: +894 Bytes
Metric Name | Baseline Size | Compare Size | Size Diff
aqueduct.js | 520.5 KB | 520.5 KB | No change
azureClient.js | 611.64 KB | 611.9 KB | +268 Bytes
connectionState.js | 680 Bytes | 680 Bytes | No change
containerRuntime.js | 254.75 KB | 254.75 KB | No change
fluidFramework.js | 342.62 KB | 342.62 KB | No change
loader.js | 127.97 KB | 128.23 KB | +268 Bytes
map.js | 41.35 KB | 41.35 KB | No change
matrix.js | 143.61 KB | 143.61 KB | No change
odspClient.js | 580.09 KB | 580.4 KB | +312 Bytes
odspDriver.js | 97.49 KB | 97.53 KB | +44 Bytes
odspPrefetchSnapshot.js | 41.91 KB | 41.91 KB | No change
sharedString.js | 161.38 KB | 161.38 KB | No change
sharedTree.js | 332.73 KB | 332.73 KB | No change
Total Size | 3.33 MB | 3.33 MB | +894 Bytes

Baseline commit: 14314df41e49f5f355765d99db03642132e6d392

Generated by dangerJS against 576752bd4f07405dc70b99600283b81973263ec9

msfluid-bot avatar Mar 27 '24 14:03 msfluid-bot

I'd love to start with an end-to-end example of usage, even if it does not compile. That would be much better at demonstrating potential scenarios and needs. What's not clear to me is: do we expect applications to change their behavior based on this data, and if so, in what form? I can totally see a case where they keep the user more informed, telling them that we are not connected due to service issues, lack of network, or something else. I can also see them doing their own logging based on such data. But is that it?

vladsud avatar Apr 09 '24 03:04 vladsud

I agree, an end-to-end demo would be good.

The use case is primarily observability via app telemetry: independence from having to sort through FF logs, and at times getting more info than we currently expose. When the banner shows, they want to build a heuristic for "why" it is taking so long. I'm just trying to give them transparency.

There is a small set of cases where they would adjust UX based on what they find. Examples I've heard are:

  1. If it's in the CatchingUp phase
  2. If the spot where it's stuck is auth-related
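
A minimal sketch of how a host might branch on those two cases follows. It assumes the diagnostics can be reduced to a state plus a stable `category` value, which is not settled in this thread; `StuckInfo`, `StuckCategory`, `chooseBanner`, and the banner strings are all hypothetical.

```typescript
// Hypothetical: assumes a stable category is exposed rather than a freeform string.
type StuckCategory = "auth" | "network" | "service" | "unknown";

interface StuckInfo {
    readonly state: "EstablishingConnection" | "CatchingUp";
    readonly category: StuckCategory;
    readonly msInState: number;
}

// Returns the banner text to show, or undefined while the wait is still short.
function chooseBanner(info: StuckInfo, slowThresholdMs: number): string | undefined {
    if (info.msInState < slowThresholdMs) {
        return undefined;
    }
    if (info.category === "auth") {
        // Case 2: the stuck spot is auth-related, so prompt the user to act.
        return "Please sign in again to reconnect.";
    }
    if (info.state === "CatchingUp") {
        // Case 1: the connection exists but the client is still catching up.
        return "Connected. Loading the latest changes...";
    }
    return "Reconnecting...";
}
```

The same `StuckInfo` value could also be forwarded to the app's own telemetry, which covers the observability use case without any UX change.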

That's an open question here: where do we introduce stable enums that can be coded against vs. freeform strings? TBD based on the feedback loop with partners.

markfields avatar Apr 11 '24 05:04 markfields

Btw @vladsud your specific code/comments were mostly you tripping over NYI parts. Sorry about that. I've cleaned up that part somewhat, enough to illustrate how it would work.

markfields avatar Apr 11 '24 05:04 markfields

This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!