FluidFramework Additional container load and system op telemetry

During OCE shift we ran into couple of investigation areas where additional telemetry would have been helpful:

Connection state. We had two separate live issue where host was timing out indicating container could not "connect". We made an assumption (that turned out to be correct) that container load options were set to delay delta connection and "connect" on the container was never explicitly called.
Corruption due to duplicate join/leave ops (same client ID joining or leaving multiple times). We were able to isolate the scenario to particular sequence of ops (finally it was server that was creating dups), but we could not figure out exactly which join/leave op was a duplicate one.

Surfacing telemetry on:

Container load options.
Client IDs on system ops.

Sep 08 '22 00:09 ssimic2

We have mostly added perf telemetry around container/loader. Would it be helpful to add general telemetry on some of the key container APIs? Container/loader seems like a common entry point. For example given that host can load container without connecting, telemetry on connect/disconnect would be helpful when debugging some of the connectivity issues.

Sep 08 '22 00:09 ssimic2

⯅ @fluid-example/bundle-size-tests: +104 Bytes

Metric Name	Baseline Size	Compare Size	Size Diff
aqueduct.js	397.35 KB	397.35 KB	■ No change
connectionState.js	680 Bytes	680 Bytes	■ No change
containerRuntime.js	196.29 KB	196.29 KB	■ No change
loader.js	153.46 KB	153.56 KB	⯅ +104 Bytes
map.js	42.89 KB	42.89 KB	■ No change
matrix.js	131.83 KB	131.83 KB	■ No change
odspDriver.js	151.71 KB	151.71 KB	■ No change
odspPrefetchSnapshot.js	39.79 KB	39.79 KB	■ No change
sharedString.js	153.04 KB	153.04 KB	■ No change
Total Size	1.27 MB	1.27 MB	⯅ +104 Bytes

Baseline commit: bedc5040596f399fe896d5aca7f25c588a493d54

Generated by :no_entry_sign: dangerJS against 9d4039957cfa5462a76438ee5d2660949848eb81

Sep 08 '22 01:09 msfluid-bot

Duplciate PR? Or maybe for different branch? I've left comments in the other one

Sep 08 '22 03:09 vladsud

Duplciate PR? Or maybe for different branch? I've left comments in the other one

This should be the only PR. I see your comments here. Thanks for the quick review!

Sep 09 '22 05:09 ssimic2

I'll lost track of this PR. I'm inclined to close it for couple of reasons:

Load options were included as part of the "end" event. During our OCE shift I expected config to be logged on "start", so I made quick assumption we did not have it. So data/telemetry is there, but it was not at an obvious place (for me at least).
It is valid suggestion Vlad made around not baking clientID into error for aggregation purposes. However, to use LoggingError here we would need to move @fluidframework/telemetry-utils our of client packages, and think in general how telemetry from server packages translates to client errors (another topic above on DataCorruptionError vs DataProcessingError)

Lessons learned: Close or complete your old PRs.

Nov 10 '22 17:11 ssimic2

FluidFramework FluidFramework copied to clipboard

Additional container load and system op telemetry

FluidFramework
FluidFramework copied to clipboard