FluidFramework icon indicating copy to clipboard operation
FluidFramework copied to clipboard

Additional container load and system op telemetry

Open ssimic2 opened this issue 2 years ago • 4 comments

During OCE shift we ran into couple of investigation areas where additional telemetry would have been helpful:

  • Connection state. We had two separate live issue where host was timing out indicating container could not "connect". We made an assumption (that turned out to be correct) that container load options were set to delay delta connection and "connect" on the container was never explicitly called.

  • Corruption due to duplicate join/leave ops (same client ID joining or leaving multiple times). We were able to isolate the scenario to particular sequence of ops (finally it was server that was creating dups), but we could not figure out exactly which join/leave op was a duplicate one.

Surfacing telemetry on:

  1. Container load options.
  2. Client IDs on system ops.

ssimic2 avatar Sep 08 '22 00:09 ssimic2

We have mostly added perf telemetry around container/loader. Would it be helpful to add general telemetry on some of the key container APIs? Container/loader seems like a common entry point. For example given that host can load container without connecting, telemetry on connect/disconnect would be helpful when debugging some of the connectivity issues.

ssimic2 avatar Sep 08 '22 00:09 ssimic2

@fluid-example/bundle-size-tests: +104 Bytes
Metric NameBaseline SizeCompare SizeSize Diff
aqueduct.js 397.35 KB 397.35 KB No change
connectionState.js 680 Bytes 680 Bytes No change
containerRuntime.js 196.29 KB 196.29 KB No change
loader.js 153.46 KB 153.56 KB +104 Bytes
map.js 42.89 KB 42.89 KB No change
matrix.js 131.83 KB 131.83 KB No change
odspDriver.js 151.71 KB 151.71 KB No change
odspPrefetchSnapshot.js 39.79 KB 39.79 KB No change
sharedString.js 153.04 KB 153.04 KB No change
Total Size 1.27 MB 1.27 MB +104 Bytes

Baseline commit: bedc5040596f399fe896d5aca7f25c588a493d54

Generated by :no_entry_sign: dangerJS against 9d4039957cfa5462a76438ee5d2660949848eb81

msfluid-bot avatar Sep 08 '22 01:09 msfluid-bot

Duplciate PR? Or maybe for different branch? I've left comments in the other one

vladsud avatar Sep 08 '22 03:09 vladsud

Duplciate PR? Or maybe for different branch? I've left comments in the other one

This should be the only PR. I see your comments here. Thanks for the quick review!

ssimic2 avatar Sep 09 '22 05:09 ssimic2

I'll lost track of this PR. I'm inclined to close it for couple of reasons:

  1. Load options were included as part of the "end" event. During our OCE shift I expected config to be logged on "start", so I made quick assumption we did not have it. So data/telemetry is there, but it was not at an obvious place (for me at least).
  2. It is valid suggestion Vlad made around not baking clientID into error for aggregation purposes. However, to use LoggingError here we would need to move @fluidframework/telemetry-utils our of client packages, and think in general how telemetry from server packages translates to client errors (another topic above on DataCorruptionError vs DataProcessingError)

Lessons learned: Close or complete your old PRs.

ssimic2 avatar Nov 10 '22 17:11 ssimic2