FluidFramework
FluidFramework copied to clipboard
Additional container load and system op telemetry
During OCE shift we ran into couple of investigation areas where additional telemetry would have been helpful:
-
Connection state. We had two separate live issue where host was timing out indicating container could not "connect". We made an assumption (that turned out to be correct) that container load options were set to delay delta connection and "connect" on the container was never explicitly called.
-
Corruption due to duplicate join/leave ops (same client ID joining or leaving multiple times). We were able to isolate the scenario to particular sequence of ops (finally it was server that was creating dups), but we could not figure out exactly which join/leave op was a duplicate one.
Surfacing telemetry on:
- Container load options.
- Client IDs on system ops.
We have mostly added perf telemetry around container/loader. Would it be helpful to add general telemetry on some of the key container APIs? Container/loader seems like a common entry point. For example given that host can load container without connecting, telemetry on connect/disconnect would be helpful when debugging some of the connectivity issues.
⯅ @fluid-example/bundle-size-tests: +104 Bytes
Metric Name | Baseline Size | Compare Size | Size Diff |
---|---|---|---|
aqueduct.js | 397.35 KB | 397.35 KB | ■ No change |
connectionState.js | 680 Bytes | 680 Bytes | ■ No change |
containerRuntime.js | 196.29 KB | 196.29 KB | ■ No change |
loader.js | 153.46 KB | 153.56 KB | ⯅ +104 Bytes |
map.js | 42.89 KB | 42.89 KB | ■ No change |
matrix.js | 131.83 KB | 131.83 KB | ■ No change |
odspDriver.js | 151.71 KB | 151.71 KB | ■ No change |
odspPrefetchSnapshot.js | 39.79 KB | 39.79 KB | ■ No change |
sharedString.js | 153.04 KB | 153.04 KB | ■ No change |
Total Size | 1.27 MB | 1.27 MB | ⯅ +104 Bytes |
Baseline commit: bedc5040596f399fe896d5aca7f25c588a493d54
Generated by :no_entry_sign: dangerJS against 9d4039957cfa5462a76438ee5d2660949848eb81
Duplciate PR? Or maybe for different branch? I've left comments in the other one
Duplciate PR? Or maybe for different branch? I've left comments in the other one
This should be the only PR. I see your comments here. Thanks for the quick review!
I'll lost track of this PR. I'm inclined to close it for couple of reasons:
- Load options were included as part of the "end" event. During our OCE shift I expected config to be logged on "start", so I made quick assumption we did not have it. So data/telemetry is there, but it was not at an obvious place (for me at least).
- It is valid suggestion Vlad made around not baking clientID into error for aggregation purposes. However, to use LoggingError here we would need to move @fluidframework/telemetry-utils our of client packages, and think in general how telemetry from server packages translates to client errors (another topic above on DataCorruptionError vs DataProcessingError)
Lessons learned: Close or complete your old PRs.