FluidFramework icon indicating copy to clipboard operation
FluidFramework copied to clipboard

Second Attempt to fix AB 218 - Files with lot of unsummarized Ops.

Open NicholasCouri opened this issue 3 years ago • 1 comments

The problem happens when we generate thousands of ops in a short period of time and get throttled by the push service. The users keep sending ops and, in case we are eventually able to recover, after retrying to connect (respecting imposed delay by the service) and submitting the outstanding ops, we reach a point in which the Summarizer needs to generate a snapshot as the server is now throttling the clients due to the limit of unsummarized ops.

At this point, the server starts to send the following error: NACK (ThrottlingError): Submit a summary before inserting additional ops.

From this moment on, the client goes into in a loop trying to launch a summarizer and immediately asking it to stop as the parent of the summarizer is getting disconnected/throttled after trying to send an op (probably join ?).

Description

Before this change, if the parent is not connected, we would immediately ask the summarizer to stop. With these changes, we will only ask the summarizer to stop in case the last summary would not be generated ( stopReason !=="parentNotConnected" )

Notice that we considered removing the entire short-circuit unconditionally but decided not to as to preserve resources as launching the summarizer against a scenario that is guaranteed to NOT generate the last summary would be a waste of resources.

I took the chance to add the stop reason for the scenario in which we return early.

Does this introduce a breaking change?

No.

Any relevant information

I did run the stress tests under the debugger and, when the condition we care about was hit, we successfully summarized.

As a side not from the impact of these changes, in the last 30 days:

  • 231638 documents hit the "early exit after starting summarizer"

Note: I still need to run the test coverage to see if it does not timeout.

NicholasCouri avatar Aug 10 '22 17:08 NicholasCouri

@fluid-example/bundle-size-tests: +328 Bytes
Metric NameBaseline SizeCompare SizeSize Diff
aqueduct.js 388.72 KB 388.88 KB +164 Bytes
connectionState.js 680 Bytes 680 Bytes No change
containerRuntime.js 191.9 KB 192.06 KB +164 Bytes
loader.js 151.12 KB 151.12 KB No change
map.js 42.63 KB 42.63 KB No change
matrix.js 128.09 KB 128.09 KB No change
odspDriver.js 150.23 KB 150.23 KB No change
odspPrefetchSnapshot.js 38.39 KB 38.39 KB No change
sharedString.js 148.73 KB 148.73 KB No change
Total Size 1.24 MB 1.24 MB +328 Bytes

Baseline commit: f63f1db894d53e997899ba9a73fcc9078d8dbd1c

Generated by :no_entry_sign: dangerJS against 268cf0717cfe221fb1a0e85128e61ff56d47d52e

msfluid-bot avatar Aug 10 '22 23:08 msfluid-bot

Is there a way to replicate this in the tests?

tyler-cai-microsoft avatar Aug 10 '22 23:08 tyler-cai-microsoft

Is there a way to replicate this in the tests?

Just added it :)

NicholasCouri avatar Aug 11 '22 07:08 NicholasCouri

This commit is queued for merging with the next branch! Please ignore this PR for now. Contact @microsoft/fluid-cr-infra for help.

github-actions[bot] avatar Aug 11 '22 21:08 github-actions[bot]