Removed ScheduleManager and ScheduleManagerCore from container runtime
Description
ScheduleManager and ScheduleManagerCore are no longer needed. Per the comment on the class, they have two responsibilities:
```typescript
/**
* This class has the following responsibilities:
*
* 1. It tracks batches as we process ops and raises "batchBegin" and "batchEnd" events.
* As part of it, it validates batch correctness (i.e. no system ops in the middle of batch)
*
* 2. It creates instance of ScheduleManagerCore that ensures we never start processing ops from batch
* unless all ops of the batch are in.
 */
```
We don't need ScheduleManager to do these things anymore. Batching used to be complicated because the Container / DeltaManager decided when a batch would be flushed. Now the runtime is in control of sending a batch together and then processing it together.
This PR removes both classes and incorporates their functionality directly into the processing pipeline as follows (a sketch of the resulting flow follows this list):
- ScheduleManager tracked batches as ops were processed, raised "batchBegin" and "batchEnd" events, and validated batch correctness (i.e., no system ops in the middle of a batch). These events are now emitted directly in `ContainerRuntime::process`, which has knowledge of the entire batch. The batch correctness logic is moved into `RemoteMessageProcessor::process`, which already had some of these validations in place.
- It created an instance of ScheduleManagerCore, which ensured we never start processing ops from a batch unless all ops of the batch are in. This functionality is already built into `RemoteMessageProcessor::process`, which ensures that we don't start processing a batch until the entire batch has arrived.
- ScheduleManager called DeltaScheduler on batch start and batch end, which evaluated whether to yield a JS turn before processing more ops. This functionality is preserved; the container runtime now calls DeltaScheduler directly.
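With ScheduleManager gone, the runtime itself brackets batch processing with the events. Below is a minimal sketch of that flow; the class, the `InboundBatch` shape, and the `processBatch`/`processMessage` names are illustrative assumptions, not the actual FluidFramework code.

```typescript
// Minimal sketch only: names (RuntimeSketch, InboundBatch, processBatch) are
// assumptions for illustration, not the real containerRuntime.ts code.
import { EventEmitter } from "events";

interface SequencedMessage {
    sequenceNumber: number;
    type: string;
}

interface InboundBatch {
    messages: SequencedMessage[];
}

class RuntimeSketch extends EventEmitter {
    // Called only once the batch assembler has produced a complete batch,
    // so "batchBegin"/"batchEnd" can simply bracket the processing loop.
    protected processBatch(batch: InboundBatch, local: boolean): void {
        const first = batch.messages[0];
        if (first === undefined) {
            return;
        }
        this.emit("batchBegin", first);
        let error: unknown;
        try {
            for (const message of batch.messages) {
                this.processMessage(message, local);
            }
        } catch (e) {
            error = e;
            throw e;
        } finally {
            // "batchEnd" fires even on failure, mirroring ScheduleManager's behavior.
            this.emit("batchEnd", error, batch.messages[batch.messages.length - 1]);
        }
    }

    protected processMessage(message: SequencedMessage, local: boolean): void {
        // Dispatch to the handler for this message type; the real runtime would
        // also give DeltaScheduler a chance to yield the JS turn between batches.
    }
}
```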
Testing
- The existing tests for batch validation pass. The only change is that the error strings differ for some of them.
- The delta scheduler tests were in `packages/test/functional-tests/src/test/containerRuntime.spec.ts`. They have been moved to `packages/runtime/container-runtime/src/test/batching.spec.ts` and validate the same scenarios with updated patterns (a sketch of the updated pattern follows this list).
- There was a delta manager test in `packages/test/functional-tests/src/test/containerRuntime.spec.ts` which has been moved to the delta manager unit tests.
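For illustration, the moved tests follow a pattern roughly like this sketch: feed a whole batch through `process()` and assert the paired `batchBegin`/`batchEnd` events. The `BatchingRuntimeStub` is a placeholder for the real test harness, and mocha globals are assumed.

```typescript
// Placeholder stand-in: buffers ops until the batch-end marker, then emits
// the paired events. Not the real runtime or test setup.
import { strict as assert } from "assert";
import { EventEmitter } from "events";

interface OpLike {
    sequenceNumber: number;
    metadata?: { batch?: boolean };
}

class BatchingRuntimeStub extends EventEmitter {
    private pending: OpLike[] = [];
    public process(op: OpLike): void {
        this.pending.push(op);
        const singleOpBatch = this.pending.length === 1 && op.metadata?.batch === undefined;
        if (op.metadata?.batch === false || singleOpBatch) {
            this.emit("batchBegin", this.pending[0]);
            this.emit("batchEnd", undefined, op);
            this.pending = [];
        }
    }
}

describe("batch events (sketch)", () => {
    it("raises batchBegin/batchEnd once per batch", () => {
        const runtime = new BatchingRuntimeStub();
        let begins = 0;
        let ends = 0;
        runtime.on("batchBegin", () => begins++);
        runtime.on("batchEnd", () => ends++);
        // Three ops in one batch: {batch: true}, unmarked, {batch: false}.
        runtime.process({ sequenceNumber: 1, metadata: { batch: true } });
        runtime.process({ sequenceNumber: 2 });
        runtime.process({ sequenceNumber: 3, metadata: { batch: false } });
        assert.equal(begins, 1);
        assert.equal(ends, 1);
    });
});
```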
⯆ @fluid-example/bundle-size-tests: -11.94 KB
| Metric Name | Baseline Size | Compare Size | Size Diff |
|---|---|---|---|
| aqueduct.js | 459.94 KB | 456.93 KB | ⯆ -3.01 KB |
| azureClient.js | 557.97 KB | 554.97 KB | ⯆ -3 KB |
| connectionState.js | 680 Bytes | 680 Bytes | ■ No change |
| containerRuntime.js | 260.75 KB | 257.72 KB | ⯆ -3.03 KB |
| fluidFramework.js | 403.42 KB | 403.43 KB | ⯅ +14 Bytes |
| loader.js | 134.2 KB | 134.22 KB | ⯅ +14 Bytes |
| map.js | 42.25 KB | 42.25 KB | ⯅ +7 Bytes |
| matrix.js | 146.42 KB | 146.42 KB | ⯅ +7 Bytes |
| odspClient.js | 525.25 KB | 522.26 KB | ⯆ -3 KB |
| odspDriver.js | 97.66 KB | 97.68 KB | ⯅ +21 Bytes |
| odspPrefetchSnapshot.js | 42.72 KB | 42.73 KB | ⯅ +14 Bytes |
| sharedString.js | 163.12 KB | 163.13 KB | ⯅ +7 Bytes |
| sharedTree.js | 393.93 KB | 393.94 KB | ⯅ +7 Bytes |
| Total Size | 3.3 MB | 3.29 MB | ⯆ -11.94 KB |
Baseline commit: f5277b818bea5b4f40f9ad1972ca51d6c35dbf24
Generated by :no_entry_sign: dangerJS against cc04a4b9500e162fa93e468613baead816b412aa
public process({ ...messageCopy }: ISequencedDocumentMessage, local: boolean) {
This method technically doesn't process the op anymore; it just constructs the batch as a pre-processing step to the processBatch method you've introduced. But renaming it will be tough since it's public?
Refers to: packages/runtime/container-runtime/src/containerRuntime.ts:2649 in cc04a4b.
Yeah, we need a process method since it's public on IRuntime. It can be seen as the core process logic which uses helpers to process different types of messages, so it's probably fine.
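For context, the new shape can be read roughly like this sketch: the public `process()` only feeds the batch assembler and delegates completed batches. All names here are assumptions for illustration, including the `RemoteMessageProcessor`-style helper that returns a completed batch (or `undefined` while ops are still arriving).

```typescript
interface MessageLike {
    sequenceNumber: number;
    clientId: string | null;
    type: string;
}

interface IBatchAssembler {
    // Returns the full batch once its final op arrives, otherwise undefined.
    process(message: MessageLike): MessageLike[] | undefined;
}

class ContainerRuntimeSketch {
    public constructor(private readonly remoteMessageProcessor: IBatchAssembler) {}

    // process() must stay public (it is on IRuntime); it now only feeds the
    // batch assembler and hands completed batches to processBatch().
    public process(messageCopy: MessageLike, local: boolean): void {
        const batch = this.remoteMessageProcessor.process(messageCopy);
        if (batch !== undefined) {
            this.processBatch(batch, local);
        }
    }

    private processBatch(batch: MessageLike[], local: boolean): void {
        // Emit batchBegin/batchEnd and dispatch each op (see the earlier sketch).
    }
}
```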
I believe it's the right direction, but it's worth calling out some risks that are not covered. "Old" files (files that were not modified for some amount of time) with trailing ops in the "old" batching format will rely on the ScheduleManager magic that is gone. Our document load flow brings in all trailing ops (I believe for both ODSP & AFR). If an app (like Loop) uses a policy to apply all "cached" ops, then all such ops will be applied to the container (in one go) before the container is returned to the caller, so in essence we satisfy the guarantee of not breaking batches even without ScheduleManager in that case.
The part where it could break: using the defaults and not applying cached ops at once. The chance of hitting a problem is higher for AzureClient 1.x documents, as such clients were on older versions of FF longer. While the chance is really low (it's not just that there need to be trailing ops in the old batch format; it also has to matter that they are not applied synchronously), it might be worth considering how we could help users if they were to run into a problem like that (i.e., if we cause a regression).
The only thing that comes to mind: wait for the "connected" state on boot, to ensure that all ops are applied before the app moves on to rendering / next steps.
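That mitigation could look roughly like this sketch against the public fluid-framework surface (verify the event and property names against the version you use): block until the container reports "connected" before rendering.

```typescript
import { ConnectionState, type IFluidContainer } from "fluid-framework";

async function waitForConnected(container: IFluidContainer): Promise<void> {
    if (container.connectionState === ConnectionState.Connected) {
        return;
    }
    await new Promise<void>((resolve) => {
        // "connected" fires once the container has caught up and joined the session.
        container.once("connected", () => resolve());
    });
}
```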
Good point about legacy runtime messages. I was able to run the logic that ensures batches are processed together for these messages as well. I added tests and validated that it works.
However, when I ran those against the existing logic (without this PR), it fails in the code below with assert 0x29b:
```typescript
if (!isRuntimeMessage(message)) {
    // Protocol messages should never show up in the middle of the batch!
    if (this.currentBatchClientId !== undefined) {
        throw DataProcessingError.create(
            "Received a system message during batch processing", // Formerly known as assert 0x29a
            "trackPending",
            message,
            {
                runtimeVersion: pkgVersion,
                batchClientId:
                    // eslint-disable-next-line @typescript-eslint/prefer-nullish-coalescing
                    this.currentBatchClientId === null ? "null" : this.currentBatchClientId,
                pauseSequenceNumber: this.pauseSequenceNumber,
                localBatch: this.currentBatchClientId === this.getClientId(),
                messageType: message.type,
            },
        );
    }
    assert(batchMetadata === undefined, 0x29b /* "system op in a batch?" */);
    return;
}
```
isRuntimeMessage only checks whether the type is MessageType.Operation. So it fails for messages with a type like ContainerMessageType.FluidDataStoreOp.
Basically, the current code doesn't work for these types of messages. I could fix it, but should I, considering it doesn't work today either?
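To make the failure mode concrete, here is a small sketch. Modern runtime ops are enveloped with type "op" (MessageType.Operation), but very old documents can carry ops whose envelope type is the inner ContainerMessageType directly, so the check returns false and the op falls into the system-message path above. The string literals are assumptions to verify against the actual enums.

```typescript
const MessageTypeOperation = "op"; // assumed value of MessageType.Operation

interface MessageLike {
    type: string;
}

function isRuntimeMessageSketch(message: MessageLike): boolean {
    return message.type === MessageTypeOperation;
}

// A legacy op stored with its container message type on the envelope:
const legacyOp: MessageLike = { type: "component" }; // e.g. ContainerMessageType.FluidDataStoreOp
console.log(isRuntimeMessageSketch(legacyOp)); // false -> treated as a system op
```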
@agarwal-navin Can you at least open a bug that describes the exact repro case? I think it might be a quick fix (add another check on the message type), so we should evaluate the risk/benefit there; we can discuss in the bug.
Putting this on hold for now because it will break the last seq# and min seq# tracking in the delta manager. In the meantime, I have another PR that simplifies the schedule manager and the process pipeline: https://github.com/microsoft/FluidFramework/pull/22356.
This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!