FluidFramework icon indicating copy to clipboard operation
FluidFramework copied to clipboard

Removed ScheduleManager and ScheduleManagerCore from container runtime

Open agarwal-navin opened this issue 1 year ago • 5 comments

Description

ScheduleManager and ScheduleManagerCore are no longer needed. They have two responsibilities as per the comment on the class:

/**
 * This class has the following responsibilities:
 *
 * 1. It tracks batches as we process ops and raises "batchBegin" and "batchEnd" events.
 * As part of it, it validates batch correctness (i.e. no system ops in the middle of batch)
 *
 * 2. It creates instance of ScheduleManagerCore that ensures we never start processing ops from batch
 * unless all ops of the batch are in.
 */

We don't need ScheduleManager to do these things anymore. Batching was complicated before because the Container / Deltamanager were in charge of deciding when a batch would be flushed. However, now the runtime is in control of sending a batch together and then processing it together.

This PR removes both these classes and incorporates the functionality into the processing pipeline directly as follows:

  1. Schedule manager tracked batches as we process ops and raises "batchBegin" and "batchEnd" events. As part of it, it validates batch correctness (i.e. no system ops in the middle of batch) - These events are emitted directly in ContainrRuntime::process which has knowledge of the entire batch. The batch correctness logic is moved into RemoteMessageProcessor::process which already had some of these validations in place.

  2. It creates instance of ScheduleManagerCore that ensures we never start processing ops from batch unless all ops of the batch are in - This functionality is already built in into the RemoteMessageProcessor::process which ensures that we don't start processing batches until we get the entire batch.

  3. ScheduleManager called DeltaScheduler on batch start and batch end which evaluated if it should yield a JS turn before processing more ops. This functionality is preserved and the container runtime calls DeltaScheduler directly now.

Testing

  • The existing tests for batch validation pass. The only change is the error strings are different for some of them.
  • The delta scheduler tests were in packages/test/functional-tests/src/test/containerRuntime.spec.ts. These tests have been moved to the packages/runtime/container-runtime/src/test/batching.spec.ts. The tests validate the same scenarios with updated patterns to validate them.
  • There was a delta manager test in packages/test/functional-tests/src/test/containerRuntime.spec.ts which has been moved to delta manager unit tests.

AB#13509

agarwal-navin avatar Aug 23 '24 19:08 agarwal-navin

@fluid-example/bundle-size-tests: -11.94 KB
Metric NameBaseline SizeCompare SizeSize Diff
aqueduct.js 459.94 KB 456.93 KB -3.01 KB
azureClient.js 557.97 KB 554.97 KB -3 KB
connectionState.js 680 Bytes 680 Bytes No change
containerRuntime.js 260.75 KB 257.72 KB -3.03 KB
fluidFramework.js 403.42 KB 403.43 KB +14 Bytes
loader.js 134.2 KB 134.22 KB +14 Bytes
map.js 42.25 KB 42.25 KB +7 Bytes
matrix.js 146.42 KB 146.42 KB +7 Bytes
odspClient.js 525.25 KB 522.26 KB -3 KB
odspDriver.js 97.66 KB 97.68 KB +21 Bytes
odspPrefetchSnapshot.js 42.72 KB 42.73 KB +14 Bytes
sharedString.js 163.12 KB 163.13 KB +7 Bytes
sharedTree.js 393.93 KB 393.94 KB +7 Bytes
Total Size 3.3 MB 3.29 MB -11.94 KB

Baseline commit: f5277b818bea5b4f40f9ad1972ca51d6c35dbf24

Generated by :no_entry_sign: dangerJS against cc04a4b9500e162fa93e468613baead816b412aa

msfluid-bot avatar Aug 23 '24 22:08 msfluid-bot

public process({ ...messageCopy }: ISequencedDocumentMessage, local: boolean) {

This message technically doesn't process the op anymore, it just constructs the batch as a pre-processing step to processBatch method you've introduced. But renaming this will be tough as it's public?


Refers to: packages/runtime/container-runtime/src/containerRuntime.ts:2649 in cc04a4b. [](commit_id = cc04a4b9500e162fa93e468613baead816b412aa, deletion_comment = False)

rajatch5 avatar Aug 27 '24 15:08 rajatch5

public process({ ...messageCopy }: ISequencedDocumentMessage, local: boolean) {

This message technically doesn't process the op anymore, it just constructs the batch as a pre-processing step to processBatch method you've introduced. But renaming this will be tough as it's public?

Refers to: packages/runtime/container-runtime/src/containerRuntime.ts:2649 in cc04a4b. [](commit_id = cc04a4b, deletion_comment = False)

Yeah, we need a process method since it's public on IRuntime. It can be seen as the core process logic which uses helpers to process different types of messages so it's probably fine..

agarwal-navin avatar Aug 27 '24 18:08 agarwal-navin

I believe it's the right direction, but it's worth calling out some risks that are not covered. The "old" files (files that were not modified for some amount of time) with trailing using "old" batching system will rely on ScheduleManager magic that is gone. Our document load flow brings all trailing ops (I believe for both ODSP & AFR). If app (like Loop) uses policy to apply all "cached" ops, then all such ops will be applied to container (in one go) before container is returned to caller, so in essence we do satisfy guarantees of non-breaking batches even without ScheduleManager in such case.

The part where it could break - using defaults and not applying cached ops at once. The chance of hitting a problem is higher for AzureClient 1.x documents, as such clients were on older version of FF longer. While chance is really low (it's not just that there needs to be trailing ops in old batch format, it's also that it should matter that they are not applied synchronously), it might be worth considering how we could help users if they were to run into a problem like that (i.e. if we cause a regression).

The only thing that comes to mind - wait for "connected' stated on boot, to ensure that all ops are applied before app moves to rendering / next steps.

vladsud avatar Aug 27 '24 21:08 vladsud

I believe it's the right direction, but it's worth calling out some risks that are not covered. The "old" files (files that were not modified for some amount of time) with trailing using "old" batching system will rely on ScheduleManager magic that is gone. Our document load flow brings all trailing ops (I believe for both ODSP & AFR). If app (like Loop) uses policy to apply all "cached" ops, then all such ops will be applied to container (in one go) before container is returned to caller, so in essence we do satisfy guarantees of non-breaking batches even without ScheduleManager in such case.

The part where it could break - using defaults and not applying cached ops at once. The chance of hitting a problem is higher for AzureClient 1.x documents, as such clients were on older version of FF longer. While chance is really low (it's not just that there needs to be trailing ops in old batch format, it's also that it should matter that they are not applied synchronously), it might be worth considering how we could help users if they were to run into a problem like that (i.e. if we cause a regression).

The only thing that comes to mind - wait for "connected' stated on boot, to ensure that all ops are applied before app moves to rendering / next steps.

Good point about legacy runtime messages. I was able to run the logic that ensures batches are processed together for these messages as well. I added tests and validated that it worked. However, when I run those against existing logic (without this PR), it fails in the below code with assert 0x29b:

if (!isRuntimeMessage(message)) {
	// Protocol messages should never show up in the middle of the batch!
	if (this.currentBatchClientId !== undefined) {
		throw DataProcessingError.create(
			"Received a system message during batch processing", // Formerly known as assert 0x29a
			"trackPending",
			message,
			{
				runtimeVersion: pkgVersion,
				batchClientId:
					// eslint-disable-next-line @typescript-eslint/prefer-nullish-coalescing
					this.currentBatchClientId === null ? "null" : this.currentBatchClientId,
				pauseSequenceNumber: this.pauseSequenceNumber,
				localBatch: this.currentBatchClientId === this.getClientId(),
				messageType: message.type,
			},
		);
	}

	assert(batchMetadata === undefined, 0x29b /* "system op in a batch?" */);
	return;
}

The isRuntimeMessage only checks whether the type is MessageType.Operation or not. So, for messages with type like ContainerMessageType.FluidDataStoreOp, it fails Basically, the current code doesn't work for these types of messages. I could fix it, but should I? Considering it doesn't even work.

agarwal-navin avatar Aug 28 '24 00:08 agarwal-navin

The isRuntimeMessage only checks whether the type is MessageType.Operation or not. So, for messages with type like ContainerMessageType.FluidDataStoreOp, it fails Basically, the current code doesn't work for these types of messages. I could fix it, but should I? Considering it doesn't even work.

@agarwal-navin Can you at least open a bug that describes the exact repro case? I think it might be a quick fix (add another check on the message type), so we should evaluate the risk/benefit there, we can discuss in the bug.

markfields avatar Aug 28 '24 23:08 markfields

Putting this on hold for now because it will break the last seq# and min seq# tracking in delta manager. In the mean time, I have another PR that simplfy schedule manager and the process pipeline- https://github.com/microsoft/FluidFramework/pull/22356.

agarwal-navin avatar Sep 03 '24 17:09 agarwal-navin

This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!