FluidFramework Op bunching 1: Bunch contiguous ops for data store in a batch

Reviewer guidance

This is part 1 of 3 of the op bunching feature. The PR with the end-to-end feature is https://github.com/microsoft/FluidFramework/pull/22686. It has been broken down to simplify the review process. This part focuses on the changes in the Runtime layer.

Note - This change breaks the snapshot tests because old snapshots with merge tree had "catchupOps" blobs which contained ops. Now that the "metadata" property is no longer sent to DDSes, snapshot comparison fails because the property is absent from the latest snapshots for these merge tree instances. The snapshot normalizer is updated in this PR to account for this.

Problem

During op processing, the container runtime sends ops one at a time to data stores, which in turn send them one at a time to DDSes. If a DDS has received M contiguous ops as part of a batch, the DDS is called M times to process them individually. This has performance implications for some DDSes, which would benefit from receiving and processing these M ops together.

Take shared tree for example: for each op received that contains a sequenced commit, all the pending commits are processed by the rebaser. So, as the number of ops received grows, so does the processing of pending commits. Concretely: currently, if a shared tree client has N pending commits that have yet to be sequenced, each time that client receives a sequenced commit authored by another client (an op), it updates each of its pending commits, which takes at least O(N) work. Instead, if it receives M commits at once, it could do a single update pass over each pending commit instead of M passes per pending commit. It can compose the M commits into a single change to update over, so it can potentially go from something like O(N * M) work to O(N + M) work with batching.
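To make the complexity argument concrete, here is a minimal sketch using a toy model. It is not shared tree's rebaser: the assumption is that a change is a numeric offset, composing changes is addition, and rebasing one pending commit over one change costs one unit of work. All names (`ToyChange`, `rebaseOneAtATime`, `rebaseBunched`) are hypothetical.

```typescript
// Toy model (an assumption for illustration only): a change is a numeric
// offset, composing changes is addition, and each rebase or compose step
// costs one unit of work.
type ToyChange = number;

interface RebaseOutcome {
	rebased: ToyChange[];
	work: number;
}

// Without bunching: every incoming commit triggers a pass over all pending commits.
function rebaseOneAtATime(
	pending: readonly ToyChange[],
	incoming: readonly ToyChange[],
): RebaseOutcome {
	let work = 0;
	let rebased = [...pending];
	for (const change of incoming) {
		rebased = rebased.map((p) => {
			work += 1;
			return p + change;
		});
	}
	return { rebased, work };
}

// With bunching: compose the M incoming commits first, then do a single pass.
function rebaseBunched(
	pending: readonly ToyChange[],
	incoming: readonly ToyChange[],
): RebaseOutcome {
	let work = 0;
	let composed: ToyChange = 0;
	for (const change of incoming) {
		work += 1;
		composed += change;
	}
	const rebased = pending.map((p) => {
		work += 1;
		return p + composed;
	});
	return { rebased, work };
}

const pendingCommits: ToyChange[] = [10, 20, 30, 40]; // N = 4
const incomingCommits: ToyChange[] = [1, 2, 3]; // M = 3
const individually = rebaseOneAtATime(pendingCommits, incomingCommits);
const bunched = rebaseBunched(pendingCommits, incomingCommits);
console.log(individually.work, bunched.work);
```

With N = 4 and M = 3, the one-at-a-time path does N * M = 12 units of work and the bunched path does N + M = 7, while both produce the same rebased commits.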

Solution - op bunching

The solution implemented here is a feature called "op bunching". With this feature, contiguous ops in a grouped op batch that belong to the same data store / DDS are bunched and sent to it as an array: the grouped op is sent as an ISequencedRuntimeMessage, and the contents of the individual messages in it are sent as an array along with each message's clientSequenceNumber. The container runtime sends each data store its bunch of contiguous ops. The data store, in turn, sends each DDS its bunch of contiguous ops. The DDS can choose how to process these ops. Shared tree, for instance, would compose the commits in all these ops and update pending commits with the result. Bunching only contiguous ops for a data store / DDS in a batch preserves the behavior of processing ops in the order in which they were received.
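The grouping rule described above can be sketched as follows. This is a simplified illustration and not the code in this PR: `InboundOp`, `OpBunch`, and `bunchContiguousOps` are hypothetical names, and the real runtime messages carry more fields than shown here.

```typescript
// Hypothetical, simplified shapes; not the actual Fluid Framework types.
interface InboundOp {
	address: string; // id of the target data store (or DDS)
	contents: unknown;
	clientSequenceNumber: number;
}

interface OpBunch {
	address: string;
	ops: { contents: unknown; clientSequenceNumber: number }[];
}

// Group only *contiguous* ops that share an address, so the overall
// processing order of the batch is preserved.
function bunchContiguousOps(batch: readonly InboundOp[]): OpBunch[] {
	const bunches: OpBunch[] = [];
	for (const op of batch) {
		const entry = { contents: op.contents, clientSequenceNumber: op.clientSequenceNumber };
		const last = bunches[bunches.length - 1];
		if (last !== undefined && last.address === op.address) {
			// Same target as the previous op: extend the current bunch.
			last.ops.push(entry);
		} else {
			// Different target (or first op): start a new bunch.
			bunches.push({ address: op.address, ops: [entry] });
		}
	}
	return bunches;
}

const exampleBatch: InboundOp[] = [
	{ address: "A", contents: "a1", clientSequenceNumber: 1 },
	{ address: "A", contents: "a2", clientSequenceNumber: 2 },
	{ address: "B", contents: "b1", clientSequenceNumber: 3 },
	{ address: "A", contents: "a3", clientSequenceNumber: 4 },
];
const exampleBunches = bunchContiguousOps(exampleBatch);
console.log(exampleBunches.map((b) => `${b.address}:${b.ops.length}`).join(" "));
```

In this example the last op for "A" is not merged into the first "A" bunch, because an op for "B" sits between them. The result is three bunches (A with 2 ops, B with 1, A with 1), which keeps the original processing order.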

A couple of behavior changes to note:

Op events - An implication of this change is that the timing of "op" events emitted by the container runtime and data store runtime will change. Currently, these layers emit an "op" event immediately after an op is processed. With this change, an upper layer will only know when a bunch has been processed by a lower layer. So, it will emit "op" events for the individual ops in the bunch after the entire bunch is processed. From my understanding, this should be fine because we do not provide any guarantee that the "op" event will be emitted immediately after an op is processed. These events will be emitted in order of op processing and (sometime) after the op is processed. Take delta manager / container runtime as an example. Delta manager sends an op to the container runtime for processing and emits the "op" event. However, the container runtime may choose not to process these ops immediately but to save them until an entire batch is received. This change was made but was reverted due to some concerns not related to the topic discussed here - https://github.com/microsoft/FluidFramework/pull/21785. The change here is similar to the above behavior, where an upper layer doesn't know and shouldn't care what lower layers do with ops.

metadata property on message - With this PR, the metadata property is removed from a message before it's sent to data stores and DDSes. This is because we now send one common message (the grouped op) and an array of contents. Individual messages within a grouped op have batch begin and end metadata, but the runtime adds it only to make them look like old batch messages. The data store and DDS don't care about it, so removing it should be fine. This also results in the "snapshot test" failing, as explained before, and this PR contains a fix for that.
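The change in "op" event timing can be illustrated with a small sketch. It is an assumption-laden toy, not the runtime's implementation: `UpperLayerSketch`, `onOp`, and `process` are hypothetical names, and a plain listener array stands in for the real event emitter.

```typescript
type OpListener = (contents: unknown) => void;

// Hypothetical stand-in for an upper layer (e.g. a runtime) that hands a
// whole bunch to a lower layer and emits "op" events only afterwards.
class UpperLayerSketch {
	private readonly listeners: OpListener[] = [];
	private readonly processBunch: (bunch: readonly unknown[]) => void;

	constructor(processBunch: (bunch: readonly unknown[]) => void) {
		this.processBunch = processBunch;
	}

	onOp(listener: OpListener): void {
		this.listeners.push(listener);
	}

	process(bunch: readonly unknown[]): void {
		// The lower layer processes the entire bunch first...
		this.processBunch(bunch);
		// ...then "op" events are emitted per op, in processing order.
		for (const contents of bunch) {
			for (const listener of this.listeners) {
				listener(contents);
			}
		}
	}
}

const eventLog: string[] = [];
const upperLayer = new UpperLayerSketch((bunch) => {
	for (const contents of bunch) {
		eventLog.push(`processed ${String(contents)}`);
	}
});
upperLayer.onOp((contents) => eventLog.push(`op event ${String(contents)}`));
upperLayer.process(["x", "y"]);
console.log(eventLog.join(", "));
```

Here both ops are processed before either "op" event fires, but the events still arrive in the same order as the ops were processed.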

AB#20123

Oct 17 '24 22:10 agarwal-navin

⯅ @fluid-example/bundle-size-tests: +5.21 KB

Metric Name	Baseline Size	Compare Size	Size Diff
aqueduct.js	459.85 KB	461.13 KB	⯅ +1.28 KB
azureClient.js	557 KB	558.29 KB	⯅ +1.29 KB
connectionState.js	724 Bytes	724 Bytes	■ No change
containerRuntime.js	259.47 KB	260.73 KB	⯅ +1.26 KB
fluidFramework.js	405.97 KB	405.98 KB	⯅ +14 Bytes
loader.js	134.16 KB	134.18 KB	⯅ +14 Bytes
map.js	42.46 KB	42.46 KB	⯅ +7 Bytes
matrix.js	148.29 KB	148.29 KB	⯅ +7 Bytes
odspClient.js	523.96 KB	525.25 KB	⯅ +1.29 KB
odspDriver.js	97.84 KB	97.86 KB	⯅ +21 Bytes
odspPrefetchSnapshot.js	42.81 KB	42.82 KB	⯅ +14 Bytes
sharedString.js	164.48 KB	164.49 KB	⯅ +7 Bytes
sharedTree.js	396.43 KB	396.43 KB	⯅ +7 Bytes
Total Size	3.31 MB	3.31 MB	⯅ +5.21 KB

Baseline commit: b30731fbadbcf353aacffd8b35c6815ce70019bc

Generated by 🚫 dangerJS against 42b90c9d4582236308e334030d8bfafb151c8518

Oct 18 '24 02:10 msfluid-bot

🔗 No broken links found! ✅

Your attention to detail is admirable.

linkcheck output


> [email protected] ci:linkcheck /home/runner/work/FluidFramework/FluidFramework/docs
> start-server-and-test ci:start 1313 linkcheck:full

1: starting server using command "npm run ci:start"
and when url "[ 'http://127.0.0.1:1313' ]" is responding with HTTP status code 200
running tests using command "npm run linkcheck:full"


> [email protected] ci:start
> http-server ./public --port 1313 --silent


> [email protected] linkcheck:full
> npm run linkcheck:fast -- --external


> [email protected] linkcheck:fast
> linkcheck http://localhost:1313 --skip-file skipped-urls.txt --external

Crawling...

Stats:
  439860 links
    3391 destination URLs
       2 URLs ignored
       0 warnings
       0 errors

Oct 26 '24 00:10 github-actions[bot]
