Milan fluid summary compress

Open milanro opened this issue 2 years ago • 6 comments

Note

This draft PR is a PoC of the general summary compression feature. It is not expected to be merged and is only partially implemented. The main goal is to review the design and identify the challenges.

Introduction

  • IDocumentStorageService is the interface through which the summary is uploaded to / downloaded from the Fluid server.
  • The IDocumentStorageService instance is assigned to the ContainerRuntime class member variable _storage in containerRuntime.ts.
  • The implementation class of the IDocumentStorageService instance is ContainerStorageAdapter.
  • The code uses the Node.js Buffer object. Let's discuss whether this is acceptable; it appears to be used in other places of the Fluid Framework as well.

Methods of IDocumentStorageService interface handling the summary upload / download

  • createBlob
  • readBlob
  • uploadSummaryWithContext
  • getSnapshotTree
  • downloadSummary

Implementation

Note

This is an incomplete implementation. In this first pass, it is not clear where the methods

  • getSnapshotTree
  • downloadSummary

are called, so these methods do not have decompression implemented (though they are expected to retrieve the summary from the server).

Code

Location

The code consists of 3 new files and 1 edited file (a small change):

  • packages/runtime/container-runtime/src/summaryStorageAdapter.ts
  • packages/runtime/container-runtime/src/summaryStorageCompressionHooks.ts
  • packages/runtime/container-runtime/src/summaryBlobProtocol.ts
  • packages/runtime/container-runtime/src/containerRuntime.ts

Design Overview

  • The implementation is based on the delegation pattern.
  • The SummaryStorageAdapter class delegates calls to the underlying IDocumentStorageService instance.
  • It also invokes preprocessing and postprocessing hooks.
  • The hooks are abstract / pluggable.
  • The present hook implementation contains the compression algorithms.
  • It is simple, however, to add further hooks, such as encryption (see the sketch below).
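
As a rough illustration of this delegation-plus-hooks shape, here is a minimal, self-contained sketch. The interface and method names (IStorageLike, ISummaryStorageHooks, onPreUploadSummaryWithContext, onPostReadBlob) are simplified assumptions for illustration, not the exact code in this PR:

    interface IStorageLike {
        readBlob(id: string): Promise<ArrayBufferLike>;
        uploadSummaryWithContext(summary: unknown, context: unknown): Promise<string>;
    }

    interface ISummaryStorageHooks {
        onPreUploadSummaryWithContext(summary: unknown): unknown; // e.g. compress blobs
        onPostReadBlob(blob: ArrayBufferLike): ArrayBufferLike;   // e.g. decompress
    }

    class SummaryStorageAdapter implements IStorageLike {
        constructor(
            private readonly delegate: IStorageLike,
            private readonly hooks: ISummaryStorageHooks,
        ) {}

        // Delegate first, then let the hook post-process (e.g. decompress).
        public async readBlob(id: string): Promise<ArrayBufferLike> {
            return this.hooks.onPostReadBlob(await this.delegate.readBlob(id));
        }

        // Pre-process (e.g. compress blobs) before delegating the upload.
        public async uploadSummaryWithContext(summary: unknown, context: unknown): Promise<string> {
            return this.delegate.uploadSummaryWithContext(
                this.hooks.onPreUploadSummaryWithContext(summary),
                context,
            );
        }
    }

An encryption hook would slot in the same way, composed with the compression hook via the multi-hook class described below.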

Blob protocol

  • This implementation compresses only blobs.
  • Old versions of summaries must still be supported.
  • We need to add extra information to the blob's binary representation telling us which compression algorithm was used.
  • We also need to support cases where this information is not present.
  • The blob header is inserted at the beginning of the blob binary.

Blob Header

  • HEADER ID : 16 bytes of unique identifier : f4e72832a5bd47d7a2f35ed47ed94a3c

  • HEADER CONTENT LENGTH : 4 bytes of length of the rest of header content (after the length field)

  • HEADER CONTENT : HEADER CONTENT LENGTH bytes

  • HEADER CONTENT contains key-value pairs
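
A minimal sketch of this layout follows, using the Node.js Buffer already discussed. The big-endian length field and the JSON serialization of the key-value content are illustrative assumptions, not necessarily the PR's actual wire format:

    const HEADER_ID = Buffer.from("f4e72832a5bd47d7a2f35ed47ed94a3c", "hex"); // 16 bytes

    // Serialize the key-value pairs; the real wire format is an assumption here.
    function headerContentFor(pairs: Record<string, number>): Buffer {
        return Buffer.from(JSON.stringify(pairs), "utf-8");
    }

    // Prepend HEADER ID + HEADER CONTENT LENGTH + HEADER CONTENT to the blob body.
    function writeBlobHeader(content: Buffer, body: Buffer): Buffer {
        const length = Buffer.alloc(4);
        length.writeUInt32BE(content.length, 0);
        return Buffer.concat([HEADER_ID, length, content, body]);
    }

    // Returns undefined when the 16-byte identifier is absent (legacy blob).
    function readBlobHeader(blob: Buffer): { content: Buffer; body: Buffer } | undefined {
        if (blob.length < 20 || !blob.subarray(0, 16).equals(HEADER_ID)) {
            return undefined;
        }
        const contentLength = blob.readUInt32BE(16);
        return {
            content: blob.subarray(20, 20 + contentLength),
            body: blob.subarray(20 + contentLength),
        };
    }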

Compression

  • The blob header contains the key 'ALG', whose value is a number identifying the algorithm: NONE = 1, LZ4 = 2, DEFLATE = 3.
  • Compression happens based on the configured requirement (the algorithm identifier is passed to the hook's constructor).
  • The compression algorithm is hardcoded; it could instead be passed as a function to the hook constructor.
  • For decompression, however, there would need to be a registry of algorithms keyed by number identifier, so the proper one can be chosen when a blob with a header is received (see the sketch below).
  • If the blob header is not present (f4e72832a5bd47d7a2f35ed47ed94a3c is not found at the expected position), no decompression happens on receiving the summary; it is passed through as received.
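
Decompression dispatch could then be a small registry keyed by the 'ALG' number, along these lines. This is a sketch assuming Node's built-in zlib for DEFLATE; LZ4 would need a third-party binding:

    import { inflateSync } from "zlib";

    enum SummaryCompressionAlgorithm { None = 1, LZ4 = 2, Deflate = 3 }

    const decoders = new Map<number, (body: Buffer) => Buffer>([
        [SummaryCompressionAlgorithm.None, (body) => body],
        [SummaryCompressionAlgorithm.Deflate, (body) => inflateSync(body)],
        // LZ4 omitted: it requires a third-party package.
    ]);

    function decodeBlobBody(alg: number, body: Buffer): Buffer {
        const decode = decoders.get(alg);
        if (decode === undefined) {
            throw new Error(`Unknown summary compression algorithm: ${alg}`);
        }
        return decode(body);
    }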

uploadSummaryWithContext challenges

  • It looks like there is no download method symmetric to this upload method (uploadSummaryWithContext).
  • The download is performed via the readBlob method (at least in some DDSes).
  • uploadSummaryWithContext processes an ISummaryTree object.
  • It contains the ISummaryBlob objects that are supposed to be compressed.
  • An ISummaryBlob can contain either binary or string content.
  • String content must be converted to binary; UTF-8 encoding is used (is this correct?).
  • Then the content is compressed and the header is inserted (see the sketch below).
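
Put together with the header sketch above (writeBlobHeader, headerContentFor), the per-blob upload path could look roughly like this; again an assumption-level sketch, with UTF-8 assumed for string content as questioned above:

    import { deflateSync } from "zlib";

    // Compress one ISummaryBlob's content and prepend the blob header.
    function compressSummaryBlobContent(content: string | Uint8Array): Buffer {
        // String content is converted to binary first; UTF-8 is assumed.
        const binary = typeof content === "string"
            ? Buffer.from(content, "utf-8")
            : Buffer.from(content);
        const compressed = deflateSync(binary);
        // Announce ALG = 3 (DEFLATE) in the header, per the numbering above.
        return writeBlobHeader(headerContentFor({ ALG: 3 }), compressed);
    }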

Main Issue

  • An ISummaryBlob can carry binary content.
  • When binary content is set, the upload fails in summaryTreeUploadManager#writeSummaryBlob.
  • The assertion fires because two different methods of generating the hash disagree, at assert(hash === blob.sha, ...):
    async writeSummaryBlob(content) {
        const { parsedContent, encoding } = typeof content === "string"
            ? { parsedContent: content, encoding: "utf-8" }
            : { parsedContent: Uint8ArrayToString(content, "base64"), encoding: "base64" };
        // The gitHashFile would return the same hash as returned by the server as blob.sha
        const hash = await gitHashFile(IsoBuffer.from(parsedContent, encoding));
        if (!this.blobsShaCache.has(hash)) {
            this.blobsShaCache.set(hash, "");
            const blob = await this.manager.createBlob(parsedContent, encoding);
            assert(hash === blob.sha, 0x0b6 /* "Blob.sha and hash do not match!!" */);
        }
        return hash;
    }

Workaround

  • The workaround is to base64-encode the compressed binary blob content together with its header.
  • readBlob must expect this.
  • readBlob searches for the header in the binary.
  • If it is not found, readBlob base64-decodes the content and searches for the header again.
  • If it is still not found, no decompression happens (see the sketch below).
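
Sketched with the helpers from the earlier snippets (readBlobHeader, decodeBlobBody, and the assumed JSON key-value encoding), the resulting read path would look roughly like this:

    // Search for the header in the raw bytes; if absent, retry once after
    // base64-decoding (the upload workaround); otherwise pass through.
    function maybeDecompressBlob(raw: ArrayBufferLike): Buffer {
        const buf = Buffer.from(raw);
        let header = readBlobHeader(buf);
        if (header === undefined) {
            header = readBlobHeader(Buffer.from(buf.toString("utf-8"), "base64"));
        }
        if (header === undefined) {
            return buf; // legacy blob: no decompression, passed as received
        }
        const { ALG } = JSON.parse(header.content.toString("utf-8"));
        return decodeBlobBody(ALG, header.body);
    }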

summaryStorageAdapter.ts

  • Contains the delegation class for the IDocumentStorageService interface, which applies the preprocessing and postprocessing hooks.
  • Contains a multi-hook class that can apply several hooks in sequence (for example, if we want to encrypt and compress).
  • Contains helper functions such as retrieving / replacing a blob in an ISummaryTree and traversing an ISummaryTree to collect blob paths.
  • Contains a factory function to instantiate the SummaryStorageAdapter.

summaryBlobProtocol.ts

  • Contains the BlobHeader class, which stores key/value pairs and generates the binary representation.
  • Contains a function that inserts the binary representation of a BlobHeader into a binary blob.
  • Contains a function that reads the BlobHeader from a binary blob.
  • Contains a function that strips the BlobHeader binary representation from a binary blob.

summaryStorageCompressionHooks.ts

  • Contains the implementations of the hooks.
  • Contains blob compression.
  • Contains blob decompression.
  • Contains the traversal that compresses the blobs in an ISummaryTree.

Questions and Arguments

  • Should summary compression be configurable? How should we configure it? IRuntimeOptions? (BaseContainerRuntimeFactory)
  • Can we use the same approach as message compression (specify a threshold and an algorithm)?
  • The compression info is stored in the compressed blob; would you prefer a flag on the ISummaryTree instead? That would probably require heavy changes within DDSes, as they load blobs directly (at least SharedPropertyTree does). Also, the intercepted methods on IDocumentStorageService, such as readBlob, never see the ISummaryTree.
  • If the service still wants to decompress, is it acceptable to rely on the blob header, or do we need an ISummaryTree entry indicating the algorithm?

milanro avatar Aug 19 '22 13:08 milanro

@vladsud @DLehenbauer @dstanesc This draft PR is a partial PoC implementation of general summary compression in the Fluid Framework. It is not fully implemented and is not expected to be merged into main. It is the basis for our further discussions, as it represents my view of how and where compression could be implemented. In any case, it is functional for the common use case (usual summary generation / load).

I believe it is especially relevant for @vladsud. I would like to ask you to take a look at it and open further discussion in this PR.

Just a note that this is based solely on OO, without functional programming; we can discuss further whether you are OK with that.

milanro avatar Aug 22 '22 09:08 milanro

Maybe also relevant for @justus-camp

milanro avatar Aug 22 '22 09:08 milanro

This is an excellent discussion starter and can work conceptually pretty well with the existing implementation (i.e., the property tree summarizer draws efficiency from maximizing blob size up to the platform limits, hence very good compression rates).

However, once incremental summaries are available, the definition of efficiency for summary transfer changes slightly. For instance, the ideal scenario is when summary blobs match the size of the actual changes.

A summary partitioning algorithm (whether probabilistic or explicit) will, for typical use cases, produce smaller blobs, to be optimized heuristically per individual domain/case but likely in the (min 8 KiB, avg 16 KiB, max 32 KiB) range. Such blobs do not compress very well.

The design I have in mind circles back to the original discussion on differentiating compression for data transfer from compression for data storage. In a sense this simplifies the problem (not the implementation!), as there are ubiquitous patterns we could apply. One example at hand is HTTP client-server compression, but probably closer to Fluid is Kafka message compression:

Compression is of full batches of data, so the efficacy of batching will also impact the compression ratio (more batching means better compression). See doc

Therefore, IMHO, a canonical design for summary compression would have to include, on the client:

  • support for fine granular increments/blobs
  • ability to batch the summary increments/blobs
  • ability to compress the batches of summary increments/blobs
  • ability to partition (fixed size) the compressed batches for efficient transfer

The server will need to provide support for the above.
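
A toy comparison (made-up inputs, Node's zlib as the codec) illustrates the batching point: compressing one concatenated batch typically beats compressing many small blobs individually, because the codec shares a single dictionary across them.

    import { deflateSync } from "zlib";

    // Compare per-blob compression against compressing one concatenated batch.
    function compressedSizes(blobs: Buffer[]): { individual: number; batched: number } {
        const individual = blobs.reduce((sum, b) => sum + deflateSync(b).length, 0);
        const batched = deflateSync(Buffer.concat(blobs)).length;
        // batched is typically smaller, especially for many small, similar blobs.
        return { individual, batched };
    }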

dstanesc avatar Aug 23 '22 12:08 dstanesc

A summary partitioning algorithm (whether probabilistic or explicit) will, for typical use cases, produce smaller blobs, to be optimized heuristically per individual domain/case but likely in the (min 8 KiB, avg 16 KiB, max 32 KiB) range. Such blobs do not compress very well.

I have a feeling that there will always be some optimal threshold, and maybe 8 KiB blobs will not be appropriate for, say, a 1 GB DDS (which would result in 131,072 blobs). Compression is also very relevant here for reducing storage space on the server. Maybe that is a strong reason why blobs should be bigger before compression.

milanro avatar Aug 23 '22 13:08 milanro

I touched on these topics above, but some more clarification of the presented viewpoint probably won't hurt:

I have a feeling that there will always be some optimal threshold, and maybe 8 KiB blobs will not be appropriate for, say, a 1 GB DDS (which would result in 131,072 blobs).

As mentioned above, chunk sizes should be correlated with the domain and use case (i.e., commit size). I see two possible scenarios motivating large summaries, both valid:

  1. Historical growth, possibly from small ops (I/O would benefit from configuring a smaller chunking threshold)
  2. The use case produces only large ops (I/O would benefit from configuring a larger chunking threshold)

Compression is also very relevant here for reducing storage space on the server.

As mentioned above, I have in mind the logical differentiation between compression for data transfer and compression for data storage. They are conceptually orthogonal and could be implemented as such. Moreover, one would typically have different compression preferences for in-flight data (speed) vs. data at rest (size).

dstanesc avatar Aug 23 '22 13:08 dstanesc

As mentioned above, I have in mind the logical differentiation between compression for data transfer and compression for data storage. They are conceptually orthogonal and could be implemented as such. Moreover, one would typically have different compression preferences for in-flight data (speed) vs. data at rest (size).

I expect that the server is not foreseen to do any further compression, i.e., to spend CPU time on such a heavy operation. I'd rather expect that the size of your data stored on the server influences the price of the service, and that it is in your best interest to keep it as small as possible.

milanro avatar Aug 23 '22 14:08 milanro

@fluid-example/bundle-size-tests: +8.33 KB
Metric Name             | Baseline Size | Compare Size | Size Diff
aqueduct.js             | 437.17 KB     | 441.23 KB    | +4.06 KB
connectionState.js      | 680 Bytes     | 680 Bytes    | No change
containerRuntime.js     | 230.47 KB     | 234.54 KB    | +4.07 KB
loader.js               | 153.32 KB     | 153.32 KB    | No change
map.js                  | 43.85 KB      | 43.91 KB     | +65 Bytes
matrix.js               | 136.65 KB     | 136.71 KB    | +65 Bytes
odspDriver.js           | 92.06 KB      | 92.06 KB     | No change
odspPrefetchSnapshot.js | 43.4 KB       | 43.4 KB      | No change
sharedString.js         | 157.19 KB     | 157.26 KB    | +65 Bytes
Total Size              | 1.37 MB       | 1.38 MB      | +8.33 KB

Baseline commit: 1c864bbe0ea60a870fe279011ff6d6ba00d36d49

Generated by :no_entry_sign: dangerJS against 59fdce0cc92fc70e84ddf618816f0ac132d13d4a

msfluid-bot avatar Oct 14 '22 11:10 msfluid-bot

@vladsud @DLehenbauer This PR is now ready for review and possible merging.

milanro avatar Oct 20 '22 05:10 milanro

@milanro - Just FYI - We're going to have more nit-picky feedback on this PR, but we do plan to merge it. Your approach is fundamentally correct.

@dstanesc - I was wondering if you were thinking of using the same pattern for chunking? (i.e., have a storage summary adapter that can be chained before or after the compression adapter?)

DLehenbauer avatar Nov 07 '22 23:11 DLehenbauer

@milanro - Just FYI - We're going to have more nit-picky feedback on this PR, but we do plan to merge it. Your approach is fundamentally correct.

@dstanesc - I was wondering if you were thinking of using the same pattern for chunking? (i.e., have a storage summary adapter that can be chained before or after the compression adapter?)

@dstanesc @DLehenbauer The facility for chaining the adapters (or so-called hooks) is already implemented in this PR. The next hook, performing the chunking, can be added when constructing the storage in containerRuntime. So I believe either @dstanesc should wait for this PR to be merged, or he should build upon it in a new branch.

milanro avatar Nov 08 '22 13:11 milanro

@dstanesc - I was wondering if you were thinking of using the same pattern for chunking? (i.e., have a storage summary adapter that can be chained before or after the compression adapter?)

I believe so. Adopting the pattern for summary compaction would definitely weigh in favor of reusable symmetry and the design intent for runtime opt-in features. Summary compaction would require a hook implementing a single method, i.e., onPreUploadSummaryWithContext.
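
For concreteness, such a hook could carry just the chunking policy. This sketch assumes the single-method hook shape mentioned above; the ISummaryTree traversal that would apply chunkBlob to each oversized blob inside onPreUploadSummaryWithContext is left out:

    class SummaryChunkingHooks {
        constructor(private readonly chunkSizeBytes: number) {}

        // Split one blob's bytes into fixed-size chunks; a full
        // onPreUploadSummaryWithContext would traverse the ISummaryTree
        // and apply this to every oversized blob.
        public chunkBlob(blob: Uint8Array): Uint8Array[] {
            const chunks: Uint8Array[] = [];
            for (let offset = 0; offset < blob.length; offset += this.chunkSizeBytes) {
                chunks.push(blob.subarray(offset, offset + this.chunkSizeBytes));
            }
            return chunks;
        }
    }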

dstanesc avatar Nov 08 '22 14:11 dstanesc

This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!
