Milan fluid summary compress

Open milanro opened this issue 2 years ago • 6 comments

Note

This draft PR is a PoC of the general summary compression feature. It is not expected to be merged and is only partially implemented. The main goal is to review the design and identify the challenges.

Introduction

  • IDocumentStorageService is the interface through which the summary is uploaded to / downloaded from the Fluid server.
  • The IDocumentStorageService instance is assigned to the ContainerRuntime class member variable _storage in containerRuntime.ts.
  • The implementation class of the IDocumentStorageService instance is ContainerStorageAdapter.
  • The code uses the Node.js Buffer object. Let's discuss whether this is acceptable; it appears to be used in other places of the Fluid Framework as well.

Methods of IDocumentStorageService interface handling the summary upload / download

  • createBlob
  • readBlob
  • uploadSummaryWithContext
  • getSnapshotTree
  • downloadSummary

Implementation

Note

This is an incomplete implementation. In this first pass, it is not clear where the methods

  • getSnapshotTree
  • downloadSummary

are called, so these methods do not have decompression implemented (though they are expected to retrieve the summary from the server).

Code

Location

The code consists of 3 new files and 1 edited file (a small change):

  • packages/runtime/container-runtime/src/summaryStorageAdapter.ts
  • packages/runtime/container-runtime/src/summaryStorageCompressionHooks.ts
  • packages/runtime/container-runtime/src/summaryBlobProtocol.ts
  • packages/runtime/container-runtime/src/containerRuntime.ts

Design Overview

  • The implementation is based on the delegation pattern.
  • The SummaryStorageAdapter class delegates calls to the underlying IDocumentStorageService instance.
  • It also invokes preprocessing and postprocessing hooks.
  • The hooks are abstract / pluggable.
  • The present hook implementation contains the compression algorithms.
  • It is simple, however, to add further hooks, such as encryption (see the sketch below).
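
As a rough illustration of this delegation-plus-hooks shape, here is a minimal, self-contained sketch. The interface and method names (IStorageLike, ISummaryStorageHooks, onPreUploadSummaryWithContext, onPostReadBlob) are simplified assumptions for illustration, not the exact code in this PR:

    interface IStorageLike {
        readBlob(id: string): Promise<ArrayBufferLike>;
        uploadSummaryWithContext(summary: unknown, context: unknown): Promise<string>;
    }

    interface ISummaryStorageHooks {
        onPreUploadSummaryWithContext(summary: unknown): unknown; // e.g. compress blobs
        onPostReadBlob(blob: ArrayBufferLike): ArrayBufferLike;   // e.g. decompress
    }

    class SummaryStorageAdapter implements IStorageLike {
        constructor(
            private readonly delegate: IStorageLike,
            private readonly hooks: ISummaryStorageHooks,
        ) {}

        // Delegate first, then let the hook post-process (e.g. decompress).
        public async readBlob(id: string): Promise<ArrayBufferLike> {
            return this.hooks.onPostReadBlob(await this.delegate.readBlob(id));
        }

        // Pre-process (e.g. compress blobs) before delegating the upload.
        public async uploadSummaryWithContext(summary: unknown, context: unknown): Promise<string> {
            return this.delegate.uploadSummaryWithContext(
                this.hooks.onPreUploadSummaryWithContext(summary),
                context,
            );
        }
    }

An encryption hook would slot in the same way, composed with the compression hook via the multi-hook class described below.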

Blob protocol

  • This implementation compresses only blobs.
  • Old versions of summaries must still be supported.
  • We need to add extra information to the blob's binary representation telling us which compression algorithm was used.
  • We also need to support cases where this information is not present.
  • The blob header is inserted at the beginning of the blob binary.

Blob Header

  • HEADER ID : 16 bytes of unique identifier : f4e72832a5bd47d7a2f35ed47ed94a3c

  • HEADER CONTENT LENGTH : 4 bytes of length of the rest of header content (after the length field)

  • HEADER CONTENT : HEADER CONTENT LENGTH bytes

  • HEADER CONTENT contains key-value pairs
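
A minimal sketch of this layout follows, using the Node.js Buffer already discussed. The big-endian length field and the JSON serialization of the key-value content are illustrative assumptions, not necessarily the PR's actual wire format:

    const HEADER_ID = Buffer.from("f4e72832a5bd47d7a2f35ed47ed94a3c", "hex"); // 16 bytes

    // Serialize the key-value pairs; the real wire format is an assumption here.
    function headerContentFor(pairs: Record<string, number>): Buffer {
        return Buffer.from(JSON.stringify(pairs), "utf-8");
    }

    // Prepend HEADER ID + HEADER CONTENT LENGTH + HEADER CONTENT to the blob body.
    function writeBlobHeader(content: Buffer, body: Buffer): Buffer {
        const length = Buffer.alloc(4);
        length.writeUInt32BE(content.length, 0);
        return Buffer.concat([HEADER_ID, length, content, body]);
    }

    // Returns undefined when the 16-byte identifier is absent (legacy blob).
    function readBlobHeader(blob: Buffer): { content: Buffer; body: Buffer } | undefined {
        if (blob.length < 20 || !blob.subarray(0, 16).equals(HEADER_ID)) {
            return undefined;
        }
        const contentLength = blob.readUInt32BE(16);
        return {
            content: blob.subarray(20, 20 + contentLength),
            body: blob.subarray(20 + contentLength),
        };
    }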

Compression

  • The blob header contains the key 'ALG', whose value is a number identifying the algorithm: NONE = 1, LZ4 = 2, DEFLATE = 3.
  • Compression happens based on the configured requirement (the algorithm identifier is passed to the hook's constructor).
  • The compression algorithm is hardcoded; it could instead be passed as a function to the hook constructor.
  • For decompression, however, there would need to be a registry of algorithms keyed by number identifier, so the proper one can be chosen when a blob with a header is received (see the sketch below).
  • If the blob header is not present (f4e72832a5bd47d7a2f35ed47ed94a3c is not found at the expected position), no decompression happens on receiving the summary; it is passed through as received.
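
Decompression dispatch could then be a small registry keyed by the 'ALG' number, along these lines. This is a sketch assuming Node's built-in zlib for DEFLATE; LZ4 would need a third-party binding:

    import { inflateSync } from "zlib";

    enum SummaryCompressionAlgorithm { None = 1, LZ4 = 2, Deflate = 3 }

    const decoders = new Map<number, (body: Buffer) => Buffer>([
        [SummaryCompressionAlgorithm.None, (body) => body],
        [SummaryCompressionAlgorithm.Deflate, (body) => inflateSync(body)],
        // LZ4 omitted: it requires a third-party package.
    ]);

    function decodeBlobBody(alg: number, body: Buffer): Buffer {
        const decode = decoders.get(alg);
        if (decode === undefined) {
            throw new Error(`Unknown summary compression algorithm: ${alg}`);
        }
        return decode(body);
    }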

uploadSummaryWithContext challenges

  • It looks like there is no download method symmetric to this upload method (uploadSummaryWithContext).
  • The download is performed via the readBlob method (at least in some DDSes).
  • uploadSummaryWithContext processes an ISummaryTree object.
  • It contains the ISummaryBlob objects that are supposed to be compressed.
  • An ISummaryBlob can contain either binary or string content.
  • String content must be converted to binary; UTF-8 encoding is used (is this correct?).
  • Then the content is compressed and the header is inserted (see the sketch below).
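
Put together with the header sketch above (writeBlobHeader, headerContentFor), the per-blob upload path could look roughly like this; again an assumption-level sketch, with UTF-8 assumed for string content as questioned above:

    import { deflateSync } from "zlib";

    // Compress one ISummaryBlob's content and prepend the blob header.
    function compressSummaryBlobContent(content: string | Uint8Array): Buffer {
        // String content is converted to binary first; UTF-8 is assumed.
        const binary = typeof content === "string"
            ? Buffer.from(content, "utf-8")
            : Buffer.from(content);
        const compressed = deflateSync(binary);
        // Announce ALG = 3 (DEFLATE) in the header, per the numbering above.
        return writeBlobHeader(headerContentFor({ ALG: 3 }), compressed);
    }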

Main Issue

  • An ISummaryBlob can carry binary content.
  • When binary content is set, the upload fails in summaryTreeUploadManager#writeSummaryBlob.
  • The assertion fires because two different methods of generating the hash disagree, at assert(hash === blob.sha, ...):
    async writeSummaryBlob(content) {
        const { parsedContent, encoding } = typeof content === "string"
            ? { parsedContent: content, encoding: "utf-8" }
            : { parsedContent: Uint8ArrayToString(content, "base64"), encoding: "base64" };
        // The gitHashFile would return the same hash as returned by the server as blob.sha
        const hash = await gitHashFile(IsoBuffer.from(parsedContent, encoding));
        if (!this.blobsShaCache.has(hash)) {
            this.blobsShaCache.set(hash, "");
            const blob = await this.manager.createBlob(parsedContent, encoding);
            assert(hash === blob.sha, 0x0b6 /* "Blob.sha and hash do not match!!" */);
        }
        return hash;
    }

Workaround

  • The workaround is to base64-encode the compressed binary blob content together with its header.
  • readBlob must expect this.
  • readBlob searches for the header in the binary.
  • If it is not found, readBlob base64-decodes the content and searches for the header again.
  • If it is still not found, no decompression happens (see the sketch below).
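
Sketched with the helpers from the earlier snippets (readBlobHeader, decodeBlobBody, and the assumed JSON key-value encoding), the resulting read path would look roughly like this:

    // Search for the header in the raw bytes; if absent, retry once after
    // base64-decoding (the upload workaround); otherwise pass through.
    function maybeDecompressBlob(raw: ArrayBufferLike): Buffer {
        const buf = Buffer.from(raw);
        let header = readBlobHeader(buf);
        if (header === undefined) {
            header = readBlobHeader(Buffer.from(buf.toString("utf-8"), "base64"));
        }
        if (header === undefined) {
            return buf; // legacy blob: no decompression, passed as received
        }
        const { ALG } = JSON.parse(header.content.toString("utf-8"));
        return decodeBlobBody(ALG, header.body);
    }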

summaryStorageAdapter.ts

  • Contains the delegation class for the IDocumentStorageService interface, which applies the preprocessing and postprocessing hooks.
  • Contains a multi-hook class that can apply several hooks in sequence (for example, if we want to encrypt and compress).
  • Contains helper functions such as retrieving / replacing a blob in an ISummaryTree and traversing an ISummaryTree to collect blob paths.
  • Contains a factory function to instantiate the SummaryStorageAdapter.

summaryBlobProtocol.ts

  • Contains the BlobHeader class, which stores key/value pairs and generates the binary representation.
  • Contains a function that inserts the binary representation of a BlobHeader into a binary blob.
  • Contains a function that reads the BlobHeader from a binary blob.
  • Contains a function that strips the BlobHeader binary representation from a binary blob.

summaryStorageCompressionHooks.ts

  • Contains the implementations of the hooks.
  • Contains blob compression.
  • Contains blob decompression.
  • Contains the traversal that compresses the blobs in an ISummaryTree.

Questions and Arguments

  • Should summary compression be configurable? How should we configure it? IRuntimeOptions? (BaseContainerRuntimeFactory)
  • Can we use the same approach as message compression (specify a threshold and an algorithm)?
  • The compression info is stored in the compressed blob; would you prefer a flag on the ISummaryTree instead? That would probably require heavy changes within DDSes, as they load blobs directly (at least SharedPropertyTree does). Also, the intercepted methods on IDocumentStorageService, such as readBlob, never see the ISummaryTree.
  • If the service still wants to decompress, is it acceptable to rely on the blob header, or do we need an ISummaryTree entry indicating the algorithm?

milanro avatar Aug 19 '22 13:08 milanro

@vladsud @DLehenbauer @dstanesc This draft PR is a partial PoC implementation of general summary compression in the Fluid Framework. It is not fully implemented and is not expected to be merged into main. It is the basis for our further discussions, as it represents my view of how and where compression could be implemented. In any case, it is functional for the common use case (usual summary generation / load).

I believe it is especially relevant for @vladsud. I would like to ask you to take a look at it and open further discussion in this PR.

Just a note that this is based solely on OO, without functional programming; we can discuss further whether you are OK with that.

milanro avatar Aug 22 '22 09:08 milanro

Maybe also relevant for @justus-camp

milanro avatar Aug 22 '22 09:08 milanro

This is an excellent discussion starter and can work conceptually pretty well with the existing implementation (i.e., the property tree summarizer draws efficiency from maximizing blob size up to the platform limits, hence very good compression rates).

However, once incremental summaries are available, the definition of efficiency for summary transfer changes slightly. For instance, the ideal scenario is when summary blobs match the size of the actual changes.

A summary partitioning algorithm (whether probabilistic or explicit) will, for typical use cases, produce smaller blobs, to be optimized heuristically per individual domain/case but likely in the (min 8 KiB, avg 16 KiB, max 32 KiB) range. Such blobs do not compress very well.

The design I have in mind circles back to the original discussion on differentiating compression for data transfer from compression for data storage. In a sense this simplifies the problem (not the implementation!), as there are ubiquitous patterns we could apply. One example at hand is HTTP client-server compression, but probably closer to Fluid is Kafka message compression:

Compression is of full batches of data, so the efficacy of batching will also impact the compression ratio (more batching means better compression). See doc

Therefore, IMHO, a canonical design for summary compression would have to include, on the client:

  • support for fine granular increments/blobs
  • ability to batch the summary increments/blobs
  • ability to compress the batches of summary increments/blobs
  • ability to partition (fixed size) the compressed batches for efficient transfer

The server will need to provide support for the above.
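
A toy comparison (made-up inputs, Node's zlib as the codec) illustrates the batching point: compressing one concatenated batch typically beats compressing many small blobs individually, because the codec shares a single dictionary across them.

    import { deflateSync } from "zlib";

    // Compare per-blob compression against compressing one concatenated batch.
    function compressedSizes(blobs: Buffer[]): { individual: number; batched: number } {
        const individual = blobs.reduce((sum, b) => sum + deflateSync(b).length, 0);
        const batched = deflateSync(Buffer.concat(blobs)).length;
        // batched is typically smaller, especially for many small, similar blobs.
        return { individual, batched };
    }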

dstanesc avatar Aug 23 '22 12:08 dstanesc

A summary partitioning algorithm (whether probabilistic or explicit) will, for typical use cases, produce smaller blobs, to be optimized heuristically per individual domain/case but likely in the (min 8 KiB, avg 16 KiB, max 32 KiB) range. Such blobs do not compress very well.

I have a feeling that there will always be some optimal threshold, and maybe 8 KiB blobs will not be appropriate for, say, a 1 GB DDS (which would result in 131,072 blobs). Compression is also very relevant here for reducing storage space on the server. Maybe that is a strong reason why blobs should be bigger before compression.

milanro avatar Aug 23 '22 13:08 milanro

I touched on these topics above, but some more clarification of the presented viewpoint probably won't hurt:

I have a feeling that there will always be some optimal threshold, and maybe 8 KiB blobs will not be appropriate for, say, a 1 GB DDS (which would result in 131,072 blobs).

As mentioned above, chunk sizes should be correlated with the domain and use case (i.e., commit size). I see two possible scenarios motivating large summaries, both valid:

  1. Historical growth, possibly from small ops (I/O would benefit from configuring a smaller chunking threshold)
  2. The use case produces only large ops (I/O would benefit from configuring a larger chunking threshold)

Compression is also very relevant here for reducing storage space on the server.

As mentioned above, I have in mind the logical differentiation between compression for data transfer and compression for data storage. They are conceptually orthogonal and could be implemented as such. Moreover, one would typically have different compression preferences for in-flight data (speed) vs. data at rest (size).

dstanesc avatar Aug 23 '22 13:08 dstanesc

As mentioned above, I have in mind the logical differentiation between compression for data transfer and compression for data storage. They are conceptually orthogonal and could be implemented as such. Moreover, one would typically have different compression preferences for in-flight data (speed) vs. data at rest (size).

I expect that the server is not foreseen to do any further compression, i.e., to spend CPU time on such a heavy operation. I'd rather expect that the size of your data stored on the server influences the price of the service, and that it is in your best interest to keep it as small as possible.

milanro avatar Aug 23 '22 14:08 milanro

@fluid-example/bundle-size-tests: +8.33 KB
Metric Name             | Baseline Size | Compare Size | Size Diff
aqueduct.js             | 437.17 KB     | 441.23 KB    | +4.06 KB
connectionState.js      | 680 Bytes     | 680 Bytes    | No change
containerRuntime.js     | 230.47 KB     | 234.54 KB    | +4.07 KB
loader.js               | 153.32 KB     | 153.32 KB    | No change
map.js                  | 43.85 KB      | 43.91 KB     | +65 Bytes
matrix.js               | 136.65 KB     | 136.71 KB    | +65 Bytes
odspDriver.js           | 92.06 KB      | 92.06 KB     | No change
odspPrefetchSnapshot.js | 43.4 KB       | 43.4 KB      | No change
sharedString.js         | 157.19 KB     | 157.26 KB    | +65 Bytes
Total Size              | 1.37 MB       | 1.38 MB      | +8.33 KB

Baseline commit: 1c864bbe0ea60a870fe279011ff6d6ba00d36d49

Generated by :no_entry_sign: dangerJS against 59fdce0cc92fc70e84ddf618816f0ac132d13d4a

msfluid-bot avatar Oct 14 '22 11:10 msfluid-bot

@vladsud @DLehenbauer This PR is now ready for review and possible merging.

milanro avatar Oct 20 '22 05:10 milanro

@milanro - Just FYI - We're going to have more nit-picky feedback on this PR, but we do plan to merge it. Your approach is fundamentally correct.

@dstanesc - I was wondering if you were thinking of using the same pattern for chunking? (i.e., have a storage summary adapter that can be chained before or after the compression adapter?)

DLehenbauer avatar Nov 07 '22 23:11 DLehenbauer

@milanro - Just FYI - We're going to have more nit-picky feedback on this PR, but we do plan to merge it. Your approach is fundamentally correct.

@dstanesc - I was wondering if you were thinking of using the same pattern for chunking? (i.e., have a storage summary adapter that can be chained before or after the compression adapter?)

@dstanesc @DLehenbauer The facility for chaining the adapters (or so-called hooks) is already implemented in this PR. The next hook, performing the chunking, can be added when constructing the storage in containerRuntime. So I believe either @dstanesc should wait for this PR to be merged, or he should build upon it in a new branch.

milanro avatar Nov 08 '22 13:11 milanro

@dstanesc - I was wondering if you were thinking of using the same pattern for chunking? (i.e., have a storage summary adapter that can be chained before or after the compression adapter?)

I believe so. Adopting the pattern for summary compaction would definitely weigh in favor of reusable symmetry and the design intent for runtime opt-in features. Summary compaction would require a hook implementing a single method, i.e., onPreUploadSummaryWithContext.
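
For concreteness, such a hook could carry just the chunking policy. This sketch assumes the single-method hook shape mentioned above; the ISummaryTree traversal that would apply chunkBlob to each oversized blob inside onPreUploadSummaryWithContext is left out:

    class SummaryChunkingHooks {
        constructor(private readonly chunkSizeBytes: number) {}

        // Split one blob's bytes into fixed-size chunks; a full
        // onPreUploadSummaryWithContext would traverse the ISummaryTree
        // and apply this to every oversized blob.
        public chunkBlob(blob: Uint8Array): Uint8Array[] {
            const chunks: Uint8Array[] = [];
            for (let offset = 0; offset < blob.length; offset += this.chunkSizeBytes) {
                chunks.push(blob.subarray(offset, offset + this.chunkSizeBytes));
            }
            return chunks;
        }
    }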

dstanesc avatar Nov 08 '22 14:11 dstanesc

This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!
