Prototype to see how hard it is to solve the 1 MB Kafka / socket.io limit
For more info, please see the description of the algorithm in https://github.com/microsoft/FluidFramework/issues/7599. I've created the EPIC https://github.com/microsoft/FluidFramework/issues/9023 to track all the issues that were opened related to this problem space (for easier tracking).
This is not a finished product, but I've validated that it works (using the samples). The question I was trying to answer is: would it be easier to solve the underlying problem, or to work on various mitigations? For example, the WB team is asking us to address the infinite-reconnection problem that happens on the 1 MB limit, which has no direct fix because socket.io does not provide enough data about the reason for a disconnect.
Work remaining:
- Redesign existing UTs / Add proper UTs for this code
- Re-work the flushing mechanism:
  - Change the DeltaManager layer to flush ops in smaller chunks, i.e. at most 16 at a time (see the sketch after this list).
  - Clear the DM queue on disconnect - the container runtime will resubmit just fine. This will address all the problems we have with size limits.
- Break the work into stages and submit it in parts:
  - All the code except the changes to flushing logic in DM can be shipped together in phase 1.
- Re-think abstractions and reduce leaking of concepts. For example, the referenceSequenceNumber for summaries should likely be stamped by an adapter layer that hides the sequence-number remapping from the summarizer; entities, for the most part, should not be aware of it.
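
For the flushing change above, here is a minimal sketch of the idea. The names (`OutboundOpQueue`, `maxOpsPerFlush`, `sendBatch`) are illustrative assumptions, not the actual DeltaManager API:

```typescript
// Hypothetical sketch: flush queued ops in slices of at most 16, and drop the
// queue on disconnect so the container runtime can resubmit after reconnect.
interface OutboundMessage {
    contents: unknown;
    metadata?: Record<string, unknown>;
}

const maxOpsPerFlush = 16;

class OutboundOpQueue {
    private pending: OutboundMessage[] = [];

    public push(message: OutboundMessage): void {
        this.pending.push(message);
    }

    // Send at most `maxOpsPerFlush` ops per emit to stay well below the
    // 1 MB payload limit on the socket.io / Kafka path.
    public flush(sendBatch: (messages: OutboundMessage[]) => void): void {
        while (this.pending.length > 0) {
            sendBatch(this.pending.splice(0, maxOpsPerFlush));
        }
    }

    // On disconnect, drop everything; the container runtime tracks its pending
    // local ops and resubmits them once the connection is re-established.
    public clearOnDisconnect(): void {
        this.pending.length = 0;
    }
}
```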
Possible Future changes:
- Simplify the protocol and remove the batch concept from it.
- Come up with an algorithm that allows us to start sending ops earlier, before the batch is complete (see the sketch after this list). Currently I rely on the fact that the first message in the batch communicates the size of the batch, which lets me add stricter correctness asserts. I think we can change that to communicate only the known head of the batch, and keep re-communicating that each op is a continuation of the batch and that more ops are coming. That way ops can move through the system faster, even for big batches.
- We could make some changes to chunking processing, but I would not touch it for now (even though we could reduce the amount of memory used here by ensuring that a chunked op is always part of a batch; this would also remove the need to store chunked-op info in summaries, since chunked ops would always be processed in one go).
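
To illustrate the batch-head idea above, here is a rough sketch of the two marking schemes. The field names are illustrative assumptions, not the actual Fluid protocol metadata:

```typescript
// Current scheme (as used by the prototype): the first op of the batch
// declares how many ops the batch contains, so the receiver can assert it has
// the whole batch before processing it.
interface BatchMetadataCurrent {
    batchSize?: number;
}

// Proposed scheme: the first op marks the head of the batch; every subsequent
// op re-states that it continues the same batch and whether more ops are still
// coming, so ops can be forwarded before the batch is fully assembled.
interface BatchMetadataProposed {
    batchHead?: boolean;
    batchContinues?: boolean;
}

// Receiver-side sketch: with the proposed scheme the batch is complete once an
// op arrives that explicitly says nothing more is coming.
function isBatchComplete(metadata: BatchMetadataProposed): boolean {
    return metadata.batchContinues === false;
}
```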
My guess is that one person (with focused time) could have a fully-ready-to-go solution in a week. But in order to deliver the feature, we will need to ship some code dark first (the reading code, and the code that puts the size of the batch in metadata) and get to saturation, and only after that ship the remaining code. That will likely take 1.5 months end-to-end. So even in the best case, the March 15 date that I saw in Teams is not achievable.
Thoughts on next steps?