FluidFramework
FluidFramework copied to clipboard
Epic: SharedTree DDS
SharedTree DDS
This epic tracks the delivery of a new Fluid DDS for efficiently modeling tree structured data, such as JSON and XML.
Milestones
M0: Concurrent design work for M1 deliverables
By convention, design deliverables are listed in the milestone preceding the dependent implementation work items. Therefore, we've informally introduced M0 to track M1's design dependencies, even though execution on M0 and M1 are happening concurrently.
Deliverables
- #9658
- #9475
M1: Basic Data Synchronization
M1 enables early adopters to begin using SharedTree for transient data synchronization scenarios. The op and storage formats will not yet be finalized and no data migration path will be provided. However, the basic Insert/Delete/Modify operations will be sufficiently robust to preview with non-persistent Fluid sessions.
At this stage, the designs for slices, move and durable Ids are understood, but may not yet be fully implemented. Undo/Redo may at this point not be integrated into the DDS.
Requirements:
- Construct a changeset comprised of insert, delete, and modify operations
- Commit a changeset to the local tree
- Observe changes committed by local and remote clients
Deliverables
- Changeset (operations)
- Insert new subtrees
- Modify node values
- Delete set ranges
- Builder (Accumulation)
- Reader / Writer (Serialization)
- Rebase
- Insert / Delete / Modify
- Squash (if sandwich, else MT)
- In-memory representation
- Apply changeset
- Serialize to changeset
- DDS
- Collab window management
- Rebase management
- Change notification
- Summarization (esp. attach workflow for perf)
- Design
- Move, Slice and Ids are understood
- Forward compatible summarization & op formats
- Understanding of where "lossy" squashes fit into roadmap
- JSON API
M1.5: JSON Editing
M1.5 enables users to collaboratively edit JSONish[^1] data using the SharedTree. M1.5 happens concurrently with the end of M1 and beginning of M2.
JSONish editing is exposed through a specialized API layer that ensures that tree edits are constrained to the JSONish domain. The underlying types and constraints used to implement this are hard coded. Support for custom schemas is planned for a future milestone.
[^1]: JSONish is a pragmatic interpretation of the JSON spec with respect to the JS type system. It includes objects, arrays, strings, float64, bool, and null.
Requirements
- JSONish data losslessly round-trips to/from SharedTree
- Edits are constrained to JSONish domain
Deliverables
- Bijective mapping between SharedTree data model and JSONish
- JSON types:
- map: object
- seq: array and string
- scalar: number (as F64), bool, null
- Schema enforcement
- API
- Reified nodes
- property: get/set
- array/string: insert/remove set/slice
M2: SharedTree MVP
M2 enables general use of SharedTree in persistent Fluid sessions for moderately sized trees of maps and sequences. It also delivers performance and size improvements based on real-world data gathered from early adopters, especially focusing on those that impact the persisted ops and summary formats.
Public APIs are still subject to change, but the tree data model will have settled and code can be mechanically migrated to future versions.
Tree size continues to be limited by client memory and bandwidth. Move and durable Ids are robust, and the operation and summary formats are finalized at v1. The tree may not yet be suitable for editing of high-density data, such as strings and numeric arrays.
Requirements
- Application code can be mechanically migrated to future versions of SharedTree w/o data migration.
- Atomically delete a slice range of nodes.
- Atomically move a set/slice of nodes to a new location in the tree preserving their identity
- Durable unique IDs per node, suitable for permalinks, etc.
- Integrated Undo/Redo
Deliverables
- Changeset:
- Move set
- Move slice
- Delete slice
- Invert
- Perf / size improvements
- Squash (also req. to cancel changes if using sandwich rebase)
- Rebase:
- Move set
- Move slice
- Delete slice
- In-memory representation:
- Id <-> node map
- DDS
- Look up node by id
- Id management
- Undo management
- A stable data model API
- Runtime
- Id compression
- Design
- Summary and constraints are understood
M3: Schema and Constraints
M3 helps developers tame the complexity of their applications by introducing schemas and constraints that reduce the number of incoherent data states that can arise from concurrent edits.
Schemas declaratively state what must be true before and after each operation. Constraints declaratively state what must remain true during the interval between when the client created the edit to when it is sequenced.
In both cases, a conflict prevents the operation from being applied, resulting in remote clients discarding the edit and the local client reverting it and notifying the application.
Requirements
- Declarative schema language
- Supported constraints:
- Subtree unchanged
- Input/output types
Deliverables
- Changeset:
- Encode constraints
- Encode type info
- Encode semantic info (hierarchical edits)
- Constraints
- 'Subtree unchanged'
- Schema
- Parser
- Repository
- DDS
- Conflict management
- Detection
- Revert
- Handler
- Conflict management
M4+: API and Modeling
M4 expands the audience for SharedTree by introducing higher level abstractions.
M5+: History and Branching
M5 enables applications to create and merge branches from arbitrary points in history.
M6+: Partial Checkout
M6 enables a SharedTree to scale beyond the memory and bandwith constraints of the client though partial views of the tree.
M7+: Non-fluid clients (e.g., GraphQL, Open API, etc.)
M7 provides services APIs for isolated reads and writes of the chunked tree storage. This makes it convenient to implement providers for GraphQL, Open API, and other REST-like APIs for non-Fluid clients to access the underlying data.
M8+: Indexed queries
M8 adds service-side support for building and maintaining indices into the tree data.
M9+: Fine-grained access control
M9 provides mechanisms for restricting access to subsets of the tree data.
Hi @DLehenbauer, I'm trying to get a better grasp on the serde side of M1. In your feature, do you plan to allow serializing the whole tree or just the changeset itself? Thanks.
@ungrokkable - We've been working backwards toward API, starting with op and storage formats, so the discussion on the the M1 API is just spinning up. I just summarized the direction we're currently exploring in #8989 for the underlying API, with opportunity to ship an example/experimental higher level convenience API on top what #8989 describes in the same timeframe.
Hi @DLehenbauer, I'm officially requesting a PropertyDDS-like interface for TreeDDS. I realize that #8989 is more for discussing the internal API. Do we have something equivalent for the higher level API? We obviously are flexible and won't require a perfect PropertyDDS interface, but we do want to ensure some of the functionality like binders are there. Thanks, and if I can help with anything, let me know.
For M7/M8, this comment nicely describes what integration with external systems and indexed queries look like at scale.
@taylorsw04 - Notes on pending key decisions to integrate into plan
- Data Model
- Subset of JS, C++, or something else (e.g., Intentional Unified Data Model)
- Which parts of schema are enforced on read vs. write
- High-level API: General flavor and how early to start? (See @ungrokkable's request above)
- Responsibility of low-level engine vs. high-level API
- Who reifies what and when
- Who caches what and when
- Rebase
- Sandwich vs. Scaffolding
- Are we maximizing IDs when available?
My WIP aggregation of the open decisions we have. I plan to build a topology at some point to help focus our efforts, and this is absolutely not complete. I'll keep folding in new information here.
Urgent (either impending or have implications for impending work):
- "Engine"-side in-memory format
- How do we implement it such that neither an immutable nor mutable API takes a perf hit?
- Tree abstraction
- Is the consumer-facing API different from the underlying representation?
- Regardless, what is desired for the public API (JS/bespoke/etc.)?
- If JSON is the abstraction
- How does this change merge resolution?
- How does this affect schema?
- Where can identity appear?
- How does this affect serialization?
- How does extensibility work (in WB model, anything can be added everywhere)?
- If bespoke/WB model is the abstraction
- All scalars are leaf nodes?
- Can scalars lack identity?
- Can traits contain a mix of unboxed scalars and nodes?
- Always reifying arrays vs. not always doing so?
- Allow specialized collections (maps/sets)?
- At leaves only?
- Would they have identities?
- Can identities be present but not usable for lookup?
- Do we expose an immutable or mutable API initially?
- Where is schema? On-read vs. on-write could impact representation.
- Is the consumer-facing API different from the underlying representation?
- Reifying
- Responsibility of low-level engine vs. high-level API
- Who reifies what and when
- Who caches what and when
- Responsibility of low-level engine vs. high-level API
- Rebasing
- Do we want to rebase on the server?
- Should rebasing be temporal onto spatial or spatial onto spatial?
- If you can only rebase via temporal, how do we rebase the anchors that are inputs to the command? Phantom node approach?
- Can we better leverage identity here?
- OT vs. MT/CRDT (aka scaffolding)?
- If OT, exclusion or inversion?
- How would this allow delta-only rebasing?
- If MT, how does it work?
- How would this allow delta-only rebasing?
- Can we leverage identities in the scaffolding?
- Squashing
- How much can we squash intra-commit states?
- How much can we squash inter-commit states?
- How do we want to allow custom/app-specific merge behavior?
- Rebase handler/conflict handler?
- Definitely want re-running of commands
- How can we make this more efficient?
- Constraints
- What types do we want?
Lower-pri:
- Optional sub-transactions
- Do we want them?
- Partial checkout
- How do moves work?
- How do constraints work?
- Range semantics
- "Atomic range": do we want it/when do we want it?
@ruiterr - here is a table summarizing the differences between PropertyDDS and SharedTree when it comes to PropertyDDS constraints. Please let me if I got something wrong or forgot something on the PropertyDDS side.
Operation | Implicit PropertyDDS Constraint | SharedTree Behavior When Constraint Not Met |
---|---|---|
insert child at object field |
field is empty | inserts before or after the existing content |
remove child at object field |
field is not empty | silent no-op |
modify element at object field |
field is not empty | conflicts (drops the modify) |
insert element at array index |
none | n/a |
remove element at array index |
elem is present | silent no-op |
modify element at array index |
index is not empty | conflicts (drops the modify) |
A couple comments on the above:
- In all cases, once SharedTree supports explicit constraints, we'll be able replicate the implicit PropertyDDS constraints exactly. In that sense, the third column only represents the default behavior of shared tree.
- For the "insert child at object field" case, one approach we can use is to craft edits such that the field is guaranteed to be empty by always removing its contents prior to the insert.
- When it comes to constraints, maps behave like objects in both systems.
- When it comes to constraints, strings behave like arrays in both systems.
@yann-achard-MS I think you correctly described the current behaviour of PropertyDDS. At the moment, the conflicting cases are all silently handled. There should actually never be "constraint violations", but in the rebase with respect to the previous operation the CS is rewritten in such a way that it can be applied in a non conflicting way (a duplicate insert is dropped for fields, modification and remove are dropped if there is a parallel remove of the same element). Those cases are regarded as conflicts, but since we currently don't report conclicts to the app, they are just silently resolved.
For the "insert child at object field" case, one approach we can use is to craft edits such that the field is guaranteed to be empty by always removing its contents prior to the insert.
Would this require a slice remove (from beginning to end)? I think it won't be possible via a set remove. Let's assume we have the following situation:
Initial value: A
Modification by client 1: [remove(A, setlike), insert(B, after A)]
.
Modification by client 2: [remove(A, setlike), insert(C, after A)]
.
Let's assumed client 1 is sequenced first. After its change is applied, we have Value: B
If we now rebase the change of client 2, with respect to the change of client 1, the remove(A) will be dropped, because it is a duplicated remove of the same element, the insert will be preserved, so the result of applying this would be Value: C, B
@ruiterr - Thank you for clarifying what happens when the constraints are not met. Here's an attempt at characterizing the difference between the two systems:
Operation | Implicit PropertyDDS Constraint | SharedTree Behavior When Constraint Not Met | PropertyDDS Behavior When Constraint Not Met |
---|---|---|---|
insert child at object field |
field is empty | inserts before or after the existing content | silent no-op |
remove child at object field |
field is not empty | silent no-op | silent no-op |
modify element at object field |
field is not empty | conflicts (drops the modify) | silent no-op |
insert element at array index |
none | n/a | n/a |
remove element at array index |
element is present | silent no-op | silent no-op |
modify element at array index |
index is not empty | conflicts (drops the modify) | silent no-op |
Note that there's a key difference between SharedTree conflicts and PropertyDDS's slient no-ops: while the relevant operation is dropped in both cases, the whole transaction is dropped in SharedTree. To replicate the PropertyDDS behavior, a SharedTree client could choose to only perform one change per transaction (but commit several transactions in bulk). This doesn't seem to be a very efficient workaround as it will foster long chains of more complex rebases. It also doesn't mesh will with hierarchical edits, but I'm guessing current usages of PropertyDDS would not care about them.
I think the first row of the table is the only case for which we would need to use SharedTree's constraint system to be able to reproduce the PropertyDDS outcome. How much of an issue is that (given the work-around slice-delete trick)?
More generally, I'm also curious how you feel about having the current users of PropertyDDS/HFDM adopt SharedTree's model of transactions and constraints as opposed to trying to pretend it's not different. If you don't see it as an improvement, then we want to understand why that may be.
Would this require a slice remove (from beginning to end)?
Yes. That's exactly what I had in mind.
This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!