SharedTree DDS

This epic tracks the delivery of a new Fluid DDS for efficiently modeling tree structured data, such as JSON and XML.

Milestones

M0: Concurrent design work for M1 deliverables

By convention, design deliverables are listed in the milestone preceding the dependent implementation work items. Therefore, we've informally introduced M0 to track M1's design dependencies, even though execution on M0 and M1 are happening concurrently.

Deliverables

#9658
#9475

M1: Basic Data Synchronization

M1 enables early adopters to begin using SharedTree for transient data synchronization scenarios. The op and storage formats will not yet be finalized and no data migration path will be provided. However, the basic Insert/Delete/Modify operations will be sufficiently robust to preview with non-persistent Fluid sessions.

At this stage, the designs for slices, move and durable Ids are understood, but may not yet be fully implemented. Undo/Redo may at this point not be integrated into the DDS.

Requirements:

Construct a changeset comprised of insert, delete, and modify operations
Commit a changeset to the local tree
Observe changes committed by local and remote clients

Deliverables

Changeset (operations)
- Insert new subtrees
- Modify node values
- Delete set ranges
- Builder (Accumulation)
- Reader / Writer (Serialization)
Rebase
- Insert / Delete / Modify
- Squash (if sandwich, else MT)
In-memory representation
- Apply changeset
- Serialize to changeset
DDS
- Collab window management
- Rebase management
- Change notification
- Summarization (esp. attach workflow for perf)
Design
- Move, Slice and Ids are understood
- Forward compatible summarization & op formats
- Understanding of where "lossy" squashes fit into roadmap
- JSON API

M1.5: JSON Editing

M1.5 enables users to collaboratively edit JSONish[^1] data using the SharedTree. M1.5 happens concurrently with the end of M1 and beginning of M2.

JSONish editing is exposed through a specialized API layer that ensures that tree edits are constrained to the JSONish domain. The underlying types and constraints used to implement this are hard coded. Support for custom schemas is planned for a future milestone.

[^1]: JSONish is a pragmatic interpretation of the JSON spec with respect to the JS type system. It includes objects, arrays, strings, float64, bool, and null.

Requirements

JSONish data losslessly round-trips to/from SharedTree
Edits are constrained to JSONish domain

Deliverables

Bijective mapping between SharedTree data model and JSONish
JSON types:
- map: object
- seq: array and string
- scalar: number (as F64), bool, null
Schema enforcement
API
- Reified nodes
- property: get/set
- array/string: insert/remove set/slice

M2: SharedTree MVP

M2 enables general use of SharedTree in persistent Fluid sessions for moderately sized trees of maps and sequences. It also delivers performance and size improvements based on real-world data gathered from early adopters, especially focusing on those that impact the persisted ops and summary formats.

Public APIs are still subject to change, but the tree data model will have settled and code can be mechanically migrated to future versions.

Tree size continues to be limited by client memory and bandwidth. Move and durable Ids are robust, and the operation and summary formats are finalized at v1. The tree may not yet be suitable for editing of high-density data, such as strings and numeric arrays.

Requirements

Application code can be mechanically migrated to future versions of SharedTree w/o data migration.
Atomically delete a slice range of nodes.
Atomically move a set/slice of nodes to a new location in the tree preserving their identity
Durable unique IDs per node, suitable for permalinks, etc.
Integrated Undo/Redo

Deliverables

Changeset:
- Move set
- Move slice
- Delete slice
- Invert
- Perf / size improvements
  - Squash (also req. to cancel changes if using sandwich rebase)
Rebase:
- Move set
- Move slice
- Delete slice
In-memory representation:
- Id <-> node map
DDS
- Look up node by id
- Id management
- Undo management
- A stable data model API
Runtime
- Id compression
Design
- Summary and constraints are understood

M3: Schema and Constraints

M3 helps developers tame the complexity of their applications by introducing schemas and constraints that reduce the number of incoherent data states that can arise from concurrent edits.

Schemas declaratively state what must be true before and after each operation. Constraints declaratively state what must remain true during the interval between when the client created the edit to when it is sequenced.

In both cases, a conflict prevents the operation from being applied, resulting in remote clients discarding the edit and the local client reverting it and notifying the application.

Requirements

Declarative schema language
Supported constraints:
- Subtree unchanged
- Input/output types

Deliverables

Changeset:
- Encode constraints
- Encode type info
- Encode semantic info (hierarchical edits)
Constraints
- 'Subtree unchanged'
Schema
- Parser
- Repository
DDS
- Conflict management
  - Detection
  - Revert
  - Handler

M4+: API and Modeling

M4 expands the audience for SharedTree by introducing higher level abstractions.

M5+: History and Branching

M5 enables applications to create and merge branches from arbitrary points in history.

M6+: Partial Checkout

M6 enables a SharedTree to scale beyond the memory and bandwith constraints of the client though partial views of the tree.

M7+: Non-fluid clients (e.g., GraphQL, Open API, etc.)

M7 provides services APIs for isolated reads and writes of the chunked tree storage. This makes it convenient to implement providers for GraphQL, Open API, and other REST-like APIs for non-Fluid clients to access the underlying data.

M8+: Indexed queries

M8 adds service-side support for building and maintaining indices into the tree data.

M9+: Fine-grained access control

M9 provides mechanisms for restricting access to subsets of the tree data.

Nov 16 '21 20:11 DLehenbauer

Hi @DLehenbauer, I'm trying to get a better grasp on the serde side of M1. In your feature, do you plan to allow serializing the whole tree or just the changeset itself? Thanks.

Feb 02 '22 02:02 ungrokkable

@ungrokkable - We've been working backwards toward API, starting with op and storage formats, so the discussion on the the M1 API is just spinning up. I just summarized the direction we're currently exploring in #8989 for the underlying API, with opportunity to ship an example/experimental higher level convenience API on top what #8989 describes in the same timeframe.

Feb 03 '22 02:02 DLehenbauer

Hi @DLehenbauer, I'm officially requesting a PropertyDDS-like interface for TreeDDS. I realize that #8989 is more for discussing the internal API. Do we have something equivalent for the higher level API? We obviously are flexible and won't require a perfect PropertyDDS interface, but we do want to ensure some of the functionality like binders are there. Thanks, and if I can help with anything, let me know.

Feb 10 '22 00:02 ungrokkable

For M7/M8, this comment nicely describes what integration with external systems and indexed queries look like at scale.

Feb 10 '22 19:02 DLehenbauer

@taylorsw04 - Notes on pending key decisions to integrate into plan

Data Model
- Subset of JS, C++, or something else (e.g., Intentional Unified Data Model)
- Which parts of schema are enforced on read vs. write
High-level API: General flavor and how early to start? (See @ungrokkable's request above)
Responsibility of low-level engine vs. high-level API
- Who reifies what and when
- Who caches what and when
Rebase
- Sandwich vs. Scaffolding
- Are we maximizing IDs when available?

Feb 15 '22 17:02 DLehenbauer

My WIP aggregation of the open decisions we have. I plan to build a topology at some point to help focus our efforts, and this is absolutely not complete. I'll keep folding in new information here.

Urgent (either impending or have implications for impending work):

"Engine"-side in-memory format
- How do we implement it such that neither an immutable nor mutable API takes a perf hit?
Tree abstraction
- Is the consumer-facing API different from the underlying representation?
  - Regardless, what is desired for the public API (JS/bespoke/etc.)?
- If JSON is the abstraction
  - How does this change merge resolution?
  - How does this affect schema?
  - Where can identity appear?
    - How does this affect serialization?
  - How does extensibility work (in WB model, anything can be added everywhere)?
- If bespoke/WB model is the abstraction
  - All scalars are leaf nodes?
  - Can scalars lack identity?
  - Can traits contain a mix of unboxed scalars and nodes?
  - Always reifying arrays vs. not always doing so?
  - Allow specialized collections (maps/sets)?
    - At leaves only?
    - Would they have identities?
  - Can identities be present but not usable for lookup?
- Do we expose an immutable or mutable API initially?
- Where is schema? On-read vs. on-write could impact representation.
Reifying
- Responsibility of low-level engine vs. high-level API
  - Who reifies what and when
  - Who caches what and when
Rebasing
- Do we want to rebase on the server?
- Should rebasing be temporal onto spatial or spatial onto spatial?
  - If you can only rebase via temporal, how do we rebase the anchors that are inputs to the command? Phantom node approach?
- Can we better leverage identity here?
- OT vs. MT/CRDT (aka scaffolding)?
  - If OT, exclusion or inversion?
How would this allow delta-only rebasing?
- If MT, how does it work?
How would this allow delta-only rebasing?
Can we leverage identities in the scaffolding?
Squashing
- How much can we squash intra-commit states?
- How much can we squash inter-commit states?
How do we want to allow custom/app-specific merge behavior?
- Rebase handler/conflict handler?
- Definitely want re-running of commands
  - How can we make this more efficient?
Constraints
- What types do we want?

Lower-pri:

Optional sub-transactions
- Do we want them?
Partial checkout
- How do moves work?
- How do constraints work?
Range semantics
- "Atomic range": do we want it/when do we want it?

Feb 15 '22 21:02 taylorsw04

@ruiterr - here is a table summarizing the differences between PropertyDDS and SharedTree when it comes to PropertyDDS constraints. Please let me if I got something wrong or forgot something on the PropertyDDS side.

Operation	Implicit PropertyDDS Constraint	SharedTree Behavior When Constraint Not Met
`insert` child at object field	field is empty	inserts before or after the existing content
`remove` child at object field	field is not empty	silent no-op
`modify` element at object field	field is not empty	conflicts (drops the modify)
`insert` element at array index	none	n/a
`remove` element at array index	elem is present	silent no-op
`modify` element at array index	index is not empty	conflicts (drops the modify)

A couple comments on the above:

In all cases, once SharedTree supports explicit constraints, we'll be able replicate the implicit PropertyDDS constraints exactly. In that sense, the third column only represents the default behavior of shared tree.
For the "insert child at object field" case, one approach we can use is to craft edits such that the field is guaranteed to be empty by always removing its contents prior to the insert.
When it comes to constraints, maps behave like objects in both systems.
When it comes to constraints, strings behave like arrays in both systems.

Mar 18 '22 16:03 yann-achard-MS

@yann-achard-MS I think you correctly described the current behaviour of PropertyDDS. At the moment, the conflicting cases are all silently handled. There should actually never be "constraint violations", but in the rebase with respect to the previous operation the CS is rewritten in such a way that it can be applied in a non conflicting way (a duplicate insert is dropped for fields, modification and remove are dropped if there is a parallel remove of the same element). Those cases are regarded as conflicts, but since we currently don't report conclicts to the app, they are just silently resolved.

For the "insert child at object field" case, one approach we can use is to craft edits such that the field is guaranteed to be empty by always removing its contents prior to the insert.

Would this require a slice remove (from beginning to end)? I think it won't be possible via a set remove. Let's assume we have the following situation:

Initial value: A Modification by client 1: [remove(A, setlike), insert(B, after A)]. Modification by client 2: [remove(A, setlike), insert(C, after A)].

Let's assumed client 1 is sequenced first. After its change is applied, we have Value: B

If we now rebase the change of client 2, with respect to the change of client 1, the remove(A) will be dropped, because it is a duplicated remove of the same element, the insert will be preserved, so the result of applying this would be Value: C, B

Mar 21 '22 10:03 ruiterr

@ruiterr - Thank you for clarifying what happens when the constraints are not met. Here's an attempt at characterizing the difference between the two systems:

Operation	Implicit PropertyDDS Constraint	SharedTree Behavior When Constraint Not Met	PropertyDDS Behavior When Constraint Not Met
`insert` child at object field	field is empty	inserts before or after the existing content	silent no-op
`remove` child at object field	field is not empty	silent no-op	silent no-op
`modify` element at object field	field is not empty	conflicts (drops the modify)	silent no-op
`insert` element at array index	none	n/a	n/a
`remove` element at array index	element is present	silent no-op	silent no-op
`modify` element at array index	index is not empty	conflicts (drops the modify)	silent no-op

Note that there's a key difference between SharedTree conflicts and PropertyDDS's slient no-ops: while the relevant operation is dropped in both cases, the whole transaction is dropped in SharedTree. To replicate the PropertyDDS behavior, a SharedTree client could choose to only perform one change per transaction (but commit several transactions in bulk). This doesn't seem to be a very efficient workaround as it will foster long chains of more complex rebases. It also doesn't mesh will with hierarchical edits, but I'm guessing current usages of PropertyDDS would not care about them.

I think the first row of the table is the only case for which we would need to use SharedTree's constraint system to be able to reproduce the PropertyDDS outcome. How much of an issue is that (given the work-around slice-delete trick)?

More generally, I'm also curious how you feel about having the current users of PropertyDDS/HFDM adopt SharedTree's model of transactions and constraints as opposed to trying to pretend it's not different. If you don't see it as an improvement, then we want to understand why that may be.

Would this require a slice remove (from beginning to end)?

Yes. That's exactly what I had in mind.

Mar 28 '22 17:03 yann-achard-MS

This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

Feb 01 '23 17:02 microsoft-github-policy-service[bot]

FluidFramework
FluidFramework copied to clipboard

Epic: SharedTree DDS

SharedTree DDS

Milestones

M0: Concurrent design work for M1 deliverables

Deliverables

M1: Basic Data Synchronization

Requirements:

Deliverables

M1.5: JSON Editing

Requirements

Deliverables

M2: SharedTree MVP

Requirements

Deliverables

M3: Schema and Constraints

Requirements

Deliverables

M4+: API and Modeling

M5+: History and Branching

M6+: Partial Checkout

M7+: Non-fluid clients (e.g., GraphQL, Open API, etc.)

M8+: Indexed queries

M9+: Fine-grained access control

FluidFramework FluidFramework copied to clipboard

Epic: SharedTree DDS

SharedTree DDS

Milestones

M0: Concurrent design work for M1 deliverables

Deliverables

M1: Basic Data Synchronization

Requirements:

Deliverables

M1.5: JSON Editing

Requirements

Deliverables

M2: SharedTree MVP

Requirements

Deliverables

M3: Schema and Constraints

Requirements

Deliverables

M4+: API and Modeling

M5+: History and Branching

M6+: Partial Checkout

M7+: Non-fluid clients (e.g., GraphQL, Open API, etc.)

M8+: Indexed queries

M9+: Fine-grained access control

FluidFramework
FluidFramework copied to clipboard