
Trade-offs: sparse document structure / syncing many documents?

Open andymatuschak opened this issue 3 years ago • 8 comments

I've been studying the new syncing protocol and storage format you all have been developing, and as an exercise, I was thinking about how one might build a local-first application like Notion.

I see two straightforward approaches: one Doc per page, and one Doc per workspace (where a workspace in Notion is an unbounded collection of pages controlled by some entity like an org).

My first instinct was to design with one Doc per page. But as far as I understand the new syncing protocol, this would make reconciliation of an entire workspace quite costly: you'd have to maintain and resolve a head list for each document, of which there are potentially many thousands. The bookkeeping in the syncing protocol seems to imply a preference for large-granularity documents.
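
To make that bookkeeping concrete, here is a minimal sketch of what one-Doc-per-page implies, assuming the sync API from the 1.0 previews (initSyncState / generateSyncMessage; the type names are approximate and not checked against the release): each peer relationship carries one sync state per document, and every sync round compares heads for every page, changed or not.

import * as Automerge from 'automerge'

type PageId = string

// One Doc per page means one SyncState per page, per peer; each sync round
// touches every page's heads even when nothing has changed.
const pages = new Map<PageId, Automerge.Doc<unknown>>()       // potentially thousands of entries
const syncStates = new Map<PageId, Automerge.SyncState>()     // sync state per page, for one peer

function syncRoundWithPeer(send: (pageId: PageId, msg: Automerge.BinarySyncMessage) => void) {
    for (const [pageId, doc] of pages) {
        const state = syncStates.get(pageId) ?? Automerge.initSyncState()
        const [nextState, msg] = Automerge.generateSyncMessage(doc, state)
        syncStates.set(pageId, nextState)
        if (msg !== null) send(pageId, msg)   // even a quiet round costs a per-page heads comparison
    }
}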

If we use one Doc per workspace, we get efficient syncing, but now we have to load the entire workspace into memory! Operations like load and save require reading and writing every note in the workspace. This approach would seem manageable if the in-memory representations of document collections could be sparse, lazy-loaded, thunkified, etc. It's not obvious to me whether the CRDT algorithms would permit that efficiently.

Anyway, I'm not sure if an issue is the right place to discuss this, so feel free to close or re-route me to a mailing list or whatever, but I'm curious how you all would handle this trade-off.

andymatuschak avatar Apr 22 '21 19:04 andymatuschak

This is a fine place for a conversation like this, though for more synchronous discussion I would also suggest the Automerge Slack.

Our last major automerge project, https://github.com/automerge/pushpin, was built around very granular documents. This had a lot of benefits, but the overhead of syncing many thousands of documents was high. We put quite a bit of work into making this successful in https://github.com/automerge/hypermerge, but I don't think I'd recommend adopting it in its current state for a production project for reasons discussed over on that repo.

One nice thing about that system (vector clocks + append-only hypercores) was that it supported synchronization without materializing the documents and you only had to load documents you cared about. There are a number of interesting challenges as well -- consider search functionality, for example.
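
To illustrate the idea (this is not hypermerge's actual code, just a sketch of the principle): a per-document vector clock is enough to decide what to request from a peer without loading the document itself.

// Sketch only, not hypermerge's implementation. A vector clock per document maps each
// feed/actor id to the highest sequence number we hold; comparing clocks tells us which
// ranges to request without materializing the document.
type VectorClock = Record<string, number>   // actorId -> highest sequence number held

function entriesToRequest(ours: VectorClock, theirs: VectorClock) {
    const wanted: Array<{ actor: string; from: number; to: number }> = []
    for (const [actor, theirSeq] of Object.entries(theirs)) {
        const ourSeq = ours[actor] ?? 0
        if (theirSeq > ourSeq) wanted.push({ actor, from: ourSeq + 1, to: theirSeq })
    }
    return wanted
}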

I think that, on the whole, there's a universal art to choosing the granularity of data. When should you have two JSON documents, or two SQLite databases, or two rows? My feeling is that an automerge document is best suited to being a granular unit of collaboration. What that means in various contexts is going to require further analysis and thought.

We have a number of ideas about scaling up and synchronizing whole repositories of documents, but owing to the length of time since the last major release we agreed to hold off on exploring those for now. Suffice to say, I suspect there is an efficient way to do for document heads what we've done with commits; it just has a few extra nuances (something, something socialist millionaire).

pvh avatar Apr 22 '21 20:04 pvh

Thanks for that thoughtful reply. I'm looking forward to learning more about socialist millionaires! :)

One nice thing about that system (vector clocks + append-only hypercores) was that it supported synchronization without materializing the documents and you only had to load documents you cared about. There are a number of interesting challenges as well -- consider search functionality, for example.

Right! I think this is a problem for big-single-Docs too: really, anytime you need a relatively expensive index or denormalization. It's something I thought about implementing through the observable API: OK, a change came in which modified subpath X, which now means I need to recompute denormalized index X'. I'm not sure this differs conceptually for the DocSet situation, except ofc that recomputing may involve materialization.
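
A rough sketch of that dependency bookkeeping (the names are hypothetical, not part of any Automerge API): map each denormalized index to the subpath it depends on, and recompute only the indexes whose subpath was touched by an incoming change.

// Hypothetical helper, not an Automerge API: given the paths touched by an incoming
// change, work out which denormalized indexes are now stale.
type Path = ReadonlyArray<string | number>

const indexDependencies: Record<string, Path> = {
    backlinks: ['pages'],   // index X' depends on subpath X
    tagIndex: ['tags'],
}

function staleIndexes(touchedPaths: Path[]): string[] {
    return Object.entries(indexDependencies)
        .filter(([, dep]) => touchedPaths.some(p => dep.every((seg, i) => p[i] === seg)))
        .map(([name]) => name)
}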

(Since it sounds like this issue is already on your back burner, feel free to close this whenever to keep your issues list hygienic)

andymatuschak avatar Apr 22 '21 21:04 andymatuschak

No, it's good to keep these threads rolling. I think this will be a big focus for me personally soon, so I'm happy to be doing thought-work on it now while we're doing hand-work on the 1.0 release.

One of the first challenges in synchronizing large numbers of documents is that nodes are likely to have overlapping but non-identical sets of documents, and neither side wants to disclose things the other doesn't know about (at least in our last system, knowing the ID of a document was evidence that a client should have access to it).

Assuming this is the case, we can approach this like the Socialist Millionaire Problem: we can exchange document heads without leaking knowledge about documents the other person doesn't have.

It's possible this sharing model is all wrong. Maybe we should have to explicitly request documents from specific peers we know have them, or maybe there's some kind of permission-graph traversal system... who knows? If you have intuitions here, please share them.

Now, on to full-text search -- my ideas here are at a much earlier stage of development -- but I have been thinking more about how this scales way up. I don't think there's much utility in competing with something like Google, but I can certainly imagine it would be nice to search across a bigger shared corpus than you have stored on your device. For example, a large Notion workspace might be more than a user wants to download to their phone, or the rendered tile-set for a mapping application could be nice to query.

For these use-cases I imagine some kind of distributed index with version metadata tagged to it, so a client can identify freshness. Guaranteeing consistency at scale isn't really a practical goal, but we could distribute and share the indexing work between participants in the system and possibly get something like eventual consistency.
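
One way to picture that version metadata (all names hypothetical, sketched under the assumption that document heads are content hashes): tag each index posting with the heads it was derived from, and let the client compare those against the heads it has synced.

// Sketch only: a posting records the document version it was built from, so a client
// can judge freshness locally, without any global consistency guarantee.
type Heads = string[]   // hashes of a document's current heads

interface Posting {
    docId: string
    derivedFromHeads: Heads
    positions: number[]
}

function isFresh(posting: Posting, knownHeads: Heads): boolean {
    return posting.derivedFromHeads.length === knownHeads.length &&
           posting.derivedFromHeads.every(h => knownHeads.includes(h))
}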

I think often about other kinds of data processing pipelines where users could benefit from each other's work. Imagine a local-first mapping application where a user could fetch pieces of OSM data and render them to tiles on their local device, or take advantage of rendering tiles en masse in the cloud if they have access to compute resources, or fetch tiles from a local peer over Bluetooth on a dusty highway somewhere in the back country.

Overall, this indexing / derived-documents question is a research project just waiting to happen, but I think it makes sense to sequence it for somewhat later. There are a bunch of foundational performance and developer-experience problems we still need to solve first; right now those are what stop developers attempting local-first projects from getting far enough for indexing to be the thing that does them in. That said, if you (or anyone else) want to take a stab at it, let me know! I'd love to collaborate on something.

pvh avatar Apr 22 '21 22:04 pvh

Oh, dear, I hadn't even thought about potential security issues with larger docsets. My instinct here is to pursue some kind of permission graph traversal, as you suggest, and to stop relying on ID knowledge as security. But this certainly adds complexity!

I'm afraid I'm way out of my depth regarding distributed indexing. Though I've been intrigued by casual looks at The Graph.

I don't have the capacity to push either of these problems forward right now, but I hope the stars do align for a collaboration at some point.

andymatuschak avatar Apr 23 '21 03:04 andymatuschak

I can relate! I think all the problems here are tractable and should be fun to work on, it's just not really time quite yet. I think it would also be useful to consider UnisonLang as at least a conceptual inspiration: content-addressed function calls with a data-distribution system in the runtime.

pvh avatar Apr 23 '21 05:04 pvh

On document granularity, my suggestion would be to think of a document as the unit of sharing. For example, in Google Docs, you can choose who may access a particular document on a per-document level; there are no finer-grained access controls for sections or even paragraphs within a document, so sharing a document is all or nothing. This is a product design question as much as a technical one.

I have some ideas for syncing large collections of documents efficiently. One observation is that in an app like Notion, there are likely to be two types of sync: 1. the initial sync with a new device that has nothing; 2. a subsequent incremental sync in which 99% of documents have not changed. The former needs an efficient bulk transfer, while the latter can be addressed using anti-entropy techniques that were developed for replicated databases. Merkle trees are one way of implementing efficient anti-entropy.
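
A minimal sketch of the Merkle-tree approach (not an Automerge API; it assumes both sides build a tree of the same shape over their (docId, heads) pairs): identical subtree hashes are skipped wholesale, and only the differing leaf buckets name documents that need an incremental sync.

// Anti-entropy sketch: compare two same-shaped Merkle trees over (docId, heads) pairs
// and collect the documents in any bucket whose hashes differ.
interface MerkleNode {
    hash: string
    children?: MerkleNode[]   // absent on leaf buckets
    docIds?: string[]         // documents covered by a leaf bucket
}

function docsToSync(a: MerkleNode, b: MerkleNode, out: string[] = []): string[] {
    if (a.hash === b.hash) return out                       // identical subtrees: skip (the 99% case)
    if (!a.children || !b.children) {
        out.push(...(a.docIds ?? []), ...(b.docIds ?? []))  // differing bucket: sync these documents
        return out
    }
    a.children.forEach((child, i) => docsToSync(child, b.children![i], out))
    return out
}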

On access control, I've been working with colleagues on protocols that encrypt data such that those users who can decrypt a piece of data are exactly the current set of collaborators at the time that data was sent. Unlike Pushpin, where knowing the ID of a document gives you access to that document forever, our protocol makes it possible to remove collaborators (removed users can still decrypt data from the time they were active collaborators, but they cannot read data that was added after they were removed).

And a small nitpick: I think what you want here is Private Set Intersection, not the Socialist Millionaire problem.

ept avatar Apr 23 '21 09:04 ept

Hello, what CRDT are you going to use to store document data?

DiamondYuan avatar May 16 '21 04:05 DiamondYuan

const doc: RGANode = {
    content: 'data',
    children: [{
        content: '1.0',
        children: [{
            content: '1.1',
            children: []
        }, {
            content: '1.2',
            children: []
        }]
    }]
}


// A list CRDT supporting insert, delete and move of its elements
type MoveListCRDT<T> = Array<T>
type uuid = string

type RGANode = {
    content: string
    children: MoveListCRDT<RGANode>
}

I tried to use a MoveListCRDT, but it doesn't support moving a whole subtree (such as indenting a block).
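
One common workaround for the subtree-move problem (a sketch only, with hypothetical names; not a built-in Automerge move operation) is to store the outline as a flat map of nodes keyed by id, each holding a parent pointer and an ordering key. Moving a subtree, such as an indent, then becomes a one-field update on a single node, and its descendants follow implicitly. Concurrent moves can still create cycles, which needs its own resolution rule.

// Flat node map: "move subtree" = reassign one parent pointer.
type NodeId = string

interface BlockNode {
    id: NodeId
    content: string
    parent: NodeId | null   // null for the root
    order: string           // e.g. a fractional/lexicographic key for sibling ordering
}

type OutlineDoc = { nodes: { [id: string]: BlockNode } }

// Indent a block by re-parenting it under its previous sibling.
function indent(doc: OutlineDoc, id: NodeId, newParent: NodeId, order: string): void {
    doc.nodes[id].parent = newParent
    doc.nodes[id].order = order
}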

DiamondYuan avatar May 16 '21 04:05 DiamondYuan