automerge-classic icon indicating copy to clipboard operation
automerge-classic copied to clipboard

Matrix CRDT

Open kpruden opened this issue 6 years ago • 8 comments

Have you given much though to natively supporting a matrix data structure? That is an n x m table of cells, which supports manipulating rows and columns in such a way that the rectangular structure is maintained.

The obvious way to implement this, a 2-dimensional array, does not work in the face of concurrent structural changes. To see why, consider a simple 2 x 2 matrix:

| a  b |
| 1  2 |

In javascript (or any programming language really), the standard way to represent such a structure is with an array-of-arrays:

[
  [ a, b ],
  [ 1, 2 ]
]

Now, imagine two clients, X and Y are editing this structure concurrently.

Suppose X wants to insert a row in between the two existing rows. Mechanically, this is accomplished by creating a new array of length 2 (the number of columns currently in the table), and inserting into the top-level (rows) array. This produces the following result:

[
  [ a,  1  ],
  [ X1, X2 ],
  [ b,  2  ],
]

In Automerge, this would look something like this:

docX = Automerge.change(docX, doc => {
  doc.rows.splice(1, 0, [ 'X1', 'X2' ]);
});

Meanwhile, suppose client Y inserts a column at the beginning of the table. Mechanically, this is accomplished by inserting an element into each of the existing rows, at the appropriate position, and produces the following result:

[
  [ Y1, a, 1 ],
  [ Y2, b, 2 ],
]

The code for this would look like:

docY = Automerge.change(docY, doc => {
  doc.rows[0].splice(0, 0, 'Y1');
  doc.rows[1].splice(0, 0, 'Y2');
});

When merging these two changes, the result would end up looking like this:

[
 [ Y1, a, 1 ],
 [ X1, X2 ],
 [ Y2, b, 2 ],
]

When in fact the correct result should be more like this:

[
 [ Y1, a, 1 ],
 [ null, X1, X2 ],
 [ Y2, b, 2 ]
]

In my application, I've worked around this problem by representing a matrix as two arrays, one of column IDs and one of row IDs, representing the structure of the matrix, and storing cells separately, with each containing a reference to its row and column ID. My application can then stitch together a temporary 2-dimensional array for presentation, but has to manage structure by manipulating the row/column arrays.

I've done a small amount of searching for any literature on matrix CRDT algorithms, but haven't been able to find much.

When I have more time, I'm willing to take a stab at an implementation, but my theoretical rigor is not strong enough to devise an algorithm on my own, so any pointers would be welcome..

kpruden avatar Jun 21 '18 22:06 kpruden

Hi @kpruden, I was going to suggest exactly the solution that you've already come up with (an array of rows, an array of columns, and a map from (rowId,columnId) to cell contents). I think that's the best way of doing it, and if we wanted to support a matrix structure out of the box in Automerge, I think it would best be done as a wrapper around that structure. I'm not aware of any other definition of a matrix CRDT.

It might be worth exploring if we can define such a matrix wrapper as an extension library outside of the core of Automerge. That way we could keep core Automerge support for a small number of basic datatypes, and have a selection of extensions that provide higher-level datatypes on top of the basic built-in ones.

ept avatar Jun 22 '18 13:06 ept

While thinking about table editing in text documents, I essentially came up with the same solution described above - store rows as a list of guids, store columns as a list of guids, then store a (rowid,colid) -> contents map. I later saw a similar, but more complex approach in Apple Notes.

If you allow row/column reordering, conflicting operations could cause duplicates (two users move the same column at the same time). You'll need some additional rules to deal with this (e.g. ignore all but the greatest timestamp+serverid).

In Notes.app (which is CvRDT based), Apple builds up an OrderedSet CRDT to handle these lists. Their implementation is a little convoluted - a list CRDT contains guids that are not the rowids (the list is actually implemented on top of their richtext CRDT), then there is a map of these guids to rowids, and finally another wrapper with a set of deleted guids. (To deal with delete/move conflicts, I believe.)

They then use an ordered set for rowids, a second one for colids, and a map of colid -> rowid -> text crdt for the table contents.

(None of this is documented, I reverse engineered it so I can backup/export my notes.)

For automerge, an OrderedSet construct might be generally useful. Dunno if it's appropriate for the core library though.

dunhamsteve avatar Jul 08 '18 18:07 dunhamsteve

@dunhamsteve Oh wow, I didn't know Apple's Notes.app used CRDTs! That sounds like a really interesting case study. If you have the time to write up the results of your reverse engineering as a blog post, I'm sure a lot of people would find it very interesting.

What exactly do you mean with an OrderedSet? Is it like a list, but maintaining uniqueness of members?

ept avatar Jul 09 '18 10:07 ept

Yeah OrderedSet would essentially a list of objects maintaining uniqueness. On reflection, it's gonna be hard to get one-size-fits-all on this. It looks like Apple implemented CRSet and then wrapped it with CRTombstonedOrderedSet when they realized they needed tombstones to get their delete semantics.

I left out the clock/CRDT stuff, but I have details of the data side at:

https://github.com/dunhamsteve/notesutils/blob/master/notes.md

Mostly it's (client_id,lamport timestamp) pairs. I haven't worked out all the details of their algorithms, but I'll write something up for posterity and let you know.

A rambling, disorganized preview:

The "topotext" construct is their richtext "CRDT" (I suspect there is a bug in the design). I think it's older than the rest. It's structured as a string, a list of attribute runs, and a list of the data for the CRDT.

I couldn't match up topotext to anything existing, I suspect it's homegrown. It is maybe kinda like RGA, but they keep forward pointers (multiple pointers in the case of conflicts) forming a DAG, kinda like Grishchenko's Causal Trees. The entries have a (lamport_stamp,client_id) pair for the nodes, a separate one for the attributes, and a tombstone flag. Nodes correspond to a run of characters with the lamport timestamp increasing by one for each character. They're split on insert or attribute changes.

I believe I caught it putting a newer node after an older one on conflict, so it may not actually be a proper CRDT.

The drawings and tables are stored in a more generic "Document" construct - a serialization format that reminds me of KeyedArchives - basically a list of objects threaded together by index. There also are lists of fieldname, uuid, and typename to make the encoding a little more compact. Typenames include things like 'com.apple.CRDT.CRTree' which initially pointed me in the CRDT direction.

The actual data are nested dictionaries, registers, topotext, etc. (Somewhat obtusely, they implemented an array on top of their richtext construct, using dummy characters and adding a map of character position -> array entry.)

dunhamsteve avatar Jul 09 '18 22:07 dunhamsteve

I've faced the similar problem recently and my application requires very large matrices - 1M rows and 1M columns for the start then user could add/remove some. Now it is not desireble to "pre create" a such matrix, so initially there are no elements in the editing history and columns/rows/cells are actually gets allocated when user touch some cell. And that is in some way contradicts to what user observes - 1Mx1M empty matrix.

Now the problem arises if 2 users are touching same cell (10, 10) for example. I want it to be the same cell, i.e column 10 and row 10 should get same ID on two disjoined clients. Solution I come up with so far was to use Block Wise RGA, which allows to generate initial history looks like so:

columns: {"uid": "b588"; empty cols block, size = 2^32}
rows: {"uid": "a1d5"; empty rows block, size = 2^32}
cells: []

If clients now try to reffer same col/row from "pre-existing" sets both will "split" same block and come up with same row/col IDs, something like "b588:10" and "a1d5:10". Along side both are still able to insert new rows/columns at any positions and model convergence at the end.

Then I come across automerge and realize algorithm/data structure I was designing is very close to what automerge already does: a tree-like document with update paths, casual deps and UUID per node, e.t.c. Automerge is already written and has except excellent corectness proof (still diving into it :D), so it is worth to try it instead of designing my own lib from the groud up.

And now my questions:

  • What is the authentic way to support block-wise RGA in automerge document? Could I add new CRDT types without touching automerge core?
  • Maybe I could/should solve same problem (pre created array elements) by some other way when is using automerge?

xphoenix avatar May 24 '21 13:05 xphoenix

Hi @xphoenix, adding new CRDT object types or new operation types currently requires changes to automerge core. I'm also not sure whether a block-wise/splitting algorithm is really a much better fit for your use case. Can you use Automerge's existing list datatype, and only insert a row or a column into this list whenever you first write a non-empty cell to that row/column? That way empty rows or columns will take zero space.

If you want to represent a matrix with two non-empty cells at (x=100, y=100) and (x=200, y=100), and everything else empty, one way of representing it might be:

{
  "rows": [{emptyBefore: 99, ID: ID_ROW_100}],
  "columns": [{emptyBefore: 99, ID: ID_COL_100}, {emptyBefore: 99, ID: ID_COL_200}],
  "cells": {
    "ID_COL_100,ID_ROW_100": "cell1",
    "ID_COL_200,ID_ROW_100": "cell2"
  }

Those emptyBefore properties indicate that the list of columns has first 99 blank columns, then one non-blank column with ID ID_COL_100, then another 99 blank columns, then another non-blank column with ID ID_COL_200. If you insert/delete blank columns, you just need to increase or decrease that number.

ept avatar May 25 '21 13:05 ept

Hello @ept, thanks a lot for you attention to my case ;)

Maybe I've defined my problem not clearly, but it sounds like so at the core:

  • There is a huge matrix 1Mx1M of cells
  • Users might reffer any cells
  • Cells are Last Write Win registers
  • And also clients are able to add new columns/rows at any arbitary possition

Disclaimer

Im not a professional in CRDTs, it is still quite new topic for me, so Im honestly sorry for any stupid mistakes/assumptions made below. However maybe someone finds my investigation useful. And most probably any idea or comment will make me realize something new about CRDT :D

First solution

Solution you have suggested above was the first try I've made in order to solve the problem. But then I came to conclusion that it doesn't work properly. Maybe (most probably?) because of my poor understanding of Automerge algorithms, please let me know if so - I would be very happy :D

Here are problems from my point of view, based on my understanding of the Automerge algorithm: If two clients concurrently reffer column 100 then history would be

{ // Initial
  "columns": []

{ // Client1 does lookup, can't found column 100 and generates it
  "columns": [{emptyBefore: 99, ID: ID_COL_100} <= this node has own UUID1],

{ // Client2 does lookup, can't found column 100 and generates it
  "columns": [{emptyBefore: 99, ID: ID_COL_100} <= this node has own UUID2],

Now during the history merge Automerge has no much options as two clients generated two different objects and once changes got merged the final object becomes: "columns": [{emptyBefore: 99, ID: ID_COL_100}, {emptyBefore: 99, ID: ID_COL_100}]. Not desirable outcome. Most probably could be solved somehow by removing duplicates. But seems it is quite complicated as there are "cells" that already "assigned" to some column. A such object graph lays on top of the history graph and whoel thing is definetely something I wouldn't want to reconcile

Also what if clients was reffering different columns?

{ // Client1
  "columns": [{emptyBefore: 9, ID: ID_COL_10} <= this node has own UUID1],
....
{ // Client2
  "columns": [{emptyBefore: 99, ID: ID_COL_100} <= this node has own UUID2],

It is desirable to keep both insertions. Order is important, however from automerge point of view it is all concurrent insertions, so both outcomes listed below are fine:

[{emptyBefore: 9, ID: ID_COL_10}, {emptyBefore: 99, ID: ID_COL_100}]
[{emptyBefore: 99, ID: ID_COL_100}, {emptyBefore: 9, ID: ID_COL_10}]

But it is clearly different from the columns structure perspective. Also one of thouse records should be adjusted during the merge to fix emptyBefore in the context of the previous insertions. Even if order could be "restored" with using ID_COL (10 < 100) doing that and fixing emptyBefore feels a kind of cheating on Automerge core principles and my guess: a such solution will screwed up sooner or later. At least it won't be nice in code, actually I don't even see how to do sort+emptyBefore modifications in terms of Automerge API :(

Am I wrong here?

Logboot based solution

Another solution comes up in my head is a try to use something like logboot:

{
"columns_order": ["id1"],
"columns_allocation": {
  "1": { 
       order: "id1", // that is column "visual" order. BTW could it be a "Pointer"?
        ...some props
   },
...

So now if two clients are reffering same column, lets say "100", they are fighting for the same key in the map, so automerge will keep 1 key instead of multiple (am I right?). Now if client wishes to add new column, after 1st, he could does that by generating a key "1.<client_uuid>" and that would allocate a brand new column other clients could reffer only after receive its history.

Profit. After any column a new one could be inserted, also I have 2^64(32?) "predefined" columns any client could reffer. The price for that is long-long keys and additional UUID per map entry (am I right about how Map does things inside?)

Block-Wise RGA

Constantly growing keys like in logboot looks a bit akward for me. Also it won't get nice compression in a column-based storage. So another solution I came up with was a BlockWise RGA based:

{// inital
"columns" [hole<size = Inf, ID=UUID:clock:0>] // infinite "hole", just a placeholder for indecies

{// client1: reffers predefined column 10
"columns" [hole<size = 9, ID=UUID:clock:0>, column<ID=UUID:clock:10>, hole<size=Inf, ID=UUID:clock:11>]

{// client2: reffers predefined column 10
"columns" [hole<size = 9, ID=UUID:clock:0>, column<ID=UUID:clock:10>, hole<size=Inf, ID=UUID:clock:11>]

{// client3: reffers predefined column 100
"columns" [hole<size = 99, ID=UUID:clock:0>, column<ID=UUID:clock:100>, hole<size=Inf, ID=UUID:clock:101>]

{// client4: adds brand new column after 40
"columns" [hole<size = 40, ID=UUID:clock:0>, column<ID=ClientUUID:ClientClock:0>, hole<size=Inf, ID=UUID:clock:40>]

Merge that 4 histories following BlockRGA algorithm results model that convergence and also keeps both intentions:

  • predefined columns reffered by concurrent clients has the same ID, so content will be recursively merged by Automerge
  • new columns are unique

It seems to be an elegant solution for this problem. Also Automerge could benefits from it by:

  • More compact text representation as metadata payed only for a block of text (I guess it is not actual since column-based storage)
  • Arrays with fixed indexing (it feels more or less like JavaScript holes - is there anyone except me who needs a such thing???)
  • Fun? :)

All benefits are quite dubious but maybe Automerge would be interested in a such functionality? If yes I would appreciate some help to understand what is the best way to add BlockWiseRGA CRDT into Automerge codebase. If no then any better suggestions then Logboot are warmly welcome. If no more options then, I guess, Logboot-style solution is the best I could found :(

xphoenix avatar May 25 '21 21:05 xphoenix

@xphoenix Sorry for the delayed reply. I'm afraid I don't have any better ideas here either. My sense is that very large sparse matrices are a rather specialised datatype that is not required by many applications, so I think you would be better off implementing this outside of Automerge.

Spreadsheets are also sparse matrices, but they generally put their contents in columns starting with A and in rows starting with 1, so for spreadsheets it's feasible to explicitly create list elements for the first n columns and the first m rows, such that these rows and columns contain all the nonempty cells. I would argue that spreadsheet apps therefore don't need the complicated constructions you are proposing.

ept avatar Jul 06 '21 14:07 ept