automerge-classic
Clearing History
Any thoughts regarding clearing the history and changes of a document, leaving only the data? As I understand it, history should persist since it is needed to resolve possible conflicts. In a collaborative editing scenario, a client edits a document and eventually hits save/close to end the editing session. Maybe there could be a node/server that is responsible for authorization and keeps track of open sessions, and once the number of open sessions drops below 1, it clears the metadata. But how do we account for a client session that starts and is never closed due to a failure?
It is the nature of a distributed system that you can never guarantee a peer won't turn up with some old state asking to merge it. Perhaps it was on a laptop you left on a shelf for a year. That said, although Automerge doesn't currently have any history-collapsing capabilities, an application developer could always choose to make a new document containing the old document's state.
If you think this is something important I know we've discussed it in the past and a patch would likely be welcome. If you do pursue the project, I would recommend you post your proposed design here for comment first.
I think it's actually a desirable feature to keep as much history as possible, since it can be a useful feature in its own right (a sort of poor man's version control). So I would rather put effort into making it efficient to store the whole history than into clearing the history.
That said, there is one case where it's desirable to remove all history: namely if you want to send a document to a new collaborator, and you don't want to give them all the past versions of the document (perhaps because old drafts contained embarrassing stuff that was later deleted). In this scenario, I can see it being useful to be able to "flatten" a document to contain only its current state, and none of the history. If you want to try implementing that feature, I can give you some pointers for getting started.
Some hooks to manage CRDT garbage collection would be useful. Different kinds of policies could be envisaged, so some research and design before coding might be useful.
Any pointers on how to do this? I would love to get this feature because currently the saved doc with complete history means I end up sending a lot of data across the wire. Ideally, I would like to keep the complete history on the server, and send only the latest copy to the client when the client loads the first time, and from then on whatever updates the client sends back to the server they are merged into the complete history stored in the server.
I have a strong feeling this feature would help in the implementation of other cases such as forking and channels that you mentioned in #31.
Any further thoughts on this issue? What if there was a "trailing" version that has a snapshot and ids or vectors for the dependencies? Any actions that happened before the trailing version can be safely discarded, and new nodes use the trailing version as the starting point.
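To make the "trailing version" idea concrete, here is a minimal sketch in plain JavaScript. It is not Automerge's API: the op shape (`actor`, `seq`, `key`, `value`), the helper names `compact` and `isCovered`, and the last-write-in-log-order reducer are all assumptions made for illustration, and the sketch sidesteps real conflict resolution entirely.

```javascript
// Hypothetical sketch, not Automerge's API. Each op carries the actor that
// produced it and that actor's sequence number.

// Fold a log of ops into a trailing version: a snapshot plus a vector clock
// recording, per actor, the highest sequence number already folded in.
function compact(log, applyOp, initialState = {}) {
  const clock = {};
  let snapshot = initialState;
  for (const op of log) {
    snapshot = applyOp(snapshot, op);
    clock[op.actor] = Math.max(clock[op.actor] || 0, op.seq);
  }
  return { snapshot, clock };
}

// An op is covered by the trailing version (and so can be discarded) when its
// sequence number is at or below the clock entry for its actor.
function isCovered(trailing, op) {
  return op.seq <= (trailing.clock[op.actor] || 0);
}

// Toy reducer: a later op in the log simply overwrites the key.
const setKey = (state, op) => ({ ...state, [op.key]: op.value });

const log = [
  { actor: 'a', seq: 1, key: 'title', value: 'Draft' },
  { actor: 'b', seq: 1, key: 'body', value: 'Hello' },
  { actor: 'a', seq: 2, key: 'title', value: 'Final' },
];
const trailing = compact(log, setKey);
```

A new node would start from `trailing.snapshot` and use `trailing.clock` to skip ops it is sent again. What this sketch does not address is a peer that shows up with ops concurrent to the discarded history, which is the hard part of the design.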
I think the idea of providing a "shallow clone" is an interesting and reasonable approach, but I don't think anyone has designed it or thought through how it would have to work.
Overall I think it should work, and it seems like a good feature, so if you implement it I at least am happy to give it a review and merge it.
@salzhrani just wondering if you've made any progress on this? I'm attempting to use Automerge with Slate and am beginning to think about this very idea.
To get the ball rolling, here's one idea I've thought about so far:
Idea:
- Have shallow/trailing copies (versions) indexed by the global clock. Every X operations/Y time period, save the current version as the latest snapshot N-1 and create a new current version N by initializing the new Automerge document with the latest JSON. In this, I may use snapshot/history/version interchangeably.
Structural changes:
- Every snapshot of the Automerge document (history) works off the same clock - the global clock. Seems like this is a must-have.
- Regarding DocSet:
  - Have DocSet manage the retrieval of the history by changing this.docs from a Map: docId -> document to a Map: docId -> List (of snapshots), i.e. this.docs is a Map from the docId to a List of Records. Each Record contains a way to identify the history (timestamp or clock) and the document.
  - The DocSet by default returns the latest document (thus avoiding the need to change the existing API).
  - May need to update the function applyChanges to handle the various cases below.
  - Create a function to create a new version.
- In Connection:
  - Update maybeSendChanges to handle the various cases below.
- In OpSet:
  - Hmm... not sure. Hopefully we won't need to make any changes to this.
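The DocSet change described above could be sketched roughly as follows. This is plain JavaScript, not Automerge's actual DocSet: the class name `SnapshotDocSet`, the methods `getDocAt` and `createVersion`, and the use of a simple integer as the clock are all assumptions for illustration.

```javascript
// Hypothetical sketch of the proposed structure; not Automerge's DocSet.
class SnapshotDocSet {
  constructor() {
    // docId -> array of records { clock, doc }, oldest first
    this.docs = new Map();
  }

  // Default accessor returns the latest snapshot, so callers that only know
  // "one document per id" keep working.
  getDoc(docId) {
    const records = this.docs.get(docId);
    return records ? records[records.length - 1].doc : undefined;
  }

  // Look up the snapshot recorded at a given clock value, if any.
  getDocAt(docId, clock) {
    const records = this.docs.get(docId) || [];
    const record = records.find(r => r.clock === clock);
    return record ? record.doc : undefined;
  }

  // Start a new version from plain JSON state (deep-copied so later edits to
  // the caller's object do not leak into the stored snapshot).
  createVersion(docId, clock, state) {
    const records = this.docs.get(docId) || [];
    records.push({ clock, doc: JSON.parse(JSON.stringify(state)) });
    this.docs.set(docId, records);
  }
}

const docSet = new SnapshotDocSet();
docSet.createVersion('doc1', 1, { text: 'old draft' });
docSet.createVersion('doc1', 2, { text: 'current' });
```

Keeping the records ordered oldest-first makes "latest" a constant-time lookup, while older snapshots stay reachable for the reconnect cases.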
Different cases to think about:
- When a new client joins or an old client refreshes, the client should send the null clock, causing Connection on the other client/server to return the current document N (this is what currently happens).
- When a client that has a very old document, with a clock associated with the Mth snapshot and no changes, connects/re-joins:
  - The client should send over their clock, causing Connection on the other client/server to return the current document (version N).
  - Another idea is to send all the changes from the document to the current time (this is what I think currently happens).
- When a client has a very old document with a clock associated with the Mth snapshot, with changes C (this is the tricky part):
  - The client should send over their clock with the changes.
  - Using the clock, Connection on the other client/server retrieves the Automerge document referring to the clock (version M).
  - Merge M + C with N.
  - Get the changes to go from M+C to N using Automerge.diff.
  - Basically, try to do what Automerge would normally do.. just first create the document that Automerge would have to have first.
  - Send the changes back to the client.
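The three cases above amount to a small decision on the receiving side. The sketch below only shows that dispatch, in plain JavaScript; the function name `planSync`, the request shape, and the use of snapshot numbers in place of real clocks are assumptions, and the actual merge in the third case is left out.

```javascript
// Hypothetical dispatch for the three cases; snapshot numbers stand in for
// real clocks, and no actual merging is performed here.
function planSync({ clientVersion, pendingChanges }, latestVersion) {
  // Case 1: new or refreshed client sends the null clock.
  if (clientVersion == null) {
    return { action: 'send-current', reason: 'null-clock', version: latestVersion };
  }
  // Case 2: client is on snapshot M but has no local changes.
  if (pendingChanges.length === 0) {
    return { action: 'send-current', reason: 'no-local-changes', version: latestVersion };
  }
  // Case 3: client is on snapshot M with changes C. The receiver would need to
  // load version M, apply C, merge with N, and send back the difference.
  return {
    action: 'rebase',
    base: clientVersion,
    onto: latestVersion,
    changes: pendingChanges,
  };
}

const fresh = planSync({ clientVersion: null, pendingChanges: [] }, 5);
const stale = planSync({ clientVersion: 2, pendingChanges: [] }, 5);
const diverged = planSync({ clientVersion: 2, pendingChanges: ['c1', 'c2'] }, 5);
```

Only the `rebase` branch needs the older snapshot to still exist on the receiver, which is why the snapshots have to be retained rather than discarded.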
Thoughts? I can begin working on this if it seems that it might work.
I'm hoping someone will confirm @vshia's approach above, as I would very much like to see this feature implemented :)
I have a branch that is in progress here: https://github.com/humandx/automerge/pull/1 .
This should handle the first 2 cases above. However, it currently does not handle the 3rd case, which is the more difficult one. Right now, if a client is on an older version with changes that are not on the newer version, those changes are ignored/wiped when the client receives the newer version. I'm currently not sure whether or how this case can be addressed well.
To properly handle versioning of documents, we should figure out a good way to do this. Otherwise, this method may not work well for certain network topologies (e.g. p2p).
Has there been any progress since 2018?
We are getting a lot better at storing history efficiently, as discussed in #253, so the need for clearing history entirely is less pressing. Nevertheless, as I have said previously, I understand the need for clearing history for privacy purposes. I think this is something to implement on the performance branch, which is redesigning the internal data structures in preparation for Automerge 1.0.
Oh, I should also mention — one simple way of clearing history is to create a new document with a copy of an existing document's state, e.g. like this:
let clone = Automerge.from(JSON.parse(JSON.stringify(doc)))
However, this new document will not have the ability to merge any subsequent changes made to the old document, or vice versa.
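Separately from the lost merge ability, the JSON round trip itself only preserves values that JSON can represent. The snippet below is plain JavaScript with no Automerge involved, using a made-up object, and shows that the copy is independent of the original but that a Date comes back as a string.

```javascript
// Plain-JS illustration of the JSON round trip used in the clone above.
const original = { title: 'Notes', due: new Date(0), items: [{ done: false }] };
const copy = JSON.parse(JSON.stringify(original));

// The copy is deep: changing it does not touch the original.
copy.items[0].done = true;

// Non-JSON types are not preserved: the Date is now an ISO 8601 string.
const dueType = typeof copy.due;
```

If a document holds such values, they would need to be converted back by hand before building the new document.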