Local Rollback
Scenario
Support a mode in which changes are applied tentatively to a client. Changes are not communicated to other clients unless the transaction is "committed," that is, changes are applied as though they had been made normally. If the work is "aborted," changes are discarded with no observable impact to this client or other clients.
Requirements
- Both inbound and outbound operation are paused while local operations are accrued
- It maybe ok for some operations to have local side effects, like local local contexts for data stores or dds leaking after rollback
- After rollback there should be no persisted changes to the document from either the original local ops, or the rollback.
- All local changes rolled back should provide appropriate events e.g. inserts should emit deletes on rollback
- Changes should be rolled backed in the reverse order they we're created, like undo
- Rollback failure should close the container without any changes being persisted.
API
A good fit for this feature would be support via order sequentially. Such that on unhandled exception, or explicit return behavior all changes accrued within the order sequentially block are rolled back. The major drawback of order sequentially is that no asynchronous actions can be performed, as the callback is synchronous, and meant to align with javascript turns.
If we need to support Local Rollback in the context of asynchronous operations, we would need a more direct API. However, there are a number of asynchronous api's that will not function correctly when paused, so limiting to a synchronous api will make it unnecessary to handle to those cases. For instance, consensus dds, aliasing, blob operations, etc. If we need to support asynchronous actions with rollback this could be coupled with the existing pause/unpause apis, that on unpause one could choose to rollback all local changes.
Design
I propose we model this change after the existing resubmit logic. Resubmit uses the container runtimes pending state manager to enumerate and dispatch local operations in creation order to dds for regeneration. We would add a new method for rollback which enumerates and dispatches local operations in reverse creation order until all necessary local operations are reverted. I've created a prototype here: https://github.com/microsoft/FluidFramework/compare/main...anthony-murphy:rollback-prototype
The revert logic for DDS can be modeled after the existing logic in dds and undo redo packages. A nice side effect of this change is the standardization of our undo redo interfaces and revert functions in the dds packages themselves.
A number of dds today do not store enough local metadata to rollback. For instance, map does not track the previous value, and sequence does not track the previous annotate values, however that information is available in both case and provided for undo redo, so it will need to be added to the local metadata dds provide when they submit and operation.
Additionally, dds may not support local only changes that do not generate operations. This functionally will need to be added such that dds can do local rollback, without the generation of operations.
The bulk of the work here is the per dds rollback implementation
The subtasks, in order:
- [x] #9344
- [x] #9345
- [ ] #9346
- [x] #9347
can we just call this soemething like rollback, local rollback, pending rollback or really anything other than transactions
can we just call this soemething like rollback, local rollback, pending rollback or really anything other than transactions
:) Yeah, this is probably our last chance to change the name before the cement dries on it. "Local rollback" does capture the fact that the rollback happens in isolation. I've also seen the feature referred to as "Revert to marker," which isn't bad. I was trying to think of something that expresses the fact that we're holding these changes in a kind of suspension. "Local staging"?
I think we can call it rollback for short, but maybe something like Local Delta Rollback, or Pending Delta Rollback for the full name. I don't love Revert to Marker, as we don't have a concept of marker, so it's a bit misleading.
Some notes about requirements and implementation. (I'll add to these as time permits):
- Rolling back should have no visible impact on DDS's. (Non-observable impact, such as empty/orphaned DDS that are ultimately GC'd, is okay.)
- Clients other than the one performing the Rollback shouldn't see changes, e.g., content shouldn't appear on other clients and then disappear when rollback happens. The whole operation should be local. That implies pausing outbound messages until accept/rollback happens.
- Should inbound messages be paused until accept/rollback happens? I.e., on accept, should the accepted ops be released all at once, as though they'd all happened at the time "accept" happens? Or should we make some attempt to interleave these ops with remote ops that happened over the course of the operation? Interleaving seems fiendishly complicated. I'd argue for the simple solution.
- If ops are not going to be interleaved with remote ops, we could consider coalescing them, either as they arrive or on "accept." See https://github.com/microsoft/FluidFramework/issues/1912.
- On reject, we'll want to fire "undo" events in the reverse order of the original ops.
- Rollback operations should be considered common/non-exceptional. Partners that make use of them may open many such operations in an ordinary session. As such, perf of operations that result in "accept" should be on the order of the same ops performed without the option of rolling back. "Reject" can be considered exceptional, and a perf hit is acceptable in such a case.
Rollback implementation candidate 1: On starting a rollback-able operation, clone the DDS's in the container and make modifications to the clone. On accept, release the ops, discard the original DDS's, and replace them with the clones. On reject, discard the clones.
Pros:
- Conceptual simplicity
- Cloning is non-trivial to implement, but it's a capability that Fluid may want anyway
Cons:
- Silent clone-and-replace violates item 5 above
- Cloning is likely an expensive operation that would need to be performed on beginning a rollback-able operation, regardless of the outcome (accept/reject), so it violates item 6 above
This is interesting. I had a very different idea about implementation. Basically, dds support a method like resubmit, but instead it tells the dds to rollback/undo the changes. Then rollback is a global operation, likely on container runtime, but maybe delta manager. We'd probably limit it to only being called while outbound is paused, i'm not sure inbound being paused or not really matters that much, but it there is complexity in dds we could start with that limitation as well.
Additionally, i'd like to see this integrated with orderSequentialy in some way such that instead of closing the container we allow rolling back the changes on exception
prototype: https://github.com/microsoft/FluidFramework/compare/main...anthony-murphy:rollback-prototype
@scarlettjlee, @pleath, @vladsud, @DLehenbauer , and @curtisman I added my initial design thoughts above in the main description. Please add any comments to the issue. I think we can dig deeper into the per dds design once we have consensus on the approach.
Both inbound and outbound operation are paused while local operations are accrued
I understood it as: it's okay for ops to be paused. So it's more of a non-requirement that ops are flowing while this is happening. And we choose to do it this way to make it simpler for us. (Unless you're talking about the [outbound] local ops that can be potentially rolled back.)
Changes should be rolled backed in the reverse order they we're created, like undo
Is this a requirement or an implementation detail? Would it be okay to get back to the starting state with some other set of changes that are not specifically the matching sequence of undone changes?
What exactly is happening with the ops in the design? At the end of the process, if no rollback happens, we send the local ops? and if rollback happens we do not send the local ops + rollback ops?
@scarlettjlee Good points.
I think we agree outbound must be paused, otherwise the ops my not be local anymore, so can't be rolled back. I'd originally thought we could allow inbound, but it will make the dds changes harder, as we need to track more information. For instance, in map, if a remote op sets a key that is modified locally, we suppress the remote op as we know the local op will win. That will no longer be the case with rollback. If we go with order sequentially, inbound will also be paused by design, as nothing async can happen. Maybe requirements was not quite the right section name there, but it was meant to convey the things required for this to feature to work as designed here, but we could also loosen those requirements at a later date if it makes sense.
The rollback order is a requirement for the feature. we go to great length the maintain ordering across all our features, as there may be dependencies between changes, like change a is a precondition for change b, and it may not make sense for change b to exist without change a, so in rollback we need to undo things in reverse order that they were done. This is especially important when changes like a and b could be on different dds, or different datastores, which maybe have complex relationships.
We have similar guarantees for resubmit, where we maintain the original order the changes were made on rebase and resubmit. Its not something we do today for resubmit, but it may be possible to do some optimization, like allowing dds to coalesce adjacent ops. For example, if we have ops 1-10, and ops 5-7 all target the same dds, you could give the dds all those ops at once, as we know no interleaved and possibly dependent changes were made, and the dds could coalesce.
for instance, if they were all adjacent inserts in a shared string("c","a","t"), the share string could return a single op that inserts them together ("cat"). But even the dds may have internal limitations. for instance, the eventing model of the shared string does not support combined operations, each event is an insert, delete, or annotate so it could only coalesce local operations of the same type.
If there is no rollback the ops send as normal. After rollback there should be no persisted changes to the document, so the ops are just thrown away, and ideally no ops are created by the rollback process, it is all just local dds changes. On rollback the dds should also clean up whatever state it has that was tracking the original op for acknowledgement.
Some observations / questions / clarifications based on reading this issue:
- "It maybe ok for some operations to have local side effects" - it would be great to clearly spell out what happens if DDS was created and modified in cancelled transaction. I assume attach op would not be undone, but all other ops would be removed.
- Please note that data store creation today includes initialization (i.e. initial "attach" snapshot may contain handles to other data stores / DDSs that were created inside of transaction). Given that data stores can be named (root), they can be visible after rollback, and their state might be in "broken" state. Andrei's work to move to aliasing should help here - we should plan on not supporting creation of named objects by completion of this feature, and aliasing op to be cancelled on rollback.
- I think with this feature it becomes even more important to disallow modifying data model from event callbacks/listeners (basically no reentrancy). It's great change either way.
- RE async APIs (you mentioned consensus data structures, I'd add aliasing here) - I believe we should convert all of them into fire-and-forget + events. This is required for proper support of offline scenarios, and I think it will make this flow simpler. You would still not be able to observe completion of these calls as they require ops on the wire
- In general, I'd say use of consensus data structures should not influence this design. I.e. whether we allow async flow for transactions or not is orthogonal to what we do in this space.
- Pausing inbound ops is a must, if we want to ship it any time soon :) We can consider relaxing it after that feature ships and there is need (and we can estimate amount of work and feasibility)
- Reverse order is a must in current system. That said, I believe this will not be the case in the future - we will have to implement more coarse change notifications, at the level of container (consider result of 3-way merge - the result of it is new document state, and it's impossible to disassemble it into individual atomic changes like we have today; and for perf reasons, under pressure, sooner or later we will have to coalesce ops into more granular updates, similar to how games skip frames if they are not catching up).