Detect inconsistent initial steps in yjs and require page reload
Is your feature request related to a problem? Please describe. In case two users start a new yjs session at the same time they need to initialize it with the initial ydoc content to create a consistent world view. https://github.com/nextcloud/text/pull/5589 tackles this. If for some reason the two clients come up with a different initial doc state we end up with a "split brain" situation - where both work on a similar yet different document and fail to sync fully.
Planned solution
- Push the doc state as soon as we have it to reduce the timespan in which conflicts may occure.
- Use a new endpoint to save the document state only without the autosave content.
- Use a separate wrapper in the
SessionApion the client side. - Store
documentStatebut noautosaveContentto avoid writing unchanged documents.
- In case a conflicting
documentStatealready exist:- Log the information this happened.
- On the client handle the error response with an appropriate message and ask the user to reload the page.
- backport to stable28 - as that's how far #5589 was backported.
Tasks
- [ ] New API endpoint that only stores the initial document state.
- [ ] Wrapper in
SessionApifor that endpoint. - [ ] Test the API endpoint from cypress api tests.
- [ ] Send the initial document state to the server as soon as we have it.
- [ ] Handle errors on the client side.
Acceptance Criteria
- If one client connects a following client will receive an initial yjs document state shortly after (< 5sec.)
- If two clients connect within the same timeframe and the initial yjs document state matches they continue without interruption.
- If two clients connect within the same timeframe and the initial yjs document state differs one gets to continue - the other one displays an error message that instructs the user to reload the page.
Open questions
- Does this interfere in somehow with
- read only sessions
- reconnecting
- reconnecting after a session cleanup
I think this issue just changed from being theoretical to having been practically observed:
I used this script: https://github.com/nextcloud/text/blob/investigate/inconsistencies-in-document/src/tests/investigate-corruption.spec.js#L41-51
So what the screenshot is printing is tr.afterState - which seems to be the state vector of all clients after processing that transaction.
Two things jump out here:
- The clock for client 0 increased. This special client ID is used for the initial step. The initial step should be idempotent and only be applied once.
- The number of clients increased a lot in a single step. Many of the newly connected clients had quite a few steps as well.
So it looks like two editing sessions ran in parallel for a while here before they were 'mixed'. On possible cause for this would be different initial state steps - which would lead to a split brain situation - where one client does not see the other clients changes and vice versa.