text icon indicating copy to clipboard operation
text copied to clipboard

Detect inconsistent initial steps in yjs and require page reload

Open max-nextcloud opened this issue 1 year ago • 1 comments

Is your feature request related to a problem? Please describe. In case two users start a new yjs session at the same time they need to initialize it with the initial ydoc content to create a consistent world view. https://github.com/nextcloud/text/pull/5589 tackles this. If for some reason the two clients come up with a different initial doc state we end up with a "split brain" situation - where both work on a similar yet different document and fail to sync fully.

Planned solution

  • Push the doc state as soon as we have it to reduce the timespan in which conflicts may occure.
    • Use a new endpoint to save the document state only without the autosave content.
    • Use a separate wrapper in the SessionApi on the client side.
    • Store documentState but no autosaveContent to avoid writing unchanged documents.
  • In case a conflicting documentState already exist:
    • Log the information this happened.
    • On the client handle the error response with an appropriate message and ask the user to reload the page.
  • backport to stable28 - as that's how far #5589 was backported.

Tasks

  • [ ] New API endpoint that only stores the initial document state.
  • [ ] Wrapper in SessionApi for that endpoint.
  • [ ] Test the API endpoint from cypress api tests.
  • [ ] Send the initial document state to the server as soon as we have it.
  • [ ] Handle errors on the client side.

Acceptance Criteria

  • If one client connects a following client will receive an initial yjs document state shortly after (< 5sec.)
  • If two clients connect within the same timeframe and the initial yjs document state matches they continue without interruption.
  • If two clients connect within the same timeframe and the initial yjs document state differs one gets to continue - the other one displays an error message that instructs the user to reload the page.

Open questions

  • Does this interfere in somehow with
    • read only sessions
    • reconnecting
    • reconnecting after a session cleanup

max-nextcloud avatar Apr 23 '24 12:04 max-nextcloud

I think this issue just changed from being theoretical to having been practically observed: grafik

I used this script: https://github.com/nextcloud/text/blob/investigate/inconsistencies-in-document/src/tests/investigate-corruption.spec.js#L41-51

So what the screenshot is printing is tr.afterState - which seems to be the state vector of all clients after processing that transaction.

Two things jump out here:

  • The clock for client 0 increased. This special client ID is used for the initial step. The initial step should be idempotent and only be applied once.
  • The number of clients increased a lot in a single step. Many of the newly connected clients had quite a few steps as well.

So it looks like two editing sessions ran in parallel for a while here before they were 'mixed'. On possible cause for this would be different initial state steps - which would lead to a split brain situation - where one client does not see the other clients changes and vice versa.

max-nextcloud avatar Jun 20 '24 13:06 max-nextcloud