3-way sync service
Description
As an architect, I want guaranteed consistency between our existing user databases. We cannot have a simple single source of truth, because the stack includes third-party products that rely on their own sources of truth. To be exact, I want reliable consistency between Alkemio user profiles, Kratos identities and the Synapse user database, which cannot be broken by the endless number of edge cases. This will make the system less fragile and more secure, and will let us reduce the amount of code in different parts of the system once we can rely on this consistency.
Acceptance criteria
First priority
- [ ] When a user identity is created in Kratos, a corresponding user is automatically created in the Alkemio DB. This is guaranteed and does not depend on the availability of the Alkemio server at the moment the Kratos identity is created, or on any other edge condition. It is reflected in the Alkemio DB in 100% of cases, no matter what.
- [ ] When a user profile is created in the Alkemio DB, a corresponding user is automatically created in Synapse. This does not depend on edge cases such as the availability of Synapse, and is likewise guaranteed in 100% of cases.
Second priority
- [ ] When a user is removed/disabled in Kratos, all OIDC sessions that belong to this user are deleted. Guaranteed in 100% of cases, regardless of the current state of the backend.
- [ ] ? Possibly the user is also removed from Synapse (subject to discussion).
- [ ] When a user is logged out (Kratos session removed), the corresponding OIDC session is also removed/invalidated. Guaranteed in 100% of cases, regardless of the current state of the backend.
Third priority
- [ ] When a user changes their name in the Alkemio interface, the change is reflected in Kratos and, possibly, Synapse (subject to discussion). Guaranteed in 100% of cases, regardless of the current state of the backend.
- [ ] When a user changes their name via Kratos, the change is reflected in the Alkemio DB and, possibly, Synapse (subject to discussion). Guaranteed in 100% of cases, regardless of the current state of the backend.
- When a user changes their name in a Matrix client, then what? Do we enforce any reflection?
Additional Context
We have the following boundary conditions:
- When a user is created in Kratos, the only subsystem guaranteed to be working at that time is the DB service used by Kratos (otherwise user creation would fail).
- We do not control the Kratos code, and we should not touch the Kratos DB manually, for a number of reasons.
- When a user is created in the Alkemio DB, the only subsystem guaranteed to be working at that time is the DB service used by the Alkemio server (otherwise user creation would fail).
- We control the Alkemio DB and the Alkemio server code.
- When something is changed in Synapse, the only subsystem guaranteed to be working at that time is the DB service used by Synapse (otherwise the modification would fail).
- We do not control Synapse, and we should not touch the Synapse DB manually, for a number of reasons.
- The user experience must be non-blocking. That is, when a user logs in for the first time and Synapse is not accessible for some reason, the user flow should not be blocked waiting for Synapse to come back online. The same applies to Kratos and the Alkemio server.
Solution:
A tiny sync service (working title: Syncer) with persistent queues that are, effectively, local to the services/databases.
- Syncer does not touch the databases of Kratos, the Alkemio server and Synapse itself, but operates via their APIs.
- As we don't control Kratos, Syncer should sit next to Kratos as a sidecar, which guarantees that it is available whenever Kratos is running.
- Syncer registers a webhook in Kratos, which is called when a new user is added (and, later, when the name, email or anything else is changed). All the webhook does is add the Kratos ID of the newly created identity to the "kratos_outbox" table (which resides on the same DB service as Kratos). This ensures that every Kratos user creation is reflected in "kratos_outbox" with nearly 100% probability (same DB, same pod, etc.). Another possible design choice is to separate the webhook sidecar (this part MUST be a sidecar) from Syncer, which then lives in a separate deployment. Structurally I consider this the better solution, although I would still keep Syncer and the webhook in a single repo and just build two containers: an extremely tiny sidecar and a tiny Syncer.
- Syncer has a worker which "monitors" the kratos_outbox table (there are different approaches depending on the database used; Postgres would be better, as it has listeners/notifiers, but it can also be done with MySQL). When there are new records (one or many) in kratos_outbox, it starts creating the user(s) in the Alkemio server (using the internal REST API trigger introduced in #5582). If, and only if, the new user has been successfully added to the Alkemio server, the authId is removed from kratos_outbox; if not, typical retry flows ensure it happens as soon as the Alkemio server is available. This gives us a persistent queue that is independent of the state of the system at the moment the Kratos identity is created, and guaranteed propagation of that identity to the Alkemio server. Of course, on start the worker always checks whether there are any remaining records in kratos_outbox and handles them.
- For creating records in the "alkemio_outbox" table (which resides on the same DB service as the Alkemio DB and keeps the userIds of newly created users), we have two options (design choice): a) a trigger on the Alkemio DB fires when a record is added to the "user" table and adds a record to the "alkemio_outbox" table; b) the Alkemio server creates the record in the alkemio_outbox table in the same transaction as the creation of the user. In both cases we have a 100% guarantee that a record is added to "alkemio_outbox" if the user was added to the Alkemio DB. In the general case I'd prefer b) as the better and cleaner solution, but with TypeORM a) might be the better way.
- Syncer has another worker which monitors the "alkemio_outbox" table for records indicating that a new user was created in the Alkemio DB. It then reflects them in Synapse, in the same way as the kratos_outbox worker described above. With a 100% guarantee, regardless of conditions.
- We can also add manual or scheduled triggers for consistency checks and resynchronization, in case there is an edge case that still cannot be handled by this approach. All the infrastructure is already there, and all those checks need to do is populate the respective outboxes with the missing IDs. The rest is taken care of automatically.
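The consistency check from the last point can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: SQLite stands in for the real DB service, `enqueue_missing` is a hypothetical helper, and the column names `auth_id` and `user_id` are assumptions. How the two ID lists are obtained (via the Kratos, Alkemio or Synapse APIs) is deliberately left out.

```python
import sqlite3
from typing import Iterable

# Outbox tables and their ID columns. The column names are assumptions
# made for this sketch; only the table names come from the proposal.
OUTBOXES = {"kratos_outbox": "auth_id", "alkemio_outbox": "user_id"}


def enqueue_missing(
    db: sqlite3.Connection,
    outbox: str,
    source_ids: Iterable[str],
    target_ids: Iterable[str],
) -> list[str]:
    """Queue every ID that exists upstream but is missing downstream.

    The check itself never creates users; it only fills the outbox and
    leaves delivery to the regular worker.
    """
    column = OUTBOXES[outbox]  # raises KeyError for an unknown table name
    missing = sorted(set(source_ids) - set(target_ids))
    with db:  # one transaction: all missing IDs are queued, or none
        db.executemany(
            # INSERT OR IGNORE makes repeated checks idempotent.
            f"INSERT OR IGNORE INTO {outbox} ({column}) VALUES (?)",
            [(item,) for item in missing],
        )
    return missing
```

Because the insert ignores IDs that are already queued, the check could run on a schedule or be triggered manually without producing duplicate work.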
With this in place, consistency is always guaranteed and users are present everywhere at any meaningful moment in time.
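To make the outbox mechanics more concrete, here is a rough sketch of both ways to populate alkemio_outbox (trigger, or same transaction) and of a worker pass that removes a record only after successful delivery. It is an illustration under assumptions: in-memory SQLite stands in for the Alkemio DB service, the `user_id` column and all function names are hypothetical, and `deliver` is a placeholder for whatever call creates the user in Synapse.

```python
import sqlite3
from typing import Callable


def open_db() -> sqlite3.Connection:
    # In-memory SQLite is a stand-in for the real DB service (assumption).
    db = sqlite3.connect(":memory:")
    db.executescript(
        'CREATE TABLE "user" (id TEXT PRIMARY KEY, name TEXT NOT NULL);'
        "CREATE TABLE alkemio_outbox (user_id TEXT PRIMARY KEY);"
    )
    return db


def install_trigger(db: sqlite3.Connection) -> None:
    # Option a): the database itself queues every newly inserted user.
    db.execute(
        'CREATE TRIGGER user_to_outbox AFTER INSERT ON "user" '
        "BEGIN "
        "INSERT OR IGNORE INTO alkemio_outbox (user_id) VALUES (NEW.id); "
        "END"
    )


def create_user_with_outbox(db: sqlite3.Connection, user_id: str, name: str) -> None:
    # Option b): the server writes both rows in one transaction, so the
    # outbox record exists if and only if the user row was committed.
    with db:
        db.execute('INSERT INTO "user" (id, name) VALUES (?, ?)', (user_id, name))
        db.execute(
            "INSERT OR IGNORE INTO alkemio_outbox (user_id) VALUES (?)", (user_id,)
        )


def drain_outbox(db: sqlite3.Connection, deliver: Callable[[str], bool]) -> int:
    """Run one worker pass and return the number of delivered records."""
    rows = db.execute("SELECT user_id FROM alkemio_outbox ORDER BY rowid").fetchall()
    delivered = 0
    for (user_id,) in rows:
        try:
            ok = deliver(user_id)
        except Exception:
            ok = False  # target unavailable: treat like a failed delivery
        if not ok:
            continue  # the record stays queued and is retried on the next pass
        with db:
            db.execute("DELETE FROM alkemio_outbox WHERE user_id = ?", (user_id,))
        delivered += 1
    return delivered
```

Calling `drain_outbox` once at startup and then on every notification or polling tick would cover the "check remaining records on start" behaviour. Since a record is deleted only after `deliver` succeeds, a crash between delivery and deletion can cause a repeated delivery, so the receiving side would need to tolerate creating a user that already exists.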
Cost of development: While it might look complex, it is in reality a very simple and straightforward service with a tiny amount of code. With SDD, it can most likely be finished within a day. Then there is extra work on integration into the k8s deployment etc., and some work around creating the repo, GitHub CI etc. (although the workflows themselves are part of the initial writeup).
Impact:
- It allows us to remove fragility from the system (which is present now; over the last week I have already met cases where consistency was lost) and increases rigidity where it matters.
- It allows us to write code under the assumption that consistency holds, reduces the amount of communication between the server and the matrix-adapter, and removes some code from both the server and the matrix-adapter.
- It solves a number of edge cases that we currently leave uncovered, plus additional ones we will face when we open Synapse, and even more once we allow the Alkemio OIDC provider to be used for more than just Synapse.
Areas that will be affected
To be added during the refinement