Multi-process support: Implement finer-grained in-memory caching for the crypto store
Currently, our cross-process lock blows up the entire crypto state (`OlmMachine`) whenever we notice another process has written to the underlying storage. This was implemented because we really wanted a working implementation for decrypting notifications on iOS, but it is a shortcut. In particular, short-lived crypto procedures (like verification) could be interrupted if another process takes the lock in the middle; and we don't currently have any locking when writing a new message, which might also happen in another process (see also #1960).
This issue is about fixing and implementing that properly, by having some sort of "cross-process lock guard" representing all the crypto-store data that can be put in a cache. Acquiring this lock may invalidate the cache if another process has written to the underlying database, transparently for the users of the lock. From the point of view of any user of that cache, it's "just" another async, fallible operation.
The idea is to introduce a new `CryptoStoreCache` data structure, which maintains two properties:
- it can only be read by a single process at a time (locked behind the cross-process lock),
- if the lock observed a write to the DB by another process, then the whole cache is invalidated (flushed), since we can't rely on it anymore (see the sketch below).
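To make this concrete, here is a minimal sketch of what acquiring such a guard could look like, assuming the cross-process lock persists a generation counter in the database that is bumped on every write. Apart from the `CryptoStoreCache` name taken from this issue, everything here (`CrossProcessCacheLock`, `StoreError`, `read_generation_from_db`, the exact API shape) is made up for illustration and is not the SDK's actual implementation:

```rust
// Illustrative only: names and shapes other than `CryptoStoreCache` are made up.
use std::collections::HashMap;

use tokio::sync::{Mutex, MutexGuard};

#[derive(Debug)]
struct StoreError;

struct CryptoStoreCache {
    /// The store generation we last observed; bumped by whichever process
    /// writes to the underlying database.
    known_generation: u64,
    /// Cached crypto-store data (placeholder for the real cached fields).
    entries: HashMap<String, Vec<u8>>,
}

struct CrossProcessCacheLock {
    inner: Mutex<CryptoStoreCache>,
}

impl CrossProcessCacheLock {
    /// Acquire the cache guard. From the caller's point of view this is
    /// "just" another async, fallible operation; under the hood it flushes
    /// the cache if another process wrote to the database in the meantime.
    async fn lock(&self) -> Result<MutexGuard<'_, CryptoStoreCache>, StoreError> {
        let mut cache = self.inner.lock().await;
        let db_generation = read_generation_from_db().await?;
        if db_generation != cache.known_generation {
            // Another process wrote to the store: the cached data can no
            // longer be trusted, so drop all of it.
            cache.entries.clear();
            cache.known_generation = db_generation;
        }
        Ok(cache)
    }
}

/// Placeholder for reading the generation counter that the cross-process
/// lock would persist in the database.
async fn read_generation_from_db() -> Result<u64, StoreError> {
    Ok(0)
}
```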
The overall "migration" process is thus the following:
- consider each data type that's cached in memory throughout the entire crypto codebase,
- for each of these data types, put it into the `CryptoStoreCache`,
- refactor usage so as to limit the number of places that will try to read it from the database (e.g. identify static data that won't change over the course of the app's lifetime and set it aside, so that it doesn't need to be invalidated/reloaded); see the sketch below.
Alternatively, we could also drop in-memory caching support for some of these fields, if the performance impact is deemed low enough.
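For illustration, migrating one field (the account is the example used in the tasklist below) could look like the following: the field lives in the cache guard and is only loaded from the database on a cache miss, i.e. on first access or right after an invalidation. `StoreCacheGuard`, `CachedAccount` and `load_account_from_db` are hypothetical names, not the SDK's real types:

```rust
// Hypothetical sketch: read a cached field through the store-cache guard,
// falling back to the database only when the cache has been invalidated.
#[derive(Debug)]
struct StoreError;

struct CachedAccount {
    /// Placeholder for the real account data (identity keys, pickles, …).
    pickled: Vec<u8>,
}

struct StoreCacheGuard {
    account: Option<CachedAccount>,
}

impl StoreCacheGuard {
    /// Return the cached account, hitting the database only on a cache miss.
    async fn account(&mut self) -> Result<&CachedAccount, StoreError> {
        if self.account.is_none() {
            self.account = Some(load_account_from_db().await?);
        }
        Ok(self.account.as_ref().expect("just filled above"))
    }
}

/// Placeholder for the actual crypto-store query.
async fn load_account_from_db() -> Result<CachedAccount, StoreError> {
    Ok(CachedAccount { pickled: Vec::new() })
}
```

Call sites then go through the guard's accessor instead of querying the store directly, which keeps the number of database-reading code paths small.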
Once that's done for all those fields, the final step would be to move the crypto-store lock back into the `OlmMachine` and simplify the public APIs again.
A good testing strategy should also be established, to make sure we can invalidate the cache at any time without interfering with any function calls that may be running in the background.
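One possible shape for such a test (a sketch only; none of the names below are existing SDK tests or APIs): simulate "another process wrote to the DB" at an arbitrary point by bumping a shared generation counter, then check both that an operation already holding the guard keeps its consistent view and that the next acquisition flushes the cache.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

use tokio::sync::{Mutex, MutexGuard};

struct Cache {
    known_generation: u64,
    value: Option<String>,
}

/// Acquire the cache, flushing it if the "database" generation moved.
async fn lock_cache<'a>(
    cache: &'a Mutex<Cache>,
    db_generation: &AtomicU64,
) -> MutexGuard<'a, Cache> {
    let mut guard = cache.lock().await;
    let current = db_generation.load(Ordering::SeqCst);
    if current != guard.known_generation {
        guard.value = None; // flush: another "process" wrote to the store
        guard.known_generation = current;
    }
    guard
}

#[tokio::test]
async fn invalidation_does_not_disturb_in_flight_operations() {
    let db_generation = AtomicU64::new(0);
    let cache = Mutex::new(Cache { known_generation: 0, value: Some("cached".to_owned()) });

    // An operation that already holds the guard keeps its consistent view…
    let guard = lock_cache(&cache, &db_generation).await;
    db_generation.fetch_add(1, Ordering::SeqCst); // simulate a concurrent writer
    assert_eq!(guard.value.as_deref(), Some("cached"));
    drop(guard);

    // …while the next acquisition observes the write and flushes the cache.
    let guard = lock_cache(&cache, &db_generation).await;
    assert_eq!(guard.value, None);
}
```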
Tasklist
Bootstrapping
- [x] Implement basic caching support + async guard
- [x] https://github.com/matrix-org/matrix-rust-sdk/issues/2000
- [x] Move one field, e.g. `ReadOnlyAccount`, so it makes use of the cache
- [x] Make the `ReadOnlyAccount` not `Clone`able (#2710)
- [x] Remove inner mutability from the `ReadOnlyAccount` (#2710)
- [x] https://github.com/matrix-org/matrix-rust-sdk/pull/2680
- [x] Make the
- [x] Create transactional API
- [x] https://github.com/matrix-org/matrix-rust-sdk/pull/2660
- [x] Add a `RwLock` on the `StoreCache` (#2710)
Migrating
Take each field that's cached and move it to the transactional API and store cache.
- [x] https://github.com/matrix-org/matrix-rust-sdk/pull/3799
- [ ] Move Olm `Session`s into the cache
- [ ] Move `GroupSessionCache`/`InboundGroupSession`/`OutboundGroupSession`
- [ ] #2728
- (more items will come here as I discover them)
The final touch
- [ ] Finish line: move the cross-process crypto lock back into the `OlmMachine`, and don't blow up the entire `OlmMachine` whenever the lock observed that another process wrote to the DB
- [ ] Remove the `Backups::room_keys_for_room_stream()` method; replace it with `OlmMachine::store().room_keys_received_stream()`.
- [ ] Replace the `Backups::secret_send_event_handler` with `OlmMachine::store().secrets_stream()`.
- [ ] Implement a stream in the `OlmMachine` that notifies us about the state of our private cross-signing keys.
Just discussed this with @BillCarsonFr. In the quest to kill UTDs, the crypto team is mostly interested in the "Move Olm Sessions into the cache" migration. This is a task they may tackle in Q1, but they still want proof (logs and tests) that one of the problems is coming from here.
We just released UTD analytics on EX. We need to wait a bit longer to get more data.