matrix-rust-sdk icon indicating copy to clipboard operation
matrix-rust-sdk copied to clipboard

Olm sessions can wedge when using multiple Clients (e.g with NSE process)

Open kegsay opened this issue 1 year ago • 19 comments

I've been looking into bug reports on Element X, and this appears on iOS: (snipped a lot of data)

App:
2024-02-06T07:23:56.293963Z DEBUG matrix_sdk_crypto::machine: Received a to-device event | crates/matrix-sdk-crypto/src/machine.rs:1117 | 
2024-02-06T07:23:56.296398Z DEBUG matrix_sdk_crypto::machine: Received a decrypted to-device event | crates/matrix-sdk-crypto/src/machine.rs:975 | 
2024-02-06T07:23:56.298524Z INFO matrix_sdk_crypto::machine: Received a new megolm room key | crates/matrix-sdk-crypto/src/machine.rs:755 | 
...
2024-02-06T07:23:56.300586Z DEBUG matrix_sdk_crypto::machine: Received a to-device event | crates/matrix-sdk-crypto/src/machine.rs:1117 | 
2024-02-06T07:23:56.301713Z DEBUG matrix_sdk_crypto::machine: Received a decrypted to-device event | crates/matrix-sdk-crypto/src/machine.rs:975 | 
2024-02-06T07:23:56.302390Z INFO matrix_sdk_crypto::machine: Received a new megolm room key | crates/matrix-sdk-crypto/src/machine.rs:755 | 

NSE:
2024-02-06T07:46:24.132801Z DEBUG matrix_sdk_crypto::machine: Received a to-device event | crates/matrix-sdk-crypto/src/machine.rs:1117 | 
2024-02-06T07:46:24.133377Z DEBUG matrix_sdk_crypto::olm::account: Creating a new Olm session from a pre-key message | crates/matrix-sdk-crypto/src/olm/account.rs:953 | 
2024-02-06T07:46:24.133655Z WARN matrix_sdk_crypto::olm::account: Failed to create a new Olm session from a pre-key message: InboundCreation(MissingOneTimeKey("curve25519:3ZuWFzabqiB1bYH1iE/dys5sCxSh8/8KKgo1iNZxsE8")) | crates/matrix-sdk-crypto/src/olm/account.rs:1220 | 
2024-02-06T07:46:24.135368Z INFO matrix_sdk_crypto::session_manager::sessions: Marking session to be unwedged sender_key="curve25519:KWr4ACcEDQ/+VVP7/gHSNGDv1hu9NA4GZZqVckzAgEg" | crates/matrix-sdk-crypto/src/session_manager/sessions.rs:116 | 

The important thing here is that the session was merrily decrypting to-device events fine, until the NSE process took over. At that point, it failed to find the session, and then failed to decrypt the event (due to the sender chain index being >0 I guess) which then marked the Olm session as wedged.

I believe the problem lies in decrypt_olm_message because the NSE process fails to find the session in existing_sessions, implying a cache miss when it should have hit.

kegsay avatar Feb 08 '24 13:02 kegsay

Thanks for opening an issue. Indeed it looks like the session cache in the NSE isn't being reloaded, for some reason, and addressing https://github.com/matrix-org/matrix-rust-sdk/issues/2624 might help with this.

bnjbvr avatar Feb 08 '24 17:02 bnjbvr

@kegsay do you see this line in the Rust logs, out of curiosity?

error when extending lock lease:

bnjbvr avatar Feb 08 '24 17:02 bnjbvr

No I do not see this log line. I did trace the "crypto store mismatch" stuff to see if it was ever wrong, and based on my reading of it, it was always very sane and reasonable.

kegsay avatar Feb 09 '24 12:02 kegsay

Presumably this only affects Element X iOS, because that is the only platform that uses two processes?

richvdh avatar Feb 22 '24 12:02 richvdh

Yes. Web with multiple tabs would also be similarly affected.

kegsay avatar Feb 22 '24 15:02 kegsay

I don't understand why we can't just disable the in-memory cache here. This would mean we exclusively do database hits which would work (sqlite supports multiprocess just fine: https://www.sqlite.org/faq.html#q5 ) If there are performance concerns, can we just profile it and see if that's indeed the case?

kegsay avatar Feb 29 '24 15:02 kegsay

For what it's worth, it's not only about having atomic transactions and enabling the WAL in sqlite. A few reasons off the top of my head:

  1. we need to have some support for doing concurrent writes in indexeddb, and indexeddb doesn't have an equivalent to WAL, as far as I know.
  2. even if we had purely atomic transactions, now we'd need to have wide application-level transactions with rollback. For instance, you receive an encrypted message; decrypting it will move the associated ratchet forward in the Account, which you'll want to read from the DB and store back into the DB after it's moved forward. So you're running an UPDATE Account(...) IF { state is what you expected before, aka it hasn't changed in the database }. If this transaction fails (because the state isn't what you expected, aka another process touched it), you now need to roll back to the previous Account state (and all associated state), and replay all the actions that lead up to the sqlite update.
  3. "just disabling the memory cache" doesn't really make sense: it's not a single cache, it's a set of fields in structs scattered across the code base. There was a project to try to reunify those into an actual single cache structure, but this project has been considered low priority by Element's management and thus dropped.
  4. The in-memory caches have been added exactly because of performance reasons, so we'd revert back to a state where we have performance battles we've already fought and won. We might need more benchmarks to prove if a cache is actually required for performance or not.

bnjbvr avatar Feb 29 '24 15:02 bnjbvr

we need to have some support for doing concurrent writes in indexeddb, and indexeddb doesn't have an equivalent to WAL, as far as I know.

Both firefox and chrome appear to be able to handle locking cross-tabs at the very least when I did some testing to try to trip it up with read-modify-writes, which may be enough?

kegsay avatar Feb 29 '24 16:02 kegsay

we need to have some support for doing concurrent writes in indexeddb, and indexeddb doesn't have an equivalent to WAL, as far as I know.

Both firefox and chrome appear to be able to handle locking cross-tabs at the very least when I did some testing to try to trip it up with read-modify-writes, which may be enough?

As Kegan has said, indexeddb is reasonably resilient to multiple "processes" attempting to write to it at once, although its transaction semantics are annoying and interact poorly with asynchronous code.

However, I don't think we should concern ourselves too much with indexeddb here; multiprocess behaviour on web is not currently an issue due to https://github.com/element-hq/element-web/issues/25157, and right now I think it's more likely that we'll solve multi-tab operation by having a single worker process which is shared between tabs at the JS level than by trying to make concurrent matrix-sdk-crypto processes truly robust. (Note that the NSE process is an easier problem to solve than the web process, because the NSE process does less stuff. For example, it will never send encrypted messages, which means a whole class of failure modes is ruled out.)

In other words: if indexeddb support were the only thing worrying us here, I wouldn't count that as a good argument.

richvdh avatar Mar 01 '24 14:03 richvdh

(Note that the NSE process is an easier problem to solve than the web process, because the NSE process does less stuff. For example, it will never send encrypted messages, which means a whole class of failure modes is ruled out.)

iOS Share Extension enters the room

(This is another process that will be used to send encrypted messages, for what it's worth.)

bnjbvr avatar Mar 01 '24 14:03 bnjbvr

ugh.

richvdh avatar Mar 01 '24 14:03 richvdh

also, i think NSE /can/ send encrypted messages (via quick reply functionality) - i.e. replying from lockscreen etc.

ara4n avatar Mar 02 '24 15:03 ara4n

(also, if rust-sdk /does/ have to support concurrent access to the same underlying store - e.g. for NSE, it feels really weird not to avail ourselves of that as a feature on EWR, imo, rather than coming up with an entirely different service-worker based architecture that solves the same underlying problem)

ara4n avatar Mar 03 '24 12:03 ara4n

NSE can send messages in EI today, though accidentally, which is what https://github.com/element-hq/element-ios/issues/7751 is talking about - specifically key sharing requests.

kegsay avatar Mar 04 '24 11:03 kegsay

Context Links: IOS/EIR: share and quick reply are disabled https://github.com/element-hq/element-ios/issues/7618 IOS/EIR: Background machine https://github.com/matrix-org/matrix-ios-sdk/pull/1704

BillCarsonFr avatar Mar 04 '24 12:03 BillCarsonFr

also, i think NSE /can/ send encrypted messages (via quick reply functionality) - i.e. replying from lockscreen etc.

For the record, I believe this is currently incorrect: the share extension and quick reply are currently disabled. Though it doesn't entirely change the point here: we need a "proper" fix to that eventually.

(also, if rust-sdk /does/ have to support concurrent access to the same underlying store - e.g. for NSE, it feels really weird not to avail ourselves of that as a feature on EWR, imo, rather than coming up with an entirely different service-worker based architecture that solves the same underlying problem)

hrmrmrmrm you may be right. Web locks make implementing the concurrency in the Rust less horrifying than it used to be.

That said: I don't think we've really figured out the semantics for how different processes can call /sync (and then OlmMachine.receive_sync_changes) and OlmMachine.outgoing_requests without fighting with each other. For example: a particular failure mode might be for both tabs to upload the same batch of OTKs: if one of the keys was claimed in the meantime, we could end up with UISIs. (Though, maybe we have the same problem with single tabs, since we do retries on /keys/upload requests.) We solve this in iOS by having the extensions not generate OTKs, which doesn't really work in web because all processes are equal; if you decide to elect a "master" then you're basically back to web workers.

Anyway: my real point here is that even if we massively improve the concurrency support in the crypto-sdk to the point where the iOS extension processes function correctly (which is a lot of work in itself), that doesn't necessarily get us to the full level of concurrency that web would need.

richvdh avatar Mar 04 '24 12:03 richvdh

Related NSE problem, causing re-use of megolm index (detected as replay attack) https://github.com/element-hq/element-ios/issues/7499

BillCarsonFr avatar Mar 04 '24 13:03 BillCarsonFr

Anyway: my real point here is that even if we massively improve the concurrency support in the crypto-sdk to the point where the iOS extension processes function correctly (which is a lot of work in itself), that doesn't necessarily get us to the full level of concurrency that web would need.

I agree, we may need to be more creative to support multi-tab web (e.g Web Locks) but I don't think this should be a blocker for progress on this issue.

even if we had purely atomic transactions, now we'd need to have wide application-level transactions with rollback. For instance, you receive an encrypted message; decrypting it will move the associated ratchet forward in the Account, which you'll want to read from the DB and store back into the DB after it's moved forward. So you're running an UPDATE Account(...) IF { state is what you expected before, aka it hasn't changed in the database }. If this transaction fails (because the state isn't what you expected, aka another process touched it), you now need to roll back to the previous Account state (and all associated state), and replay all the actions that lead up to the sqlite update.

I think @bnjbvr is saying much the same thing here, that if we want true multi-process then we need additional locks in place.

kegsay avatar Mar 05 '24 15:03 kegsay

It seems like there are still a lot of unknowns here. For example, as well as the case Kegan mentioned in the OP, we've also seen cases where the NSE process appears to reload the OlmMachine from database, and still has a decryption failure. One theory is that outstanding tasks involving the OlmMachine may keep it alive after it's been destroyed, causing the "we have two OlmMachines" again, although exactly how that leads to the observed failure mode remains unclear.

Another theory is that maybe the current cross-process lock is not sound. We're not seeing much evidence of that in the logs though.

richvdh avatar Mar 08 '24 18:03 richvdh

https://github.com/matrix-org/matrix-rust-sdk/issues/3313 is likely a big cause of this.

richvdh avatar Apr 09 '24 15:04 richvdh

@richvdh @andybalaam and myself have identified the root cause on this.

We re-use the SqliteCryptoStore when we regenetate_olm but the store contains a cache called session_cache which never gets updated/changed. I'm not sure how this has ever worked correctly. It seems in theory that the wedging will reliably happen if the following are true:

  • the nse process has started and loaded sessions
  • the app process does something to change the session (e.g sending a message after receiving one)
  • the nse process gets a 2nd message

This matches our minimal working example which does Bob receive -> Alice send -> Bob receive (this wedges the session, this assumes for each message there are new olm messages aka a rotation period of 1 msg). If we remove the cache the test passes.

We hypothesise it may also be the root cause of https://github.com/element-hq/element-ios/issues/7751 and https://github.com/element-hq/element-ios/issues/7480

kegsay avatar Apr 18 '24 07:04 kegsay

@richvdh @andybalaam and myself have identified the root cause on this.

Correction: we have identified one root cause of this, which appears likely to cause a lot of the observed cases. If #3313 is real, it is likely another cause of these symptoms.

richvdh avatar Apr 18 '24 09:04 richvdh

I believe the fix for this is finally released today, in Element X iOS 1.6.7.

richvdh avatar May 13 '24 20:05 richvdh