matrix-rust-sdk
OTK fetching failure woes (attempting to establish Olm sessions too often, sending the wrong ratchet state)
Sending a message in a room is a three-step process (sketched in code after the list):
- Collect all the devices that we don't have an established Olm session with and claim one-time keys for them
- Encrypt a room key for all the devices that are participating in the room and send the key via a to-device message
- Encrypt and send the message
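To make the flow concrete, here is a minimal sketch of the three steps; every type and helper in it is a hypothetical placeholder rather than the actual matrix-sdk-crypto API:

    // Minimal sketch of the three-step send flow. All names here
    // (Device, RoomKey, the helper functions) are hypothetical placeholders.
    struct Device;
    struct RoomKey;

    fn devices_without_olm_session() -> Vec<Device> {
        Vec::new()
    }

    fn claim_one_time_keys(_devices: &[Device]) {
        // Step 1: one /keys/claim round-trip to the homeserver.
    }

    fn share_room_key(_key: &RoomKey) {
        // Step 2: encrypt the room key for every participating device and
        // send it out via to-device messages.
    }

    fn encrypt_and_send(_key: &RoomKey, _plaintext: &str) {
        // Step 3: encrypt the actual event with the room key and send it.
    }

    fn send_message(plaintext: &str) {
        let missing = devices_without_olm_session();
        if !missing.is_empty() {
            claim_one_time_keys(&missing);
        }

        let room_key = RoomKey;
        share_room_key(&room_key);
        encrypt_and_send(&room_key, plaintext);
    }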
Steps 1 and 2 can later be skipped if all the devices have received the room key. Sadly, step 1 can fail to establish an Olm session if a particular device has depleted its one-time keys. This is fixed by introducing a fallback key, but not every device uploads a fallback key yet.
This means that step 1 will be retried for every message we send if a room contains such a depleted device. In test scenarios this isn't terribly slow, but the request to claim one-time keys can take a while. This manifests itself as slowness while we send room messages.
The server doesn't notify us when a device uploads a fresh set of one-time keys, so we'll have to introduce a timeout instead.
Remember when we last claimed keys for each user/device pair and don't try again until a certain timeout has passed. All of this can stay in memory, since we don't expect a large number of dead devices, and retrying when the client restarts sounds sensible as well.
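A minimal in-memory sketch of that idea, assuming a flat timeout and plain strings for the IDs (hypothetical names, not the actual matrix-sdk-crypto types):

    use std::collections::HashMap;
    use std::time::{Duration, Instant};

    // Hypothetical in-memory record of when we last tried to claim OTKs for a
    // given user/device pair.
    struct ClaimAttempts {
        timeout: Duration,
        last_attempt: HashMap<(String, String), Instant>, // (user_id, device_id)
    }

    impl ClaimAttempts {
        fn new(timeout: Duration) -> Self {
            Self { timeout, last_attempt: HashMap::new() }
        }

        // Should we include this device in the next /keys/claim request?
        fn should_retry(&self, user_id: &str, device_id: &str) -> bool {
            match self.last_attempt.get(&(user_id.to_owned(), device_id.to_owned())) {
                Some(when) => when.elapsed() >= self.timeout,
                None => true,
            }
        }

        // Record that we just sent a claim for this device.
        fn mark_attempted(&mut self, user_id: &str, device_id: &str) {
            self.last_attempt
                .insert((user_id.to_owned(), device_id.to_owned()), Instant::now());
        }
    }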
Alternative idea to timeout: use an exponential backoff algorithm for retrying. If fetching OTKs fails on message N, retry on messages N+1, N+2, N+4, N+8, ... The factor should be fine-tuned, of course.
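A rough per-device counter for that message-number backoff might look like this (hypothetical names, and the doubling factor would need the fine-tuning mentioned above):

    // Hypothetical per-device counter for the message-number backoff:
    // after the k-th consecutive failure, skip roughly 2^(k-1) messages
    // before the next /keys/claim attempt for that device.
    struct MessageBackoff {
        failures: u32,
        messages_until_retry: u64,
    }

    impl MessageBackoff {
        fn new() -> Self {
            Self { failures: 0, messages_until_retry: 0 }
        }

        // Called for every outgoing message; returns true if we should
        // include the device in the next /keys/claim request.
        fn should_retry_on_this_message(&mut self) -> bool {
            if self.messages_until_retry == 0 {
                true
            } else {
                self.messages_until_retry -= 1;
                false
            }
        }

        // Called when claiming an OTK for the device failed again.
        fn record_failure(&mut self) {
            self.failures += 1;
            self.messages_until_retry = 1u64 << (self.failures - 1).min(16);
        }
    }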
Also, if it fails on message N but we only manage to fetch OTKs and establish an Olm channel on message N+M, we'll only send the room key in the N+M ratcheted state. We should instead send it in state N.
Alternative idea to timeout: use an exponential backoff algorithm for retrying. If fetching OTKs fails on message N, retry on messages N+1, N+2, N+4, N+8, ... The factor should be fine-tuned, of course.
One flaw with this approach is that the app might be suspended for quite a while, so a significant amount of time would pass. The timeout approach would notice this and retry sooner, while an exponential backoff counting message numbers would not.
I agree that it would be neat to make this completely deterministic based on the message number, but I don't think that it would produce the desired behavior.
See also https://github.com/vector-im/element-web/issues/24138 for how not to do it in EW.
What I implemented, albeit not yet pushed, is an exponential backoff with a max timeout of 15 minutes. That handles the failures field for the /keys/query as well as the /keys/claim endpoint. The failures field gets populated with homeservers that could not be reached.
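For illustration, a capped exponential backoff over the servers listed in failures could look roughly like this (a sketch with hypothetical names, not the code that was actually pushed):

    use std::collections::HashMap;
    use std::time::{Duration, Instant};

    // Hypothetical sketch of a capped exponential backoff for homeservers that
    // showed up in the failures field of /keys/query or /keys/claim.
    struct ServerBackoff {
        max_delay: Duration,
        state: HashMap<String, (u32, Instant)>, // server name -> (failure count, last failure)
    }

    impl ServerBackoff {
        fn new() -> Self {
            Self { max_delay: Duration::from_secs(15 * 60), state: HashMap::new() }
        }

        // Record that `server` appeared in the failures field of a response.
        fn mark_failed(&mut self, server: &str) {
            let entry = self.state.entry(server.to_owned()).or_insert((0, Instant::now()));
            entry.0 += 1;
            entry.1 = Instant::now();
        }

        // Clear the backoff once a request to `server` succeeded again.
        fn mark_succeeded(&mut self, server: &str) {
            self.state.remove(server);
        }

        // Is it time to contact `server` again?
        fn should_retry(&self, server: &str) -> bool {
            match self.state.get(server) {
                None => true,
                Some((failures, last)) => {
                    let delay = Duration::from_secs(2u64.saturating_pow(*failures))
                        .min(self.max_delay);
                    last.elapsed() >= delay
                }
            }
        }
    }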
The other failure mode that we still need to handle before this issue can be closed is OTK exhaustion. Sadly, the server doesn't seem to explicitly tell you that the device is gone or that it doesn't have any OTKs, so we can't just reuse the above-mentioned solution.
But it's better to retry too often than to do what EW does, especially in this age of fallback keys, where OTK exhaustion is a relic of history.
Re-opening since the OTK-exhaustion case is still posing problems and wasn't handled as part of #1315.
Since the server won't tell us which user/device pairs don't have an OTK, we'll have to remember the users for which we sent out a /keys/claim request.
I think that one way to handle this nicely API-wise is to use the following pattern:
struct MissingSessions {
    users: BTreeMap<UserId, DeviceId>,
    machine: OlmMachine,
}

impl MissingSessions {
    /// Build the /keys/claim request for the missing sessions.
    pub fn request(&self) -> KeysClaimRequest {
        ...
    }

    /// Feed the /keys/claim response back into the machine.
    pub fn mark_request_as_sent(&self, response: &KeysClaimResponse) -> Result<(), CryptoStoreError> {
        self.machine.receive_keys_query(&self.users, response)
    }
}

impl OlmMachine {
    pub fn get_missing_sessions(
        &self,
        users: impl IntoIterator<Item = &UserId>,
    ) -> Result<MissingSessions, CryptoStoreError> {
        ...
    }
}
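The call site for this pattern would then look roughly like the following; send_request here is a hypothetical stand-in for whatever layer performs the actual HTTP round-trip:

    // Hypothetical call site for the proposed pattern.
    async fn establish_missing_sessions(
        machine: &OlmMachine,
        users: &[&UserId],
    ) -> Result<(), CryptoStoreError> {
        // Figure out which devices lack an Olm session and build the request.
        let missing = machine.get_missing_sessions(users.iter().copied())?;
        let request = missing.request();

        // Send the /keys/claim request and feed the response back in.
        let response = send_request(request).await?;
        missing.mark_request_as_sent(&response)
    }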
For context, here is a rageshake from a user with a device that has no OTK (including no fallback).
So every message he sends will again trigger a /keys/claim.
The /keys/claim call can take 50 ms, 80 ms, 100 ms, 200 ms, and there was even an occurrence at 3793 ms (claimMissingKeys took: 3793 ms).
The problematic device is some EAX build. The /keys/claim reply is:
{
  "one_time_keys": {},
  "failures": {}
}
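One way for the client to detect this case is to diff the device list it asked for against the devices that actually came back with a key; a quick sketch (hypothetical helper, not the actual matrix-sdk-crypto code):

    use std::collections::BTreeSet;

    // Hypothetical helper: given the (user, device) pairs we put into the
    // /keys/claim request and the pairs that came back with a key, return the
    // ones the server silently omitted, i.e. the likely-exhausted devices that
    // should go onto a backoff list.
    fn exhausted_devices(
        requested: &BTreeSet<(String, String)>,
        returned: &BTreeSet<(String, String)>,
    ) -> BTreeSet<(String, String)> {
        requested.difference(returned).cloned().collect()
    }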
@poljar:
sending the wrong ratchet state
This sounds very bad, and I don't see any explanation of it in the issue. Can you clarify?
@richvdh: It's referring to the contents of this comment.
Ah, the wrong Megolm ratchet. I thought it meant the Olm ratchet!
Out of interest I looked at the anatomy of a slow /keys/claim request. My thoughts are over at https://github.com/matrix-org/synapse/issues/16554. TL;DR: Synapse could do this faster. Still, that's not really the point of this issue.
I am tempted to write an MSC to modify /keys/claim to say that it must return an empty dict for any devices which were unknown or which have run out of OTKs. TBD what to do about federation: can we trust remote servers to behave themselves in this respect?
Will think on it a little more.
So I opened an MSC (https://github.com/matrix-org/matrix-spec-proposals/pull/4072), but I think it's actually pointless, and we might as well fix this on the client side.
Can we please avoid closing partially resolved issues without splitting off the remainder into a separate issue? Talking about our inability to correctly heal from a transient network failure because we'll send the incorrect (advanced) Megolm ratchet state.
I realise the original issue may have been a bit confusing due to bundling these concerns, but we shouldn't be losing information like this.
Can we please avoid closing partially resolved issues without splitting off the remainder into a separate issue? Talking about our inability to correctly heal from a transient network failure because we'll send the incorrect (advanced) Megolm ratchet state.
Sorry, I didn't realise there was any outstanding work left on this issue.