matrix-rust-sdk
OTK fetching failure woes (attempting to establish Olm sessions too often, sending the wrong ratchet state)
Sending a message in a room is a three-step process (sketched in code after the list):
- Collect all the devices that we don't have an established Olm session with and claim one-time keys for them
- Encrypt a room key for all the devices that are participating in the room and send the key via a to-device message
- Encrypt and send the message
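To make the flow concrete, here is a minimal sketch of the three steps; every type and helper in it is a hypothetical placeholder rather than the actual matrix-sdk-crypto API:

    // Minimal sketch of the three-step send flow. All names here
    // (Device, RoomKey, the helper functions) are hypothetical placeholders.
    struct Device;
    struct RoomKey;

    fn devices_without_olm_session() -> Vec<Device> {
        Vec::new()
    }

    fn claim_one_time_keys(_devices: &[Device]) {
        // Step 1: one /keys/claim round-trip to the homeserver.
    }

    fn share_room_key(_key: &RoomKey) {
        // Step 2: encrypt the room key for every participating device and
        // send it out via to-device messages.
    }

    fn encrypt_and_send(_key: &RoomKey, _plaintext: &str) {
        // Step 3: encrypt the actual event with the room key and send it.
    }

    fn send_message(plaintext: &str) {
        let missing = devices_without_olm_session();
        if !missing.is_empty() {
            claim_one_time_keys(&missing);
        }

        let room_key = RoomKey;
        share_room_key(&room_key);
        encrypt_and_send(&room_key, plaintext);
    }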
Steps 1 and 2 can later be skipped if all the devices have received the room key. Sadly, step 1 can fail to establish an Olm session if a particular device has depleted its one-time keys. This is fixed by introducing a fallback key, but not every device uploads a fallback key yet.
This means that step 1 will be retried for every message we send if a room contains such a depleted device. In test scenarios this isn't terribly slow, but the request to claim one-time keys can take a while. This manifests itself as slowness while we send room messages.
The server doesn't notify us when a device uploads a fresh set of one-time keys, so we'll have to introduce a timeout instead.
Remember when we last claimed keys for each user/device pair and don't try again until a certain timeout has passed. All of this can stay in memory, since we don't expect a large number of dead devices, and retrying when the client restarts sounds sensible as well.
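A minimal in-memory sketch of that idea, assuming a flat timeout and plain strings for the IDs (hypothetical names, not the actual matrix-sdk-crypto types):

    use std::collections::HashMap;
    use std::time::{Duration, Instant};

    // Hypothetical in-memory record of when we last tried to claim OTKs for a
    // given user/device pair.
    struct ClaimAttempts {
        timeout: Duration,
        last_attempt: HashMap<(String, String), Instant>, // (user_id, device_id)
    }

    impl ClaimAttempts {
        fn new(timeout: Duration) -> Self {
            Self { timeout, last_attempt: HashMap::new() }
        }

        // Should we include this device in the next /keys/claim request?
        fn should_retry(&self, user_id: &str, device_id: &str) -> bool {
            match self.last_attempt.get(&(user_id.to_owned(), device_id.to_owned())) {
                Some(when) => when.elapsed() >= self.timeout,
                None => true,
            }
        }

        // Record that we just sent a claim for this device.
        fn mark_attempted(&mut self, user_id: &str, device_id: &str) {
            self.last_attempt
                .insert((user_id.to_owned(), device_id.to_owned()), Instant::now());
        }
    }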
Alternative idea to timeout: use an exponential backoff algorithm for retrying. If fetching OTKs fails on message N, retry on messages N+1, N+2, N+4, N+8, ... The factor should be fine-tuned, of course.
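A rough per-device counter for that message-number backoff might look like this (hypothetical names, and the doubling factor would need the fine-tuning mentioned above):

    // Hypothetical per-device counter for the message-number backoff:
    // after the k-th consecutive failure, skip roughly 2^(k-1) messages
    // before the next /keys/claim attempt for that device.
    struct MessageBackoff {
        failures: u32,
        messages_until_retry: u64,
    }

    impl MessageBackoff {
        fn new() -> Self {
            Self { failures: 0, messages_until_retry: 0 }
        }

        // Called for every outgoing message; returns true if we should
        // include the device in the next /keys/claim request.
        fn should_retry_on_this_message(&mut self) -> bool {
            if self.messages_until_retry == 0 {
                true
            } else {
                self.messages_until_retry -= 1;
                false
            }
        }

        // Called when claiming an OTK for the device failed again.
        fn record_failure(&mut self) {
            self.failures += 1;
            self.messages_until_retry = 1u64 << (self.failures - 1).min(16);
        }
    }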
Also, if it fails on message N but we only manage to fetch OTKs and establish an Olm channel on message N+M, we'll only send the room key in the N+M ratcheted state. We should instead send it in state N.
Alternative idea to timeout: use an exponential backoff algorithm for retrying. If fetching OTKs fails on message N, retry on messages N+1, N+2, N+4, N+8, ... The factor should be fine-tuned, of course.
One flaw with this approach is that the app might be suspended for quite a while, so a significant amount of time would pass. The timeout approach would notice this and retry sooner, while an exponential backoff counting message numbers would not.
I agree that it would be neat to make this completely deterministic based on the message number, but I don't think that it would produce the desired behavior.
See also https://github.com/vector-im/element-web/issues/24138 for how not to do it in EW.
What I implemented, albeit not yet pushed, is an exponential backoff with a max timeout of 15 minutes. That handles the failures field for the /keys/query as well as the /keys/claim endpoint. The failures field gets populated with homeservers that could not be reached.
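For illustration, a capped exponential backoff over the servers listed in failures could look roughly like this (a sketch with hypothetical names, not the code that was actually pushed):

    use std::collections::HashMap;
    use std::time::{Duration, Instant};

    // Hypothetical sketch of a capped exponential backoff for homeservers that
    // showed up in the failures field of /keys/query or /keys/claim.
    struct ServerBackoff {
        max_delay: Duration,
        state: HashMap<String, (u32, Instant)>, // server name -> (failure count, last failure)
    }

    impl ServerBackoff {
        fn new() -> Self {
            Self { max_delay: Duration::from_secs(15 * 60), state: HashMap::new() }
        }

        // Record that `server` appeared in the failures field of a response.
        fn mark_failed(&mut self, server: &str) {
            let entry = self.state.entry(server.to_owned()).or_insert((0, Instant::now()));
            entry.0 += 1;
            entry.1 = Instant::now();
        }

        // Clear the backoff once a request to `server` succeeded again.
        fn mark_succeeded(&mut self, server: &str) {
            self.state.remove(server);
        }

        // Is it time to contact `server` again?
        fn should_retry(&self, server: &str) -> bool {
            match self.state.get(server) {
                None => true,
                Some((failures, last)) => {
                    let delay = Duration::from_secs(2u64.saturating_pow(*failures))
                        .min(self.max_delay);
                    last.elapsed() >= delay
                }
            }
        }
    }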
The other failure mode that we still need to handle before this issue can be closed is OTK exhaustion. Sadly, the server doesn't seem to explicitly tell you that the device is gone or that it doesn't have any OTKs, so we can't just reuse the above-mentioned solution.
But it's better to retry too often than to do what EW does, especially in this age of fallback keys, where OTK exhaustion is a relic of history.
Re-opening since the OTK-exhaustion case is still posing problems and wasn't handled as part of #1315.
Since the server won't tell us which user/device pairs don't have an OTK, we'll have to remember the users for which we sent out a /keys/claim request.
I think that one way to handle this nicely API-wise is to use the following pattern:
struct MissingSessions {
    users: BTreeMap<UserId, DeviceId>,
    machine: OlmMachine,
}

impl MissingSessions {
    /// Build the /keys/claim request for the missing sessions.
    pub fn request(&self) -> KeysClaimRequest {
        ...
    }

    /// Feed the /keys/claim response back into the machine.
    pub fn mark_request_as_sent(&self, response: &KeysClaimResponse) -> Result<(), CryptoStoreError> {
        self.machine.receive_keys_query(&self.users, response)
    }
}

impl OlmMachine {
    pub fn get_missing_sessions(
        &self,
        users: impl IntoIterator<Item = &UserId>,
    ) -> Result<MissingSessions, CryptoStoreError> {
        ...
    }
}
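The call site for this pattern would then look roughly like the following; send_request here is a hypothetical stand-in for whatever layer performs the actual HTTP round-trip:

    // Hypothetical call site for the proposed pattern.
    async fn establish_missing_sessions(
        machine: &OlmMachine,
        users: &[&UserId],
    ) -> Result<(), CryptoStoreError> {
        // Figure out which devices lack an Olm session and build the request.
        let missing = machine.get_missing_sessions(users.iter().copied())?;
        let request = missing.request();

        // Send the /keys/claim request and feed the response back in.
        let response = send_request(request).await?;
        missing.mark_request_as_sent(&response)
    }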
For context, here is a rageshake from a user with a device that has no OTK (including no fallback).
So every message he sends will again trigger a /keys/claim.
The /keys/claim call can take 50 ms, 80 ms, 100 ms, 200 ms, and there was even an occurrence at 3793 ms (claimMissingKeys took: 3793 ms).
The problematic device is some EAX build. The /keys/claim reply is:
{
  "one_time_keys": {},
  "failures": {}
}
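One way for the client to detect this case is to diff the device list it asked for against the devices that actually came back with a key; a quick sketch (hypothetical helper, not the actual matrix-sdk-crypto code):

    use std::collections::BTreeSet;

    // Hypothetical helper: given the (user, device) pairs we put into the
    // /keys/claim request and the pairs that came back with a key, return the
    // ones the server silently omitted, i.e. the likely-exhausted devices that
    // should go onto a backoff list.
    fn exhausted_devices(
        requested: &BTreeSet<(String, String)>,
        returned: &BTreeSet<(String, String)>,
    ) -> BTreeSet<(String, String)> {
        requested.difference(returned).cloned().collect()
    }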
@poljar:
sending the wrong ratchet state
This sounds very bad, and I don't see any explanation of it in the issue. Can you clarify?
@richvdh: It's referring to the contents of this comment.
Ah, the wrong Megolm ratchet. I thought it meant the Olm ratchet!
Out of interest I looked at the anatomy of a slow /keys/claim request. My thoughts are over at https://github.com/matrix-org/synapse/issues/16554. TL;DR: Synapse could do this faster. Still, that's not really the point of this issue.
I am tempted to write an MSC to modify /keys/claim to say that it must return an empty dict for any devices which were unknown or which have run out of OTKs. TBD what to do about federation: can we trust remote servers to behave themselves in this respect?
Will think on it a little more.
So I opened an MSC (https://github.com/matrix-org/matrix-spec-proposals/pull/4072), but I think it's actually pointless, and we might as well fix this on the client side.
Can we please avoid closing partially resolved issues without splitting off the remainder into a separate issue? Talking about our inability to correctly heal from a transient network failure because we'll send the incorrect (advanced) Megolm ratchet state.
I realise the original issue may have been a bit confusing due to bundling these concerns, but we shouldn't be losing information like this.
Can we please avoid closing partially resolved issues without splitting off the remainder into a separate issue? Talking about our inability to correctly heal from a transient network failure because we'll send the incorrect (advanced) Megolm ratchet state.
Sorry, I didn't realise there was any outstanding work left on this issue.