element-web icon indicating copy to clipboard operation
element-web copied to clipboard

Element Web takes (up to) 30 seconds to send secrets after successful verification

Open richvdh opened this issue 3 months ago • 14 comments

Steps to reproduce

  1. Observe "Your key storage is out of sync" on EX: Image
  2. Verify a against EW
  3. Wait
  4. Wait a bit longer
  5. Banner is still there
  6. Eventually, banner disappears

Outcome

The problem appears to be that, having received the m.secret.request, EW is waiting for a /sync timeout before it sends the reply.

richvdh avatar Oct 09 '25 15:10 richvdh

It would be better if the secrets were sent as part of the verification exchange (https://github.com/matrix-org/matrix-spec/issues/2209), but we can make this much better in the meantime

richvdh avatar Oct 09 '25 15:10 richvdh

Seems to also happen when EXI is the old device, and ED is the new device, as per https://github.com/element-hq/element-web-rageshakes/issues/31223. But IIRC key sharing is done in the rust crypto crate, so probably handled by the same code in all the clients.

uhoreg avatar Nov 18 '25 03:11 uhoreg

My reading of the element-web and matrix-js-sdk is that they don't send any m.secret.request or m.secret.send to-device events themselves. Tracing where they are sent in matrix-rust-sdk and how that gets called.

andybalaam avatar Dec 12 '25 13:12 andybalaam

matrix-js-sdk calls registerReceiveSecretCallback inside initOlmMachine in rust-crypto, but this is about receiving secrets, not sending them. The docs here say that rust-sdk handles most of the secrets internally.

andybalaam avatar Dec 12 '25 13:12 andybalaam

matrix-js-sdk calls OlmMachine::outgoing_requests on each /sync response to see if there are any "send outgoing to-device messages" requests that need to be sent. I was wondering if there was a situation in which that call happened too quickly after receiving the m.secret.request message, so that then we had to wait for the next iteration of the loop.

richvdh avatar Dec 12 '25 15:12 richvdh

In sync.ts we call processToDeviceMessages which calls client.emit(ClientEvent.ReceivedToDeviceMessage, processedEvent). I could well imagine that this event processing might not be synchronous. Continuing to investigate.

andybalaam avatar Dec 12 '25 15:12 andybalaam

I can't find anyone listening for ClientEvent.ReceivedToDeviceMessage (or the deprecated ToDeviceEvent) so I am guessing I am missing the part where the rust-sdk processes these events without using the event emitting mechanism.

andybalaam avatar Dec 12 '25 15:12 andybalaam

Yep, in preprocessToDeviceMessages we call receiveSyncChanges that ends up in matrix_sdk_crypto_wasm::machine::OlmMachine::receive_sync_changes

andybalaam avatar Dec 12 '25 15:12 andybalaam

Then we receive a secret request, inside matrix_sdk_crypto::gossiping::machine::GossipMachine::receive_event we do:

self.inner.incoming_key_requests.write().insert(request_info, event);

and then return, so we are relying on some later code to read these incoming requests and process them.

andybalaam avatar Dec 12 '25 16:12 andybalaam

We process incoming_key_requests in matrix_sdk_crypto::gossiping::machine::GossipMachine::collect_incoming_key_requests which calls handle_secret_request, which calls share_secret, which writes into outgoing_requests, which is presumably what is picked up by outgoing_requests in matrix-js-sdk.

So the question is: do we call collect_incoming_key_requests within the sync response code?

andybalaam avatar Dec 12 '25 16:12 andybalaam

Yes, we call collect_incoming_key_requests within preprocess_sync_changes, just after the receive_to_device_event calls are made, so this should work.

andybalaam avatar Dec 12 '25 16:12 andybalaam

I am wondering whether these outgoing_requests objects are separate.

andybalaam avatar Dec 12 '25 16:12 andybalaam

matrix_sdk_crypto_ffi::machine::OlmMachine::outgoing_requests calls matrix_sdk_crypto::machine::OlmMachine::outgoing_requests, which calls matrix_sdk_crypto::gossiping::machine::GossipMachine::outgoing_to_device_requests which uses self.inner.outgoing_requests.read().values() to get the same outgoing_requests we pushed to above.

andybalaam avatar Dec 12 '25 16:12 andybalaam

So I'm pretty stuck because it really looks from the rageshake like we pushed an m.secret.send into outgoing_requests and then didn't actually send it until an additional sync request timed out, but I can't see why, because it looks like we wait for all the processing of to-device messages to complete before we send those outgoing requests.

andybalaam avatar Dec 12 '25 16:12 andybalaam

OK, did some investigation with @richvdh this morning, and we have a theory of what happened that we are quite confident in. It looks like when the verification request was sent, there was no existing Olm session with the requesting device, but we needed an Olm session to send the secret.

So we made a /keys/claim request to get OTKs so we could establish an Olm session, and this request went out after the new sync request was sent. After the /keys/claim request was processed, we continued with the secret sending, and queued the secret to be sent, but it wasn't actually sent until the sync request timed out.

The secret send request was queued during the processing of the /keys/claim request, while we were processing the outgoing messages. So we queued a new outgoing request while we were processing outgoing requests, and that didn't get noticed until the next time we finished sync.

We propose to fix this by adding an extra check after processing outgoing messages, so if new messages have appeared during processing, we process them immediately.

andybalaam avatar Dec 15 '25 12:12 andybalaam