
Users whose servers were unreachable will receive undecryptable messages due to failed OTK claim

Open richvdh opened this issue 1 year ago • 7 comments

  • Alice tries to send a message in a room that includes Bob.
  • Bob's server is offline; Alice's OTK claim therefore times out. Alice sends the message anyway without sharing the key with Bob.
  • Later, Bob comes back online. He receives the room message but not the keys.

Even if Alice subsequently sends another message using the same session, and tries again to share the session key with Bob, it is likely that she will share the Megolm ratchet starting at that second message rather than at the first one.

Bob will never be able to decrypt the message.
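
To illustrate why a later key share does not help, here is a minimal sketch using a toy one-way hash chain. It is not the real Megolm ratchet (which is more elaborate); `ToySession`, `advance` and `message_key` are hypothetical names used only for illustration. The point it shows is that a recipient who is given the ratchet state at index 1 can derive keys for later indices but cannot go back to index 0.

```python
import hashlib


def advance(state: bytes) -> bytes:
    """One-way step: the next state is a hash of the current one."""
    return hashlib.sha256(b"ratchet" + state).digest()


def message_key(state: bytes) -> bytes:
    """Derive a per-message key from a ratchet state."""
    return hashlib.sha256(b"msgkey" + state).digest()


class ToySession:
    """Toy stand-in for a group session; NOT the real Megolm algorithm."""

    def __init__(self, state: bytes, index: int = 0):
        self.state = state
        self.index = index

    def step(self) -> None:
        """Advance the ratchet by one message."""
        self.state = advance(self.state)
        self.index += 1

    def share(self) -> "ToySession":
        """Export the session starting at the sender's current index."""
        return ToySession(self.state, self.index)

    def key_for(self, index: int):
        """Return the key for `index`, or None if it predates this copy."""
        if index < self.index:
            return None  # the hash chain cannot be run backwards
        state = self.state
        for _ in range(index - self.index):
            state = advance(state)
        return message_key(state)


alice = ToySession(b"\x00" * 32)
key_for_first_message = alice.key_for(0)  # used while Bob's server was down
alice.step()                              # ratchet moves on to index 1
bob = alice.share()                       # the later share starts at index 1
```

In this sketch `bob.key_for(1)` matches Alice's key for the second message, while `bob.key_for(0)` returns `None`, which corresponds to the first message staying undecryptable.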


Tasks, with T-shirt sizes

Spec side:

  • [ ] Update MSC4081; we need to add unstable prefixes (S)

Server side:

  • [ ] Fix https://github.com/element-hq/synapse/issues/11374 (M)
  • [ ] Extend /keys/upload impl and e2e_fallback_keys_json table to record "eager_share" flag (S). Remember to add to synapse_port_db.
  • [ ] Trigger m.device_list_update when fallback keys are updated (S)
  • [ ] Include details of fallback_keys in m.device_list_update EDU (L)
  • [ ] When we receive fallback_keys in m.device_list_update EDU, stash them in e2e_fallback_keys_json (or do we need a separate table?) (L)
  • [ ] Update /keys/claim implementation not to set used flag on eager_share keys, in both sqlite and postgres impls (S)
  • [ ] Update /keys/claim implementation to fall back to the local store when the remote server is inoperative.
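
The last two server-side tasks could look roughly like the sketch below. This is not Synapse code: the in-memory dict stands in for `e2e_fallback_keys_json`, and `FallbackKey`, `RemoteUnreachable`, `claim_local` and `claim_remote` are hypothetical names. It assumes that keys cached from a remote `m.device_list_update` are stored in the same shape as local ones.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

DeviceKey = Tuple[str, str]  # (user_id, device_id)


@dataclass
class FallbackKey:
    key_json: str
    eager_share: bool = False
    used: bool = False


class RemoteUnreachable(Exception):
    """Hypothetical error raised when the federation request fails."""


def claim_local(
    store: Dict[DeviceKey, FallbackKey], user_id: str, device_id: str
) -> Optional[str]:
    """Claim a fallback key from a local table."""
    key = store.get((user_id, device_id))
    if key is None:
        return None
    if not key.eager_share:
        # Only non-eager keys are marked as used when claimed.
        key.used = True
    return key.key_json


def claim_remote(
    federation_claim: Callable[[str, str], Optional[str]],
    cached: Dict[DeviceKey, FallbackKey],
    user_id: str,
    device_id: str,
) -> Optional[str]:
    """Try the remote server first; use the cached copy if it is down."""
    try:
        return federation_claim(user_id, device_id)
    except RemoteUnreachable:
        return claim_local(cached, user_id, device_id)
```

With a `federation_claim` that raises `RemoteUnreachable`, `claim_remote` returns the cached eager-share key and leaves its `used` flag unset, so the same key can be handed out again until the owner rotates it.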

matrix-sdk-crypto:

  • [ ] Keep old fallback keys around for longer (M).
  • [ ] Ignore device_unused_fallback_key_types in /sync, and instead rotate keys when the current one is old, or has been used (M).
  • [ ] Set eager_share_fallback_keys flag in /keys/upload request (S)
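
As a rough illustration of the rotation rule in the second item, the sketch below rotates when the current key has been used or has passed an age limit. matrix-sdk-crypto is written in Rust; Python is used here only for brevity. `FallbackKeyState`, `should_rotate` and the one-week `MAX_KEY_AGE_SECONDS` are assumptions, not values taken from the SDK or from MSC4081.

```python
from dataclasses import dataclass

# Hypothetical age limit; the real threshold is not specified in this issue.
MAX_KEY_AGE_SECONDS = 7 * 24 * 60 * 60


@dataclass
class FallbackKeyState:
    created_at: float   # seconds since the epoch
    used: bool = False  # set once the client knows the key has been used


def should_rotate(
    key: FallbackKeyState, now: float, max_age: float = MAX_KEY_AGE_SECONDS
) -> bool:
    """Rotate when the current key has been used or is too old."""
    return key.used or (now - key.created_at) >= max_age
```

The decision here depends only on client-side state, which is the property the task asks for: it no longer relies on `device_unused_fallback_key_types` from `/sync`.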

Testing:

  • Write a complement-crypto test for this scenario (L)

richvdh avatar Oct 19 '23 21:10 richvdh

Duplicate of #2153

richvdh avatar Oct 19 '23 21:10 richvdh

Actually I think this is clearer than #2153, so closing the other.

richvdh avatar Jan 12 '24 17:01 richvdh

https://github.com/matrix-org/matrix-spec-proposals/pull/4081 proposes a way to fix this.

richvdh avatar Jan 12 '24 17:01 richvdh

To port some of the possible solution thoughts from #2153:

  • Alice's client should maintain a persisted queue of Olm sessions that have not yet been set up, and retry them.
  • Alice's server could nudge Alice's client (e.g. by push) if it spots that Bob's server has come back, so Alice's client can retry setting up Olm.
  • MSC4081 is all very well, but it doesn't provide a full solution - you still have the problem that if Bob caches a stale fallback key for Alice, then the session won't set up, and Bob will need to be nudged by his server once it learns that Alice's device list has changed - cf. https://github.com/matrix-org/matrix-spec-proposals/pull/4081/files#r1451581648
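
The persisted retry queue from the first point might be sketched as follows. This is only an illustration: `PendingOlmQueue` and its JSON-file storage are hypothetical, and `try_setup` stands in for whatever the client does to claim a key and establish the Olm session.

```python
import json
from pathlib import Path
from typing import Callable, Set, Tuple

Device = Tuple[str, str]  # (user_id, device_id)


class PendingOlmQueue:
    """Devices we still owe an Olm session, persisted across restarts."""

    def __init__(self, path: str):
        self.path = Path(path)
        self.pending: Set[Device] = set()
        if self.path.exists():
            # JSON stores tuples as lists, so convert them back on load.
            self.pending = {tuple(e) for e in json.loads(self.path.read_text())}

    def _save(self) -> None:
        self.path.write_text(json.dumps(sorted(self.pending)))

    def add(self, user_id: str, device_id: str) -> None:
        """Record a device whose key claim failed."""
        self.pending.add((user_id, device_id))
        self._save()

    def retry(self, try_setup: Callable[[str, str], bool]) -> None:
        """Retry every pending device; keep the ones that still fail."""
        for user_id, device_id in sorted(self.pending):
            if try_setup(user_id, device_id):
                self.pending.discard((user_id, device_id))
        self._save()
```

`retry` could be called on a timer, or when the server nudges the client as suggested in the second point. Note that this only re-establishes the Olm session; the client would still need to share the Megolm session from the right index for earlier messages to become decryptable.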

ara4n avatar Jan 13 '24 17:01 ara4n

We'll need to:

  • Deal with device-list-update bugs such as https://github.com/element-hq/synapse/pull/16875
  • Push forward MSC4081, including:
    • Grokking @ara4n's feedback above
    • Server-side changes
    • Client-side changes?
  • Complement-crypto test

richvdh avatar Feb 26 '24 15:02 richvdh

@pmaier1 to check the priority, given that this only happens in uncommon cases.

BillCarsonFr avatar Mar 14 '24 14:03 BillCarsonFr

We concluded that this is low priority: we consider the impact "low" (it only arises in very specific cases) and the effort to fix "high".

pmaier1 avatar Mar 21 '24 12:03 pmaier1