dendrite icon indicating copy to clipboard operation
dendrite copied to clipboard

Sending E2EE messages dendrite -> matrix.org randomly fails to provide keys

Open bones-was-here opened this issue 3 years ago • 22 comments

Background information

  • Dendrite version or git SHA: 5106cc807cf22a95420b24f6bfdd5c9ac8aa06de
  • Monolith or Polylith?: monolith
  • SQLite3 or Postgres?: postgres
  • Running in Docker?: no
  • go version: 1.17.6
  • Client used (if applicable): element android, fluffychat android, element desktop/web, nheko

Description

  • What is the problem: "sometimes" the message recipient is not sent keys to decrypt
  • Who is affected: recipients of E2EE messages sent by dendrite users
  • How is this bug manifesting: recipient sees only a notice that a message could not be decrypted, but from the dendrite user's perspective, all the messages are readable in all clients
  • When did this first appear: not sure this has ever been reliable, sometimes it works fine for a long period, other times it randomly fails even though it's the same session sending the messages (and that client has not been restarted since the last time it was working)

Steps to reproduce

"Recently" I've noticed a new issue where clients will constantly spam the server with key related requests.
I had reliability issues with sending E2EE messages before that, but now element in firefox and element desktop and nheko are constantly making key related requests in an endless loop. I don't know how to open a debug output in android but the CPU load on the phone from running Element is continuous and the network never idles when it's open.

Describe how what happens differs from what you expected.

E2EE should reliably work (and if it can't work there should be an error printed somewhere, not a silent failure that gives the impression the message was sent successfully).

bones-was-here avatar Feb 14 '22 06:02 bones-was-here

I am seen a similar problem, most likely the same as @bones-was-here I have tested this between my test account on matrix.org and my personal and main account on hispagatos.org. I can read what my matrix.org account writes to me on a 1:1 but when I write from hispagatos.org account on the other side I get a "message could not be decrypted" and sometimes other similar ones on the matrix.org side. This is noticeable specially on some rooms we host between people on diverse matrix servers, they have been complaining that they can't see any of the people's msg's that come from our dendrite server. cheers.

r3k2 avatar Feb 16 '22 18:02 r3k2

So this could be either:

  • Dendrite isn't telling clients about all the devices for a user, so the client never encrypts messages for the matrix.org device.
  • Dendrite isn't receiving the device list updates from matrix.org over federation for some reason, e.g https://github.com/matrix-org/dendrite/issues/1601

kegsay avatar Feb 17 '22 12:02 kegsay

du -hs jetstream/\$G/streams/*
40K     jetstream/$G/streams/DendriteInputRoomEvent
40K     jetstream/$G/streams/DendriteOutputClientData
2.5G    jetstream/$G/streams/DendriteOutputKeyChangeEvent
492K    jetstream/$G/streams/DendriteOutputReceiptEvent
56K     jetstream/$G/streams/DendriteOutputRoomEvent
56K     jetstream/$G/streams/DendriteOutputSendToDeviceEvent

Edit: just updated to 5dd203fde3d3b86719354245ac341dbf67fa1851 and at startup:

[INF]   Restored 0 messages for stream "DendriteInputRoomEvent"
[INF]   Restored 0 messages for stream "DendriteOutputClientData"
[INF]   Restored 2,243,602 messages for stream "DendriteOutputKeyChangeEvent"
[INF]   Restored 91 messages for stream "DendriteOutputReceiptEvent"
[INF]   Restored 0 messages for stream "DendriteOutputRoomEvent"
[INF]   Restored 0 messages for stream "DendriteOutputSendToDeviceEvent"
[INF]   Recovering 1 consumers for stream - "DendriteInputRoomEvent"
[INF]   Recovering 1 consumers for stream - "DendriteOutputClientData"
[INF]   Recovering 2 consumers for stream - "DendriteOutputKeyChangeEvent"
[INF]   Recovering 2 consumers for stream - "DendriteOutputReceiptEvent"
[INF]   Recovering 2 consumers for stream - "DendriteOutputRoomEvent"
[INF]   Recovering 2 consumers for stream - "DendriteOutputSendToDeviceEvent"

bones-was-here avatar Feb 18 '22 09:02 bones-was-here

Wow that's worrying. That sounds like:

Dendrite isn't telling clients about all the devices for a user, so the client never encrypts messages for the matrix.org device.

kegsay avatar Feb 18 '22 14:02 kegsay

I think it is encrypting, just with no way for the client to decrypt and no way for the sender to know there was a problem. I reproduced it just now:

Decrypted event source

{
  "type": "m.room.message",
  "content": {
    "msgtype": "m.bad.encrypted",
    "body": "** Unable to decrypt: The sender's device has not sent us the keys for this message. **"
  }
}

Original event source

{
  "content": {
    "algorithm": "m.megolm.v1.aes-sha2",
    "ciphertext": "AwgAEpABuR7Y8lZdqVRhQb2/JkXfenfEZn8XmM79aFkgs8Dy/ylf3LpwZWrQfXZFoIMQGfdyStK7bMG7fUSW2L5GII/Fn2U0tB4vLBheYKBRbRL9Tz+HkzjcXjss7gevFq5sM0e8/IxLPEoYZj7krxGyqzLXkMoFCmtdSHdHmB4o9f22rCfgWclhVsp932V/IRp9nZpJFTr00D/PUXbzUq5fs0rFQf+36GiDRfLn+B7ugbke2gEiWPugL/y1Itlfc5/B/i9URWEcLnukj+E6pWtNr8HpHcqLbbVE0dAB",
    "device_id": "8FJwtH0g",
    "sender_key": "nIaJ1o5X4U1aS5wvqKoC/dFDuv6W6l88M8LcovFhEHY",
    "session_id": "f9RYKueJkaczcLEk09ChFg7SFhhAVUv/5VcJAcjAdeU"
  },
  "origin_server_ts": 1645196307027,
  "sender": "@bones_was_here:xonotic.org",
  "type": "m.room.encrypted",
  "unsigned": {},
  "event_id": "$dc0YYjwa-JrOQ-JeIkl-HemImYs0Xiay4wQIl2HvB1I",
  "room_id": "!F9SctfHbpoz7GzEu:xonotic.org"
}

bones-was-here avatar Feb 18 '22 15:02 bones-was-here

The growth rate of stream "DendriteOutputKeyChangeEvent" is currently averaging about 125 messages per minute.
edit: version is 002429c9e24cc746e0929b41eccbe429f89a6e1f built with golang 1.17.7

bones-was-here avatar Feb 19 '22 16:02 bones-was-here

in our case, I had reports of people on small instances getting our keys and eventually been able to read msgs from users on our server like around 9-10 have reported this, but on matrix.org people never can see any of the msg any one on our server post and we have a bunch of people using dendrite daily.

r3k2 avatar Feb 19 '22 21:02 r3k2

As mentioned in chat, the spam mentioned in OP looks like this (in element 1.10.4 in firefox or element desktop)

OPTIONS /_matrix/client/r0/sync?filter=1&timeout=30000&since=s174818_110_193336_6191_171832_174803_9514990
GET     /_matrix/client/r0/sync?filter=1&timeout=30000&since=s174818_110_193336_6191_171832_174803_9514990
OPTIONS /_matrix/client/r0/sync?filter=1&timeout=30000&since=s174818_110_193336_6191_171832_174803_9514994
GET     /_matrix/client/r0/sync?filter=1&timeout=30000&since=s174818_110_193336_6191_171832_174803_9514994
OPTIONS /_matrix/client/r0/sync?filter=1&timeout=30000&since=s174818_110_193336_6191_171832_174803_9514998
GET     /_matrix/client/r0/sync?filter=1&timeout=30000&since=s174818_110_193336_6191_171832_174803_9514998

It continues like that at several requests per second whenever the client(s) are open, with the same behaviour in all new sessions.
This has been happening with dendrite since about 0.6 including 0.6.4 and doesn't happen when logged into matrix.org.

bones-was-here avatar Feb 23 '22 11:02 bones-was-here

0.6.4 E2EE DM update: no change. With fresh sessions element web is sending un-decryptable messages, element android is sending decryptable messages. Same procedure in both: log in, paste in my e2ee key, successfully restore all keys from backup, sessions show as verified in both clients, all e2ee message history is readable.

bones-was-here avatar Feb 23 '22 13:02 bones-was-here

0.6.4 is tending to log several of these when trying to send e2ee DM:
level=error msg="a.DB.CrossSigningSigsForTarget failed" error="context canceled"

bones-was-here avatar Feb 23 '22 14:02 bones-was-here

And possibly some leftover issue from the OTK cleanup #2208 level=error msg="internal.DeviceListCatchup failed" device_id=0kTVEoUO error="Failed to query OTK counts: context canceled" limit=20 req.id=muHaIF1euT2u req.method=GET req.path=/_matrix/client/r0/sync since=s175549_422_194996_6224_171832_175561_9597270 timeout=30s user_id="@bones_was_here:xonotic.org"

bones-was-here avatar Feb 23 '22 14:02 bones-was-here

I changed log level to debug and the homeserver is printing level=debug msg="QueryKeyChanges request off=11272428,to=11272430 response off=11272430 uids=[@user:server]" context=missing at up to 10 per second, which seems like it could be related to (at least some of) the CPU load in #2181 and the steadily growing DendriteOutputKeyChangeEvent in jetstream.

Currently it's the same two users in all the messages, I know both of them and am confident they're not attacking the homeserver :) One is on matrix.org, the other is on a personal homeserver that also runs synapse.

bones-was-here avatar Feb 25 '22 19:02 bones-was-here

It seems to be Nheko clients causing the QueryKeyChanges spam, and Nheko spams the server much faster than my Element client does. But the fact that both clients are spamming the server (with sync requests) suggests they expect something that Dendrite isn't providing.

bones-was-here avatar Feb 25 '22 19:02 bones-was-here

The spamming is fixed by #2233 but at 8dfc958dddc5f7125633f9b5cf55371ab35ceb6c I still send undecryptable E2EE messages from some sessions to some users (for other users, the same session's E2EE messages are decrypted).

bones-was-here avatar Mar 01 '22 15:03 bones-was-here

Is this still the case on 0.7.0 or main?

neilalexander avatar Apr 05 '22 15:04 neilalexander

@neilalexander I will ask my users later on our private channel see if they still had it, Myself last week someone from matrix .org told me he can't read me. I do git pull and ./build.sh daily so not sure which version it was, but tonight I will try to log in with my test account on matrix.org and ask my users and report back. thanks

r3k2 avatar Apr 05 '22 19:04 r3k2

@neilalexander they say is better but it happens at times, and I just remembed to reply here because I logged into my test account on matrix.org and I just replicated it again today still-matrix-issues

r3k2 avatar Apr 08 '22 15:04 r3k2

There seems to be a pattern to the inability of matrix.org users to decrypt messages sent from Element Web on Dendrite.

If I have not started Element Android (eg rebooted phone) recipient gets:
** Unable to decrypt: The sender's device has not sent us the keys for this message. **
And this is permanent (those messages will have to be re-sent).

If I have started it but it's not currently running (eg phone in sleep) recipient gets: ** Unable to decrypt: Error: OLM.UNKNOWN_MESSAGE_INDEX **
And these will tend to become readable when I start up Element Android and view the room in question.

If I keep the phone plugged in with Android dev options set to always keep the screen awake, and I keep Element as the foreground app (just idling at the room list), E2EE works in Element Web.

bones-was-here avatar May 12 '22 16:05 bones-was-here

I can confirm this issue still with dendrite 0.8.8. I can also confirm that it not only happens with element but also with https://github.com/poljar/weechat-matrix which is based on matrix-nio

I will test if https://github.com/matrix-org/dendrite/releases/tag/v0.8.9

Incoming device list updates over federation are now queued in JetStream for processing so that they will no longer block incoming federation transactions and should never end up dropped, which will hopefully help E2EE reliability

solve this issue.

Update Nope still an issue.

Mic92 avatar Jul 05 '22 07:07 Mic92

I've had unable to decrypt errors with both participants using Android client and connected to "same" Dendrite server. I've just upgraded to latest dendrite via pull request, still can't get the keys. https://github.com/matrix-org/dendrite/issues/2436 https://github.com/vector-im/element-web/issues/19748

jittygitty avatar Jul 26 '22 05:07 jittygitty

Adding some logs for this one on my end since I'm able to confirm that dendrite isn't sending keys to matrix:

Sep 12 06:56:37 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:37.519353596Z" level=info msg="Claimed local keys" keys_claimed=1 num_users=1 req.id=PlevnV08Hv0k req.method=POST req.path=/_matrix/client/r0/keys/claim user_id="@timaeos:matrix.nexaeos.io"
Sep 12 06:56:37 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:37.519466177Z" level=info msg="Claiming remote keys from servers" num_servers=1 req.id=PlevnV08Hv0k req.method=POST req.path=/_matrix/client/r0/keys/claim user_id="@timaeos:matrix.nexaeos.io"
Sep 12 06:56:37 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:37.519773184Z" level=error msg="ClaimKeys failed" error=" - (retry_after=616.784405ms, blacklisted=false)" req.id=PlevnV08Hv0k req.method=POST req.path=/_matrix/client/r0/keys/claim server=matrix.org user_id="@timaeos:matrix.nexaeos.io"
Sep 12 06:56:37 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:37.519870513Z" level=info msg="Claimed remote keys" num_keys=0 req.id=PlevnV08Hv0k req.method=POST req.path=/_matrix/client/r0/keys/claim user_id="@timaeos:matrix.nexaeos.io"
Sep 12 06:56:37 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:37.606177266Z" level=info msg="Sending EDU event" destinations=1 edu_type=m.direct_to_device
Sep 12 06:56:37 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:37.614311829Z" level=info msg="Sending EDU event" destinations=1 edu_type=m.direct_to_device
Sep 12 06:56:37 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:37.641433689Z" level=info msg="Sending EDU event" destinations=1 edu_type=m.direct_to_device
Sep 12 06:56:37 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:37.665190181Z" level=info msg="Sending EDU event" destinations=1 edu_type=m.direct_to_device
Sep 12 06:56:37 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:37.701291945Z" level=info msg="Sending EDU event" destinations=1 edu_type=m.direct_to_device
Sep 12 06:56:37 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:37.709796826Z" level=info msg="Sending EDU event" destinations=1 edu_type=m.direct_to_device
Sep 12 06:56:43 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:43.246665128Z" level=info msg="Signalling other goroutines waiting for this goroutine to generate the thumbnail." Height=480 MediaID=FGtJYAWqsqrhLwDAHoCVFnqe Origin=matrix.org ResizeMethod=scale Width=640 req.id=OiTwQouejp0i req.method=GET req.path=/_matrix/media/r0/thumbnail/matrix.org/FGtJYAWqsqrhLwDAHoCVFnqe
Sep 12 06:56:43 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:43.456152258Z" level=info msg="Claimed local keys" keys_claimed=5 num_users=1 req.id=uiv5FNZSgFeq req.method=POST req.path=/_matrix/federation/v1/user/keys/claim
Sep 12 06:56:44 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:44.077599291Z" level=info msg="Suppressing send-to-device: originated elsewhere" other_server=matrix.org
Sep 12 06:56:44 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:44.078814483Z" level=info msg="Suppressing send-to-device: originated elsewhere" other_server=matrix.org
Sep 12 06:56:44 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:44.079266462Z" level=info msg="Suppressing send-to-device: originated elsewhere" other_server=matrix.org
Sep 12 06:56:44 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:44.079613241Z" level=info msg="Suppressing send-to-device: originated elsewhere" other_server=matrix.org
Sep 12 06:56:44 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:44.079961145Z" level=info msg="Suppressing send-to-device: originated elsewhere" other_server=matrix.org
Sep 12 06:56:48 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:48.199090647Z" level=info msg="Signalling other goroutines waiting for this goroutine to generate the thumbnail." Height=480 MediaID=FGtJYAWqsqrhLwDAHoCVFnqe Origin=matrix.org ResizeMethod=scale Width=640 req.id=C2etXtFi6TT9 req.method=GET req.path=/_matrix/media/r0/thumbnail/matrix.org/FGtJYAWqsqrhLwDAHoCVFnqe
Sep 12 06:56:50 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:50.855145602Z" level=info msg="Sending event" destinations=1 event="$PoKx2OFZ2iX0uqNyqJsoJOwF4L5NOpLmD-naXJK9aSA"
Sep 12 06:56:50 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:50.859857005Z" level=info msg="Sent event to roomserver" event_id="$PoKx2OFZ2iX0uqNyqJsoJOwF4L5NOpLmD-naXJK9aSA" req.id=yMxoJoHf9Q18 req.method=PUT req.path="/_matrix/client/r0/rooms/!DRlmScZdcsyonHFLYF:matrix.org/send/m.room.encrypted/m1662983787804.12" room_id="!DRlmScZdcsyonHFLYF:matrix.org" room_version=9 user_id="@timaeos:matrix.nexaeos.io"
Sep 12 06:56:54 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:54.723525650Z" level=info msg="Suppressing send-to-device: originated elsewhere" other_server=matrix.org
Sep 12 06:56:54 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:54.724383197Z" level=info msg="Suppressing send-to-device: originated elsewhere" other_server=matrix.org
Sep 12 06:56:54 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:54.725985600Z" level=info msg="Suppressing send-to-device: originated elsewhere" other_server=matrix.org
Sep 12 06:56:54 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:54.726727981Z" level=info msg="Suppressing send-to-device: originated elsewhere" other_server=matrix.org
Sep 12 06:56:54 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:54.727083493Z" level=info msg="Suppressing send-to-device: originated elsewhere" other_server=matrix.org
Sep 12 06:56:54 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:56:54.727476081Z" level=info msg="Suppressing send-to-device: originated elsewhere" other_server=matrix.org
Sep 12 06:57:03 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:57:03.819156189Z" level=info msg="Sending EDU event" destinations=1 edu_type=m.receipt
Sep 12 06:57:04 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:57:04.145014099Z" level=info msg="Sending EDU event" destinations=1 edu_type=m.typing
Sep 12 06:57:09 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:57:09.770370715Z" level=info msg="Sending EDU event" destinations=1 edu_type=m.typing
Sep 12 06:57:20 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:57:20.916868012Z" level=info msg="Sending event" destinations=1 event="$sBjWMP5OMT3ARnBCbZqqhU8z1qkdfJa9K4FRWFA9n44"
Sep 12 06:57:20 hermes dendrite-start.sh[1494478]: time="2022-09-12T11:57:20.923273642Z" level=info msg="Sent event to roomserver" event_id="$sBjWMP5OMT3ARnBCbZqqhU8z1qkdfJa9K4FRWFA9n44" req.id=MdTdcjsMbWOC req.method=PUT req.path="/_matrix/client/r0/rooms/!DRlmScZdcsyonHFLYF:matrix.org/send/m.room.encrypted/m1662983829147.16" room_id="!DRlmScZdcsyonHFLYF:matrix.org" room_version=9 user_id="@timaeos:matrix.nexaeos.io"

timaeos avatar Sep 12 '22 12:09 timaeos

Just a quick update: I updated to the latest commit After some work clearing out old devices and redoing the device verification on the dendrite server, I'm now only having this problem intermittently in the opposite direction (i.e. dendrite -> matrix e2ee sends keys but matrix -> dendrite doesn't).

It started out failing to send the keys upon the room's creation. I tried to recreate it by leaving the room and creating a new one, however after one room successfully exchanged keys, all subsequent room creations with the same participants successfully exchanged keys. Maybe it has to do with stale keys if there is some time factor to it or maybe a race condition?

timaeos avatar Sep 16 '22 16:09 timaeos

I discussed the issue with a user in the Dendrite chat and was told to run some DB commands that "Instructs Dendrite to re-new the device lists/keys of the given user, so hopefully your client now received a device_list.changed entry for that user and encrypts messages successfully to that user"

I ran these commands on my postgres server, which seems to have resolved the issue for now: SELECT * FROM keyserver_stale_device_lists WHERE user_id = 'MATRIX_USER_ID'; UPDATE keyserver_stale_device_lists SET is_stale = true WHERE user_id = 'MATRIX_USER_ID';

this seems to have helped with the issue. I am not sure how this could be worked into fixing the main issue, but thought it relevant.

5trongthany avatar Sep 29 '22 20:09 5trongthany

@S7evinK appears to be working on making an e

I discussed the issue with a user in the Dendrite chat and was told to run some DB commands that "Instructs Dendrite to re-new the device lists/keys of the given user, so hopefully your client now received a device_list.changed entry for that user and encrypts messages successfully to that user"

I ran these commands on my postgres server, which seems to have resolved the issue for now: SELECT * FROM keyserver_stale_device_lists WHERE user_id = 'MATRIX_USER_ID'; UPDATE keyserver_stale_device_lists SET is_stale = true WHERE user_id = 'MATRIX_USER_ID';

this seems to have helped with the issue. I am not sure how this could be worked into fixing the main issue, but thought it relevant.

@S7evinK appears to be working on an admin endpoint to streamline this procedure without requiring a dendrite server reboot (#2746 ) and messing around in the db.

The most recent commits that deal with parsing commit e1bf709 appears to have helped with e2ee as well. If anyone is still showing this problem, please try updating to the latest commit

timaeos avatar Sep 29 '22 23:09 timaeos

With a recent change /members, is this still happening? Is it better/worse?

S7evinK avatar Nov 04 '22 12:11 S7evinK

e2ee no longer works at all for me, in rooms where it did previously work, even between 2 users of the same dendrite instance.

bones-was-here avatar Nov 04 '22 13:11 bones-was-here

e2ee no longer works at all for me, in rooms where it did previously work, even between 2 users of the same dendrite instance.

Which commit are you on? We've been hearing the opposite overall from most users, that things have improved dramatically since 0.10.6 and the latest main commits.

neilalexander avatar Nov 04 '22 13:11 neilalexander

0.10.6 currently. It got worse some time around 0.8.x or 0.9.x (rooms that used to work most of the time started breaking, the workaround of having all clients open simultaneously no longer helped) and currently seems broken all the time. Very old messages from when it used to work can still be decrypted.

The messages that were sent before the invited user joined are visible now (after the usual clear cache & reload dance) so DM works better in that regard, but those also can't be decrypted if e2ee is enabled.

bones-was-here avatar Nov 04 '22 13:11 bones-was-here

Not sure if relevant to /members but on 0.10.6 when joining a DM via an invite, i still hit the incorrect users counting issue #2514 (it showed a total of 0 people but correctly listed 2). Most rooms have correct member counts but our main room member count still shows an extra 12 users compared to synapse (even on the dendrite that originally created the room).

bones-was-here avatar Nov 04 '22 13:11 bones-was-here

/members is unrelated to #2514, creating a test for this now.

S7evinK avatar Nov 04 '22 13:11 S7evinK