Device lists are desynchronizing
Description
People appear to be receiving outdated device lists from Synapse, regardless of which server they're on. This manifests in the usual ways: wrong device names, devices that should have been deleted still showing up, etc. New devices always seem to be included correctly, but usually alongside a mix of incorrect devices.
```sql
insert into device_lists_remote_resync (user_id, added_ts)
  values ('@victim:example.org', (extract(epoch from now())::bigint * 1000));
```
fixes the problem for the relevant victims.
Appears to be a wider issue than just matrix.org
Known cases:
- Someone deleted their extra devices, but no one saw that.
- Someone cross-signed their new device, but no one believed them.
- Someone renamed their device, but no one saw that.
The trick of renaming a device to try to confuse/convince the server into updating doesn't appear to work anymore either - only the database insert works.
Started appearing months ago.
Is this a duplicate / superset of #8459?
ah ha, I did report it at the time - I'm so smart. I'd close https://github.com/matrix-org/synapse/issues/8459 in favour of this as this is the more general version of "it's all broken", but I'll leave that up to you.
related: #4827?
We have a similar issue: we have a bot with 1500 devices (for E2EE performance testing), but its device list only shows up correctly on some servers. Perhaps this is easier to reproduce with such a large device list?
Just ran into this myself, a friend on matrix.org had about 480 devices, deleted all of them except 4 and my homeserver reported 318 devices still being around. Running the insert query from the OP fixed it for me.
In our case it was https://github.com/matrix-org/synapse/blob/b460a088c647a6d3ea0e5a9f4f80d86bb9e303b3/synapse/handlers/device.py#L902 which stopped syncing our 1500-device list.
Is there a way to force resync? I have a DM with me (from my HS) and a guy from matrix.org. He added a new device and then deleted it, but I still see that device as unverified in DM.
Are people still seeing this issue? Specifically deleting devices not propagating over federation? There's been some work in this area over the past year IIRC.
> Is there a way to force resync? I have a DM with me (from my HS) and a guy from matrix.org. He added a new device and then deleted it, but I still see that device as unverified in DM.
@532910 The DB query in the issue description probably still works
> The DB query in the issue description probably still works
Indeed!
@erikjohnston i'm still facing the issue, yes. On Synapse 1.48.
I'm seeing one unverified device for @sneaker-net:matrix.org from @babolivier:element.io but they've assured me they've verified it on their end.
From my reading of the spec, clients either
- watch `device_lists.changed`/`left` in the `/sync` response, or
- periodically query `/keys/changes` after syncs

to discover that a user's device list has changed, or that they can forget a user's devices entirely.

Assuming my understanding is correct, I see three ways this can go wrong on Synapse's end. Either:
1. the `/sync` response incorrectly calculates the list of `changed` or `left` users in `device_lists`; or
2. the `/keys/changes` response makes such an incorrect calculation; or
3. the `/keys/query` response incorrectly reports the device list for the user in question.
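For reference, the first of those two client-side patterns looks roughly like the sketch below. This is not taken from any real client; the homeserver URL and access token are hypothetical placeholders, but the endpoints and response fields are those in the client-server spec.

```python
# Rough sketch: watch device_lists.changed in /sync and re-query /keys/query
# for the affected users. HOMESERVER and the access token are placeholders.
import requests

HOMESERVER = "https://example.org"
HEADERS = {"Authorization": "Bearer <access_token>"}


def poll_device_list_changes(since=None):
    params = {"timeout": 30000}
    if since is not None:
        params["since"] = since
    sync = requests.get(
        f"{HOMESERVER}/_matrix/client/v3/sync", params=params, headers=HEADERS
    ).json()

    changed = sync.get("device_lists", {}).get("changed", [])
    if changed:
        # Re-fetch the device lists of every user flagged as changed.
        keys = requests.post(
            f"{HOMESERVER}/_matrix/client/v3/keys/query",
            json={"device_keys": {user_id: [] for user_id in changed}},
            headers=HEADERS,
        ).json()
        for user_id, devices in keys.get("device_keys", {}).items():
            print(user_id, sorted(devices))  # device IDs the server now reports

    return sync["next_batch"]
```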
I'm going to ignore (2) because `/keys/changes` is hardly ever used on matrix.org (circa 100 requests per day) compared to `/keys/query` (circa a million per day).
Do we have any idea whether the problems are due to (1) or (3)? Guessing (1) but I'd like to just think things through carefully. Edit: actually, some of the older comments on this issue sound like they're describing (3)?
Edit 17th Jan 2022: I hadn't fully appreciated that this relies on EDUs over federation. We also need to investigate whether something goes wrong when sending or receiving EDUs.
The symptoms feel like they're more pointing towards (3), given viewing the user's device list should fix it. It's worth noting that most of the incorrect cases of device lists are over federation, so possibly Synapse rate limiting itself from querying for updates?
Noting mostly the comment about 480 devices turning into 4, but the server reported 318 - this sounds like after 162 updates over federation, the receiving server decided to rate limit itself and stopped parsing them for some reason. Given 162 isn't really close to a cutoff point that I'd expect to see in code, it sounds time-based (X updates per Y minutes, or something).
Thanks @turt2live, that's very helpful. There is some logic related to rate limiting, though I'm not familiar with the details.
https://github.com/matrix-org/synapse/blob/88a78c6577086527e4569541b09e437a1ca0d1a9/synapse/handlers/device.py#L869-L879
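For readers unfamiliar with that code path, the general shape being discussed is roughly the following. This is an illustrative sketch only, not Synapse's actual implementation: incremental application of `m.device_list_update` EDUs, with a fall-back to a full resync from the origin server when the receiver believes it has missed updates. A bug or over-eager throttling in either branch would leave the cached list stale.

```python
# Illustrative sketch only -- NOT Synapse's actual code. It shows the
# "apply incrementally, or fall back to a full resync" pattern under
# discussion. `fetch_full_device_list` stands in for a call to
# GET /_matrix/federation/v1/user/devices/{userId} on the origin server.
def apply_device_list_update(cache, seen_stream_ids, edu, fetch_full_device_list):
    prev_ids = edu.get("prev_id", [])

    # If we don't recognise the update's predecessors, assume we've missed
    # earlier updates and resync the whole list instead of applying this one.
    if any(p not in seen_stream_ids for p in prev_ids):
        full = fetch_full_device_list(edu["user_id"])
        cache.clear()
        cache.update({d["device_id"]: d for d in full["devices"]})
        seen_stream_ids.add(full["stream_id"])
        return

    # Otherwise apply this single update incrementally.
    if edu.get("deleted"):
        cache.pop(edu["device_id"], None)
    else:
        cache[edu["device_id"]] = {
            "device_display_name": edu.get("device_display_name"),
            "keys": edu.get("keys"),
        }
    seen_stream_ids.add(edu["stream_id"])
```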
I spent today trying to write a dumb Complement test that spams lots of device creations in succession, and checks they can be queried after they appear in a sync result. This isn't reliable unless I add some additional Sleep calls in there. Very odd. Odder still because Complement is running against a monolith-mode Synapse, so there shouldn't be any replication delays to worry about.
> This isn't reliable unless I add some additional Sleep calls in there.
This is explained by the fact that registering a new device and uploading keys sends out two `m.device_list_update` EDUs over federation: one for the display name, and one for the keys. Accounting for that, I don't see any kind of weirdness.
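For reference, the content of an `m.device_list_update` EDU has roughly the following shape (taken from the federation spec; all values below are placeholders). The two updates mentioned above differ only in which of the optional fields are populated.

```python
# Rough shape of an m.device_list_update EDU's content, per the federation
# spec. All values below are placeholders.
device_list_update_content = {
    "user_id": "@alice:example.com",
    "device_id": "EXAMPLE_DEVICE_ID",
    "stream_id": 6,                    # monotonically increasing per origin server
    "prev_id": [5],                    # stream_ids of the updates this one follows
    "deleted": False,                  # True when the device was removed
    "device_display_name": "Alice's phone",  # optional
    "keys": {},                        # optional: the device's identity keys
}
```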
I don't think we can practically make progress without a way to semi-reliably reproduce one of these problems, or else logs from both homeservers (one of them being matrix.org is fine) capturing the problem.
If there's anyone on matrix.org who sees this a lot, please let me know and we can set up additional debug tracing specifically for your mxid.
It may be that the fix to #11727 helps with this.
I had this happen with a matrix.org user FWIW, but I have no idea when it desynchronized, so it's not entirely valuable. Unfortunately I only happen to notice that device lists are wrong when people who have been cross-sign verified send DMs (which doesn't happen all that often), so the new device sticks out as a red alarm badge.
Seemingly had this problem today with my @neilalexander:matrix.org user when I logged in with session ID LZOCIDQOCP. Device list updates don't seem to have been sent properly over federation. Relevant logs on matrix.org should be between Mon 17 Jan ~ 15:00 to 15:40 GMT.
Thanks for the report @neilalexander and @turt2live for your logs too. I couldn't see anything in either neilalexander.dev's or t2l.io's logs indicating that they'd received a device list update for Neil's device. Unfortunately, there weren't many clues on matrix.org to explain this. I think we'll need more debug on Synapse's side: #11760.
For any future readers, it's worth calling out some relevant log lines. When everything works, HSes receiving an update should see "Received txn <TXN_ID>", followed by "Received device list update for ...".
In this example, the HS
- receives a federation transaction
- spots that one of the EDUs within is a device update
- queries the remote HS for the device list
- sees that there isn't a change to record
- finishes serving the original request (`PUT-234062-`)
2022-01-17 19:06:11,277 - synapse.federation.transport.server.federation - 90 - INFO - PUT-234062- Received txn 1639650208722 from example.com. (PDUs: 0, EDUs: 1)
2022-01-17 19:06:11,281 - synapse.handlers.device - 750 - INFO - PUT-234062- Received device list update for @alice:example.com, requiring resync: True. Devices: EXAMPLE_DEVICE_ID
2022-01-17 19:06:12,901 - synapse.http.matrixfederationclient - 291 - INFO - PUT-234062- {GET-O-73050} [example.com] Completed request: 200 OK in 1.62 secs, got 5524 bytes - GET matrix://example.com/_matrix/federation/v1/user/devices/%40alice%3Aexample.com
2022-01-17 19:06:12,904 - root - 951 - INFO - PUT-234062- Skipping device list resync for @alice:example.com, as our cache matches already
2022-01-17 19:06:12,909 - synapse.access.http.8008 - 400 - INFO - PUT-234062- 123.456.789.123 - 8008 - {example.com} Processed request: 1.633sec/0.001sec (0.004sec, 0.000sec) (0.002sec/0.005sec/5) 11B 200 "PUT /_matrix/federation/v1/send/1639650208722 HTTP/1.1" "Synapse/1.49.0" [0 dbevts]
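For context, the `GET /_matrix/federation/v1/user/devices/%40alice%3Aexample.com` request in that trace fetches the origin's full device list, which is then compared against the local cache. Per the federation spec, the response has roughly this shape (all values below are placeholders):

```python
# Rough shape of a /_matrix/federation/v1/user/devices/{userId} response,
# per the federation spec. All values below are placeholders.
user_devices_response = {
    "user_id": "@alice:example.com",
    "stream_id": 5,                    # the origin's current device-list stream position
    "devices": [
        {
            "device_id": "EXAMPLE_DEVICE_ID",
            "device_display_name": "Alice's phone",  # optional
            "keys": {},                # the device's identity keys
        }
    ],
    # master_key and self_signing_key may also be present for cross-signing.
}
```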
I'd probably do some grepping for "device list.*$MXID" and then look at every line containing that request ID.
The sending side logs are terser:
federation_sender4.log:2022-01-17 15:07:25,021 - synapse.federation.sender.transaction_manager - 125 - INFO - federation_transaction_transmission_loop-10645196- - TX [neilalexander.dev] {1642172661024} Sending transaction [1642172661024], (PDUs: 0, EDUs: 1)
federation_sender4.log:2022-01-17 15:07:25,308 - synapse.http.matrixfederationclient - 301 - INFO - federation_transaction_transmission_loop-10645196- - {PUT-O-8624718} [neilalexander.dev] Completed request: 200 OK in 0.29 secs, got 11 bytes - PUT matrix://neilalexander.dev/_matrix/federation/v1/send/1642172661024
federation_sender4.log:2022-01-17 15:07:25,308 - synapse.federation.sender.transaction_manager - 158 - INFO - federation_transaction_transmission_loop-10645196- - TX [neilalexander.dev] {1642172661024} got 200 response
I'd like to understand how (if at all) the transaction ids relate to the stream_id used for device updates.
One thing I noticed: I can't find ANY device list update for @neilalexander:matrix.org. I did however find `14:51:51,262 - synapse.handlers.devicemessage - 193 - INFO - POST-1245690 - Received device message from remote device not in our cache: @neilalexander:matrix.org {'LZOCIDQOCP'}` in my logs. Is device list federation for that user perhaps completely broken, and did my server only update the device list because it got a message from a device which does not exist? That would explain why I see the device but the verification status is wrong.
> Is device list federation for that user perhaps completely broken, and did my server only update the device list because it got a message from a device which does not exist?
This is what I saw in Travis's logs. But I've also seen logs from EMS homeservers which did get a device list update for LZOCIDQOCP.
#11760 adds some additional debugging for this. It's off by default, but can be enabled by adding a logging config entry as follows:
loggers:
# debug for https://github.com/matrix-org/synapse/issues/8631
synapse.8631_debug:
level: DEBUG
I suspect we'll get the most value out of seeing this on the sender side, but there is logging on the receive side as well. I'd be very grateful if anyone who sees this frequently could apply this to their homeserver.
I spent some time getting the debug logs activated, and they are logging all right. I made test accounts on two Synapse homeservers I'm running, and they send device list updates to each other successfully with no problems. I haven't been able to trigger the glitch here. So far, in my experience, it has only happened with device list updates on a matrix.org account not propagating over federation. I intend to do further tests later, directed specifically at adding/removing/updating devices on a matrix.org account.
The debug code seems faulty; on homeservers with it enabled, I can no longer receive messages.
2022-03-23 20:29:02,722 - synapse.access.http.8813 - 448 - INFO - PUT-42 - 54.37.22.104 - 8813 - {zaplog.pro} Processed request: 0.006sec/0.002sec (0.006sec, 0.000sec) (0.000sec/0.000sec/0) 31B 400 "PUT /_matrix/federation/v1/send/1647989527544 HTTP/1.0" "Synapse/1.55.0" [0 dbevts]
2022-03-23 20:29:55,846 - synapse.federation.transport.server.federation - 108 - INFO - PUT-43 - Received txn 1645740405899 from 86thumbs.net. (PDUs: 0, EDUs: 1)
2022-03-23 20:29:55,846 - synapse.federation.transport.server.federation - 126 - ERROR - PUT-43 - 'dict' object has no attribute 'edu_type'
Traceback (most recent call last):
File "/opt/venvs/matrix-synapse/lib/python3.7/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
result = current_context.run(gen.send, result)
StopIteration: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/venvs/matrix-synapse/lib/python3.7/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
result = current_context.run(gen.send, result)
StopIteration: {'ed25519:a_HMBy': FetchKeyResult(verify_key=<nacl.signing.VerifyKey object at 0x7f7c4416b2e8>, valid_until_ts=1648102945835)}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/venvs/matrix-synapse/lib/python3.7/site-packages/synapse/federation/transport/server/federation.py", line 115, in on_PUT
for edu in transaction_data.get("edus", [])
File "/opt/venvs/matrix-synapse/lib/python3.7/site-packages/synapse/federation/transport/server/federation.py", line 116, in <listcomp>
if edu.edu_type in DEVICE_UPDATE_EDUS
AttributeError: 'dict' object has no attribute 'edu_type'
I disabled the 8631-specific debugging on the workers handling inbound requests for /_matrix/federation/v1/send/ and I can receive messages again.
> the debug code seems faulty
Indeed, apologies; that's my bug. #11890 and #12098 together should fix that. Those are included with Synapse 1.54 and above.
I think this is still waiting for people to enable the debug logging and report their findings.
@DMRobertson is that correct?
@richvdh that's right. Lacking a way to reproduce this, we don't yet understand if the problem is in sending out the updates or receiving them.
I've now enabled the additional logging on matrix.org, so if any users on there observe this problem, please let us know so we can investigate.