Device lists are desynchronizing
Description
People appear to be receiving outdated device lists from Synapse, regardless of which server they're on. This manifests in the usual ways: wrong device names, devices that should have been deleted still showing up, etc. New devices always seem to be included correctly, but usually alongside a mix of incorrect devices.
```sql
insert into device_lists_remote_resync (user_id, added_ts)
  values ('@victim:example.org', (extract(epoch from now())::bigint * 1000));
```
fixes the problem for the relevant victims.
Appears to be a wider issue than just matrix.org
Known cases:
- Someone deleted their extra devices, but no one saw that.
- Someone cross-signed their new device, but no one believed them.
- Someone renamed their device, but no one saw that.
The trick of renaming a device to try to confuse/convince the server into updating doesn't appear to work anymore either - only the database insert works.
Started appearing months ago.
Is this a duplicate / superset of #8459?
ah ha, I did report it at the time - I'm so smart. I'd close https://github.com/matrix-org/synapse/issues/8459 in favour of this as this is the more general version of "it's all broken", but I'll leave that up to you.
related: #4827?
We have a similar issue: we have a bot with 1500 devices (for E2EE performance testing), but its device list only shows up correctly on some servers. Perhaps this is easier to reproduce with such a large device list?
Just ran into this myself, a friend on matrix.org had about 480 devices, deleted all of them except 4 and my homeserver reported 318 devices still being around. Running the insert query from the OP fixed it for me.
In our case it was https://github.com/matrix-org/synapse/blob/b460a088c647a6d3ea0e5a9f4f80d86bb9e303b3/synapse/handlers/device.py#L902 which stopped syncing our 1500-device list.
Is there a way to force resync? I have a DM with me (from my HS) and a guy from matrix.org. He added a new device and then deleted it, but I still see that device as unverified in DM.
Are people still seeing this issue? Specifically deleting devices not propagating over federation? There's been some work in this area over the past year IIRC.
> Is there a way to force resync? I have a DM with me (from my HS) and a guy from matrix.org. He added a new device and then deleted it, but I still see that device as unverified in DM.
@532910 The DB query in the issue description probably still works
> The DB query in the issue description probably still works
Indeed!
@erikjohnston i'm still facing the issue, yes. On Synapse 1.48.
I'm seeing one unverified device for @sneaker-net:matrix.org from @babolivier:element.io but they've assured me they've verified it on their end.
From my reading of the spec, clients either
- watch `device_lists.changed`/`left` in the `/sync` response, or
- periodically query `/keys/changes` after syncs

to discover that a user's device list has changed, or that they can forget a user's devices entirely.

Assuming my understanding is correct, I see three ways this can go wrong on Synapse's end. Either:
1. the `/sync` response incorrectly calculates the list of `changed` or `left` users in `device_lists`; or
2. the `/keys/changes` response makes such an incorrect calculation; or
3. the `/keys/query` response incorrectly reports the device list for the user in question.
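For reference, the first of those two client-side patterns looks roughly like the sketch below. This is not taken from any real client; the homeserver URL and access token are hypothetical placeholders, but the endpoints and response fields are those in the client-server spec.

```python
# Rough sketch: watch device_lists.changed in /sync and re-query /keys/query
# for the affected users. HOMESERVER and the access token are placeholders.
import requests

HOMESERVER = "https://example.org"
HEADERS = {"Authorization": "Bearer <access_token>"}


def poll_device_list_changes(since=None):
    params = {"timeout": 30000}
    if since is not None:
        params["since"] = since
    sync = requests.get(
        f"{HOMESERVER}/_matrix/client/v3/sync", params=params, headers=HEADERS
    ).json()

    changed = sync.get("device_lists", {}).get("changed", [])
    if changed:
        # Re-fetch the device lists of every user flagged as changed.
        keys = requests.post(
            f"{HOMESERVER}/_matrix/client/v3/keys/query",
            json={"device_keys": {user_id: [] for user_id in changed}},
            headers=HEADERS,
        ).json()
        for user_id, devices in keys.get("device_keys", {}).items():
            print(user_id, sorted(devices))  # device IDs the server now reports

    return sync["next_batch"]
```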
I'm going to ignore (2) because `/keys/changes` is hardly ever used on matrix.org (circa 100 requests per day) compared to `/keys/query` (circa a million per day).
Do we have any idea whether the problems are due to (1) or (3)? Guessing (1) but I'd like to just think things through carefully. Edit: actually, some of the older comments on this issue sound like they're describing (3)?
Edit 17th Jan 2022: I hadn't fully appreciated that this relies on EDUs over federation. We also need to investigate whether something goes wrong when sending or receiving EDUs.
The symptoms feel like they're more pointing towards (3), given viewing the user's device list should fix it. It's worth noting that most of the incorrect cases of device lists are over federation, so possibly Synapse rate limiting itself from querying for updates?
Noting mostly the comment about 480 devices turning into 4, but the server reported 318 - this sounds like after 162 updates over federation, the receiving server decided to rate limit itself and stopped parsing them for some reason. Given 162 isn't really close to a cutoff point that I'd expect to see in code, it sounds time-based (X updates per Y minutes, or something).
Thanks @turt2live, that's very helpful. There is some logic related to rate limiting, though I'm not familiar with the details.
https://github.com/matrix-org/synapse/blob/88a78c6577086527e4569541b09e437a1ca0d1a9/synapse/handlers/device.py#L869-L879
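For readers unfamiliar with that code path, the general shape being discussed is roughly the following. This is an illustrative sketch only, not Synapse's actual implementation: incremental application of `m.device_list_update` EDUs, with a fall-back to a full resync from the origin server when the receiver believes it has missed updates. A bug or over-eager throttling in either branch would leave the cached list stale.

```python
# Illustrative sketch only -- NOT Synapse's actual code. It shows the
# "apply incrementally, or fall back to a full resync" pattern under
# discussion. `fetch_full_device_list` stands in for a call to
# GET /_matrix/federation/v1/user/devices/{userId} on the origin server.
def apply_device_list_update(cache, seen_stream_ids, edu, fetch_full_device_list):
    prev_ids = edu.get("prev_id", [])

    # If we don't recognise the update's predecessors, assume we've missed
    # earlier updates and resync the whole list instead of applying this one.
    if any(p not in seen_stream_ids for p in prev_ids):
        full = fetch_full_device_list(edu["user_id"])
        cache.clear()
        cache.update({d["device_id"]: d for d in full["devices"]})
        seen_stream_ids.add(full["stream_id"])
        return

    # Otherwise apply this single update incrementally.
    if edu.get("deleted"):
        cache.pop(edu["device_id"], None)
    else:
        cache[edu["device_id"]] = {
            "device_display_name": edu.get("device_display_name"),
            "keys": edu.get("keys"),
        }
    seen_stream_ids.add(edu["stream_id"])
```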
I spent today trying to write a dumb Complement test that spams lots of device creations in succession, and checks they can be queried after they appear in a sync result. This isn't reliable unless I add some additional Sleep calls in there. Very odd. Odder still because Complement is running against a monolith-mode Synapse, so there shouldn't be any replication delays to worry about.
> This isn't reliable unless I add some additional Sleep calls in there.
This is explained by the fact that registering a new device and uploading keys sends out two `m.device_list_update` EDUs over federation: one for the display name, and one for the keys. Accounting for that, I don't see any kind of weirdness.
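For reference, the content of an `m.device_list_update` EDU has roughly the following shape (taken from the federation spec; all values below are placeholders). The two updates mentioned above differ only in which of the optional fields are populated.

```python
# Rough shape of an m.device_list_update EDU's content, per the federation
# spec. All values below are placeholders.
device_list_update_content = {
    "user_id": "@alice:example.com",
    "device_id": "EXAMPLE_DEVICE_ID",
    "stream_id": 6,                    # monotonically increasing per origin server
    "prev_id": [5],                    # stream_ids of the updates this one follows
    "deleted": False,                  # True when the device was removed
    "device_display_name": "Alice's phone",  # optional
    "keys": {},                        # optional: the device's identity keys
}
```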
I don't think we can practically make progress without a way to semi-reliably reproduce one of these problems, or else logs from both homeservers (one of them being matrix.org is fine) capturing the problem.
If there's anyone on matrix.org who sees this a lot, please let me know and we can set up additional debug tracing specifically for your mxid.
It may be that the fix to #11727 helps with this.
I had this happen with a matrix.org user FWIW, but I have no idea when it desynchronized, so it's not entirely valuable. Unfortunately I only happen to notice that device lists are wrong when people who have been cross-sign verified send DMs (which doesn't happen all that often), so the new device sticks out as a red alarm badge.
Seemingly had this problem today with my @neilalexander:matrix.org user when I logged in with session ID LZOCIDQOCP. Device list updates don't seem to have been sent properly over federation. Relevant logs on matrix.org should be between Mon 17 Jan ~ 15:00 to 15:40 GMT.
Thanks for the report @neilalexander and @turt2live for your logs too. I couldn't see anything in either neilalexander.dev's or t2l.io's logs indicating that they'd received a device list update for Neil's device. Unfortunately, there weren't many clues on matrix.org to explain this. I think we'll need more debug on Synapse's side: #11760.
For any future readers, it's worth calling out some relevant log lines. When everything works, HSes receiving an update should see "Received txn <TXN_ID>", followed by "Received device list update for ...".
In this example, the HS
- receives a federation transaction
- spots that one of the EDUs within is a device update
- queries the remote HS for the device list
- sees that there isn't a change to record
- finishes serving the original request (`PUT-234062-`)
2022-01-17 19:06:11,277 - synapse.federation.transport.server.federation - 90 - INFO - PUT-234062- Received txn 1639650208722 from example.com. (PDUs: 0, EDUs: 1)
2022-01-17 19:06:11,281 - synapse.handlers.device - 750 - INFO - PUT-234062- Received device list update for @alice:example.com, requiring resync: True. Devices: EXAMPLE_DEVICE_ID
2022-01-17 19:06:12,901 - synapse.http.matrixfederationclient - 291 - INFO - PUT-234062- {GET-O-73050} [example.com] Completed request: 200 OK in 1.62 secs, got 5524 bytes - GET matrix://example.com/_matrix/federation/v1/user/devices/%40alice%3Aexample.com
2022-01-17 19:06:12,904 - root - 951 - INFO - PUT-234062- Skipping device list resync for @alice:example.com, as our cache matches already
2022-01-17 19:06:12,909 - synapse.access.http.8008 - 400 - INFO - PUT-234062- 123.456.789.123 - 8008 - {example.com} Processed request: 1.633sec/0.001sec (0.004sec, 0.000sec) (0.002sec/0.005sec/5) 11B 200 "PUT /_matrix/federation/v1/send/1639650208722 HTTP/1.1" "Synapse/1.49.0" [0 dbevts]
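For context, the `GET /_matrix/federation/v1/user/devices/%40alice%3Aexample.com` request in that trace fetches the origin's full device list, which is then compared against the local cache. Per the federation spec, the response has roughly this shape (all values below are placeholders):

```python
# Rough shape of a /_matrix/federation/v1/user/devices/{userId} response,
# per the federation spec. All values below are placeholders.
user_devices_response = {
    "user_id": "@alice:example.com",
    "stream_id": 5,                    # the origin's current device-list stream position
    "devices": [
        {
            "device_id": "EXAMPLE_DEVICE_ID",
            "device_display_name": "Alice's phone",  # optional
            "keys": {},                # the device's identity keys
        }
    ],
    # master_key and self_signing_key may also be present for cross-signing.
}
```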
I'd probably do some grepping for "device list.*$MXID" and then look at every line containing that request ID.
The sending side logs are terser:
federation_sender4.log:2022-01-17 15:07:25,021 - synapse.federation.sender.transaction_manager - 125 - INFO - federation_transaction_transmission_loop-10645196- - TX [neilalexander.dev] {1642172661024} Sending transaction [1642172661024], (PDUs: 0, EDUs: 1)
federation_sender4.log:2022-01-17 15:07:25,308 - synapse.http.matrixfederationclient - 301 - INFO - federation_transaction_transmission_loop-10645196- - {PUT-O-8624718} [neilalexander.dev] Completed request: 200 OK in 0.29 secs, got 11 bytes - PUT matrix://neilalexander.dev/_matrix/federation/v1/send/1642172661024
federation_sender4.log:2022-01-17 15:07:25,308 - synapse.federation.sender.transaction_manager - 158 - INFO - federation_transaction_transmission_loop-10645196- - TX [neilalexander.dev] {1642172661024} got 200 response
I'd like to understand how (if at all) the transaction ids relate to the stream_id used for device updates.
One thing I noticed: I can't find ANY device list update for @neilalexander:matrix.org. I did however find `14:51:51,262 - synapse.handlers.devicemessage - 193 - INFO - POST-1245690 - Received device message from remote device not in our cache: @neilalexander:matrix.org {'LZOCIDQOCP'}` in my logs. Is device list federation for that user perhaps completely broken, and did my server only update the device list because it got a message from a device which does not exist? That would explain why I see the device but the verification status is wrong.
> Is device list federation for that user perhaps completely broken, and did my server only update the device list because it got a message from a device which does not exist?
This is what I saw in Travis's logs. But I've also seen logs from EMS homeservers which did get a device list update for LZOCIDQOCP.
#11760 adds some additional debugging for this. It's off by default, but can be enabled by adding a logging config entry as follows:
loggers:
# debug for https://github.com/matrix-org/synapse/issues/8631
synapse.8631_debug:
level: DEBUG
I suspect we'll get the most value out of seeing this on the sender side, but there is logging on the receive side as well. I'd be very grateful if anyone who sees this frequently could apply this to their homeserver.
I spent some time getting the debug logs activated, and they are logging all right. I made test accounts on two Synapse homeservers I'm running, and they send device list updates to each other successfully with no problems. I haven't been able to trigger the glitch here. So far, in my experience, it has only happened with device list updates on a matrix.org account not propagating over federation. I intend to do further tests later, directed specifically at adding/removing/updating devices on a matrix.org account.
The debug code seems faulty; on homeservers with it enabled, I can no longer receive messages.
2022-03-23 20:29:02,722 - synapse.access.http.8813 - 448 - INFO - PUT-42 - 54.37.22.104 - 8813 - {zaplog.pro} Processed request: 0.006sec/0.002sec (0.006sec, 0.000sec) (0.000sec/0.000sec/0) 31B 400 "PUT /_matrix/federation/v1/send/1647989527544 HTTP/1.0" "Synapse/1.55.0" [0 dbevts]
2022-03-23 20:29:55,846 - synapse.federation.transport.server.federation - 108 - INFO - PUT-43 - Received txn 1645740405899 from 86thumbs.net. (PDUs: 0, EDUs: 1)
2022-03-23 20:29:55,846 - synapse.federation.transport.server.federation - 126 - ERROR - PUT-43 - 'dict' object has no attribute 'edu_type'
Traceback (most recent call last):
File "/opt/venvs/matrix-synapse/lib/python3.7/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
result = current_context.run(gen.send, result)
StopIteration: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/venvs/matrix-synapse/lib/python3.7/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
result = current_context.run(gen.send, result)
StopIteration: {'ed25519:a_HMBy': FetchKeyResult(verify_key=<nacl.signing.VerifyKey object at 0x7f7c4416b2e8>, valid_until_ts=1648102945835)}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/venvs/matrix-synapse/lib/python3.7/site-packages/synapse/federation/transport/server/federation.py", line 115, in on_PUT
for edu in transaction_data.get("edus", [])
File "/opt/venvs/matrix-synapse/lib/python3.7/site-packages/synapse/federation/transport/server/federation.py", line 116, in <listcomp>
if edu.edu_type in DEVICE_UPDATE_EDUS
AttributeError: 'dict' object has no attribute 'edu_type'
I disabled the 8631-specific debugging on the workers handling inbound requests for /_matrix/federation/v1/send/ and I can receive messages again.
> the debug code seems faulty
Indeed, apologies; that's my bug. #11890 and #12098 together should fix that. Those are included with Synapse 1.54 and above.
I think this is still waiting for people to enable the debug logging and report their findings.
@DMRobertson is that correct?
@richvdh that's right. Lacking a way to reproduce this, we don't yet understand if the problem is in sending out the updates or receiving them.
I've now enabled the additional logging on matrix.org, so if any users on there observe this problem, please let us know so we can investigate.