`/sync` randomly tightloops
Description
In my test environment for matrix-media-repo I've got a localhost homeserver running on sqlite in Docker which suddenly starts responding immediately to /sync requests. This causes Element to dutifully send the next request, to which Synapse again responds immediately, and so on.
Steps to reproduce
Unclear.
Homeserver
localhost
Synapse Version
v1.85.2
Installation Method
Docker (matrixdotorg/synapse)
Database
sqlite3
Workers
Single process
Platform
Running in Docker.
Configuration
# Configuration file for Synapse.
#
# This is a YAML file: see [1] for a quick introduction. Note in particular
# that *indentation is important*: all the elements of a list or dictionary
# should have the same indentation.
#
# [1] https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html
#
# For more information on how to configure Synapse, including a complete accounting of
# each option, go to docs/usage/configuration/config_documentation.md or
# https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html
server_name: "localhost"
pid_file: /data/homeserver.pid
listeners:
  - port: 8008
    tls: false
    type: http
    x_forwarded: true
    resources:
      - names: [client, federation]
        compress: false
database:
  name: sqlite3
  args:
    database: /data/homeserver.db
log_config: "/data/localhost.log.config"
media_store_path: /data/media_store
registration_shared_secret: "redacted, but it's public anyways"
report_stats: false
macaroon_secret_key: "redacted, but it's public anyways"
form_secret: "redacted, but it's public anyways"
signing_key_path: "/data/localhost.signing.key"
# Project-specific settings (not part of Synapse defaults)
enable_media_repo: false
enable_registration: true
enable_registration_without_verification: true
trusted_key_servers: []
# vim:ft=yaml
Relevant log output
media_repo_nginx | 172.19.0.1 - - [24/Jun/2023:03:24:45 +0000] "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30870_0_15_114_1_4_30_0_1 HTTP/1.1" 200 295 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
media_repo_synapse | 2023-06-24 03:24:45,058 - synapse.access.http.8008 - 461 - INFO - GET-13264 - ::ffff:172.19.0.2 - 8008 - {@travis:localhost} Processed request: 0.005sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.003sec/4) 283B 200 "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30870_0_15_114_1_4_30_0_1 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]
media_repo_nginx | 172.19.0.1 - - [24/Jun/2023:03:24:45 +0000] "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30870_0_15_114_1_4_30_0_1 HTTP/1.1" 200 295 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
media_repo_synapse | 2023-06-24 03:24:45,093 - synapse.access.http.8008 - 461 - INFO - GET-13265 - ::ffff:172.19.0.2 - 8008 - {@travis:localhost} Processed request: 0.005sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.003sec/4) 283B 200 "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30870_0_15_114_1_4_30_0_1 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]
Anything else that would be useful to know?
Restarting the Docker Compose stack can sometimes help, but it eventually tightloops again.
The logs at the beginning of the loop don't reveal much:
media_repo_synapse | 2023-06-24 03:27:15,741 - synapse.access.http.8008 - 461 - INFO - GET-25 - ::ffff:172.19.0.4 - 8008 - {@travis:localhost} Processed request: 30.007sec/0.000sec (0.000sec, 0.001sec) (0.001sec/0.003sec/4) 438B 200 "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30871_0_14_109_1_4_30_0_1 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]
media_repo_nginx | 172.19.0.1 - - [24/Jun/2023:03:27:15 +0000] "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30871_0_14_109_1_4_30_0_1 HTTP/1.1" 200 450 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
media_repo_synapse | 2023-06-24 03:27:15,781 - synapse.access.http.8008 - 461 - INFO - POST-29 - ::ffff:172.19.0.4 - 8008 - {@travis:localhost} Processed request: 0.006sec/0.000sec (0.000sec, 0.001sec) (0.002sec/0.003sec/3) 2687B 200 "POST /_matrix/client/r0/keys/query HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]
media_repo_nginx | 172.19.0.1 - - [24/Jun/2023:03:27:15 +0000] "POST /_matrix/client/r0/keys/query HTTP/1.1" 200 2699 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
media_repo_synapse | 2023-06-24 03:27:15,784 - synapse.access.http.8008 - 461 - INFO - GET-30 - ::ffff:172.19.0.4 - 8008 - {@travis:localhost} Processed request: 0.008sec/0.000sec (0.000sec, 0.002sec) (0.002sec/0.003sec/4) 283B 200 "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]
media_repo_nginx | 172.19.0.1 - - [24/Jun/2023:03:27:15 +0000] "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.1" 200 295 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
media_repo_synapse | 2023-06-24 03:27:15,829 - synapse.access.http.8008 - 461 - INFO - GET-31 - ::ffff:172.19.0.4 - 8008 - {@travis:localhost} Processed request: 0.005sec/0.000sec (0.000sec, 0.001sec) (0.001sec/0.003sec/4) 283B 200 "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]
media_repo_nginx | 172.19.0.1 - - [24/Jun/2023:03:27:15 +0000] "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.1" 200 295 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
media_repo_synapse | 2023-06-24 03:27:15,864 - synapse.access.http.8008 - 461 - INFO - GET-32 - ::ffff:172.19.0.4 - 8008 - {@travis:localhost} Processed request: 0.005sec/0.000sec (0.000sec, 0.001sec) (0.001sec/0.003sec/4) 283B 200 "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]
media_repo_nginx | 172.19.0.1 - - [24/Jun/2023:03:27:15 +0000] "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.1" 200 295 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
media_repo_synapse | 2023-06-24 03:27:15,899 - synapse.access.http.8008 - 461 - INFO - GET-33 - ::ffff:172.19.0.4 - 8008 - {@travis:localhost} Processed request: 0.005sec/0.000sec (0.000sec, 0.001sec) (0.001sec/0.003sec/4) 283B 200 "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]
media_repo_nginx | 172.19.0.1 - - [24/Jun/2023:03:27:15 +0000] "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.1" 200 295 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
I think that's related to #15082.
Also seeing this in a local development environment with Synapse 1.85.2; it doesn't appear to be related to #15082 because the payloads are not empty.
Element Web's calls to /sync are returning immediately with the following responses.
Three examples; notice they are all exactly the same:
http://localhost:8008/_matrix/client/r0/sync?filter=0&timeout=30000&since=s1_447_0_1_1_1_1_36_0_1
{
  "next_batch": "s1_450_0_1_1_1_1_36_0_1",
  "account_data": {
    "events": [
      {
        "type": "org.matrix.msc3890.local_notification_settings.ZSBKLRYBUA",
        "content": {
          "is_silenced": false
        }
      }
    ]
  },
  "device_one_time_keys_count": {
    "signed_curve25519": 50
  },
  "org.matrix.msc2732.device_unused_fallback_key_types": [
    "signed_curve25519"
  ],
  "device_unused_fallback_key_types": [
    "signed_curve25519"
  ]
}
http://localhost:8008/_matrix/client/r0/sync?filter=0&timeout=30000&since=s1_447_0_1_1_1_1_36_0_1
{
  "next_batch": "s1_450_0_1_1_1_1_36_0_1",
  "account_data": {
    "events": [
      {
        "type": "org.matrix.msc3890.local_notification_settings.ZSBKLRYBUA",
        "content": {
          "is_silenced": false
        }
      }
    ]
  },
  "device_one_time_keys_count": {
    "signed_curve25519": 50
  },
  "org.matrix.msc2732.device_unused_fallback_key_types": [
    "signed_curve25519"
  ],
  "device_unused_fallback_key_types": [
    "signed_curve25519"
  ]
}
http://localhost:8008/_matrix/client/r0/sync?filter=0&timeout=30000&since=s1_447_0_1_1_1_1_36_0_1
{
  "next_batch": "s1_450_0_1_1_1_1_36_0_1",
  "account_data": {
    "events": [
      {
        "type": "org.matrix.msc3890.local_notification_settings.ZSBKLRYBUA",
        "content": {
          "is_silenced": false
        }
      }
    ]
  },
  "device_one_time_keys_count": {
    "signed_curve25519": 50
  },
  "org.matrix.msc2732.device_unused_fallback_key_types": [
    "signed_curve25519"
  ],
  "device_unused_fallback_key_types": [
    "signed_curve25519"
  ]
}
@thebalaa
Three examples; notice they are all exactly the same:
The request URLs you have given us are also exactly the same. Synapse caches /sync responses so that it can avoid the work of recalculating them if the requests are identical due to e.g. a retransmission; it is not surprising that you get the same response to the same request!
The since token in your examples is s1_447_0_1_1_1_1_36_0_1, and the response's next_batch token, s1_450_0_1_1_1_1_36_0_1, is different. But Element Web did not use that next_batch token as the since parameter in its next request; I don't see any evidence of a problem in Synapse here.
(I note that the logs in the description and in https://github.com/matrix-org/synapse/issues/15824#issuecomment-1605249509 also repeat a request with the same since token.)
I would guess that either
- there are network connectivity problems, and Element Web isn't getting the response from Synapse, so it retries; or
- some bug in the client means it fails to process a sync response, so it does not advance to the next_batch token and tries again with the previous since token.
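The diagnosis above can be illustrated with a minimal long-polling loop: a well-behaved client must feed each response's next_batch back in as the since of the next request. This is only a sketch; `do_sync` is a hypothetical stand-in for the client's HTTP call, not Element Web's actual code.

```python
def do_sync(since):
    """Hypothetical stand-in for the client's HTTP call; a real client
    would GET /_matrix/client/r0/sync?timeout=30000&since=<since> and
    return the parsed JSON body."""
    return {"next_batch": "s1_451_0_1_1_1_1_36_0_1"}

def sync_loop(initial_since=None, iterations=3):
    since = initial_since
    for _ in range(iterations):
        response = do_sync(since)
        # The crucial step: advance the token. If a client bug skips
        # this (e.g. the response handler throws before storing
        # next_batch), the same request is re-sent, Synapse serves the
        # cached identical response, and the tightloop begins.
        since = response["next_batch"]
    return since
```

If the client retried with the old since token instead, the request URL and response would be identical every time, exactly as in the three examples quoted above.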
I have a similar issue on Synapse 1.88.0rc1 running locally in Docker, where sync (via Hydrogen) is tightlooping: the since of the sync request matches the next_batch of the returned response, so I don't think it's a client issue.
I noticed this yesterday and today. Turning off presence seems to stop the continuous sync. It's just a workaround, but I hope it helps find the root cause.
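For anyone trying the presence workaround mentioned above: per the Synapse configuration documentation, presence tracking can be disabled in homeserver.yaml with:

```yaml
# homeserver.yaml: disable presence tracking (a workaround, not a fix)
presence:
  enabled: false
```

Note that at least one report below says the loop eventually returned even with presence disabled, so this only masks the symptom.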
Same problem on docker.io/matrixdotorg/synapse:v1.92.1 with sqlite. I can reliably get it to fail when I trigger an exception in another part of the code.
The repro steps involve using a Docker/sqlite Synapse with MAS enabled and hooked up correctly. For some reason, sending messages into a room can fail:
ex-full-stack-docker-synapse-1 | 2023-09-20 13:48:18,624 - synapse.http.server - 140 - ERROR - PUT-62 - Failed handle request via 'RoomSendEventRestServlet': <XForwardedForRequest at 0xffffaab8c5d0 method='PUT' uri='/_matrix/client/v3/rooms/%21IitGfgiPurWkJFrNOJ:synapse/send/m.room.message/1' clientproto='HTTP/1.1' site='8008'>
ex-full-stack-docker-synapse-1 | Traceback (most recent call last):
ex-full-stack-docker-synapse-1 | File "/usr/local/lib/python3.11/site-packages/twisted/internet/defer.py", line 1697, in _inlineCallbacks
ex-full-stack-docker-synapse-1 | result = context.run(gen.send, result)
ex-full-stack-docker-synapse-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ex-full-stack-docker-synapse-1 | File "/usr/local/lib/python3.11/site-packages/synapse/rest/client/room.py", line 366, in _do
ex-full-stack-docker-synapse-1 | ) = await self.event_creation_handler.create_and_send_nonmember_event(
ex-full-stack-docker-synapse-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
ex-full-stack-docker-synapse-1 | File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 484, in _do_execute
ex-full-stack-docker-synapse-1 | return func(sql, *args, **kwargs)
ex-full-stack-docker-synapse-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^
ex-full-stack-docker-synapse-1 | sqlite3.IntegrityError: FOREIGN KEY constraint failed
ex-full-stack-docker-synapse-1 | 2023-09-20 13:48:18,628 - synapse.access.http.8008 - 465 - INFO - PUT-62 - ::ffff:172.21.0.1 - 8008 - {@bob:synapse} Processed request: 0.053sec/0.000sec (0.009sec, 0.000sec) (0.002sec/0.020sec/15) 55B 500 "PUT /_matrix/client/v3/rooms/%21IitGfgiPurWkJFrNOJ:synapse/send/m.room.message/1 HTTP/1.1" "curl/8.1.2" [0 dbevts]
When this happens, the next /sync request will always tightloop. The problem is that this line is always hit:
https://github.com/matrix-org/synapse/blob/v1.92.1/synapse/handlers/sync.py#L1993
...because this check always fires:
# _cache is not valid at or before the earliest known stream position, so
# return that the entity has changed.
if stream_pos <= self._earliest_known_stream_pos:
    self.metrics.inc_misses()
    return True
...because stream_pos == self._earliest_known_stream_pos. I'm guessing the exception causes some fun which means the increased stream position isn't being propagated, or something?
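The always-true branch can be modelled with a stripped-down cache. This is a toy sketch that mirrors the quoted check; `MiniStreamChangeCache` is not Synapse's real StreamChangeCache.

```python
class MiniStreamChangeCache:
    """Toy model of the has_entity_changed check quoted above."""

    def __init__(self, earliest_known_stream_pos):
        self._earliest_known_stream_pos = earliest_known_stream_pos
        self._entity_to_pos = {}  # entity -> stream pos of its last change

    def has_entity_changed(self, entity, stream_pos):
        # Mirrors the quoted snippet: at or before the earliest known
        # stream position the cache knows nothing, so it must
        # conservatively report "changed".
        if stream_pos <= self._earliest_known_stream_pos:
            return True
        changed_at = self._entity_to_pos.get(entity)
        return changed_at is not None and changed_at > stream_pos

# With stream_pos == _earliest_known_stream_pos, every check reports a
# change, so /sync returns immediately, the client retries with the
# same token, and the loop never ends.
cache = MiniStreamChangeCache(earliest_known_stream_pos=122)
print(cache.has_entity_changed("@travis:localhost", 122))  # True: the wedged case
```

In the real server a new event moves the stream position past the earliest known one, which is consistent with the observation below that creating another user (i.e. writing new data) unwedges the loop.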
FWIW, I see this without sending messages. MAS is not configured; sqlite is used, though.
It will be sitting completely idle for several minutes and then suddenly tightloop. No user interaction required.
@DMRobertson the sync token is not advanced by Synapse; there's a mistake in my example above. The query string is provided by Element, but Synapse returns (as next_batch) the same since token that was passed in the query string.
@kegsay we tried disabling presence, and that seemed to resolve the issue for a moment, but then it returned.
It will be sitting completely idle for several minutes and then suddenly tightloop. No user interaction required.
I had the same experience yesterday, but I can't reproduce it anymore:
- 1 container with Synapse on a sqlite backend and the default autogenerated config, plus URL preview enabled
- 1 container with Element Web
- 1 low-privileged user in a room with a few links
It suddenly started to loop extremely fast, even if I restarted the containers or refreshed the browser page, until I deactivated presence.
Hopefully https://github.com/matrix-org/synapse/pull/16540 fixes this
Sounds like it hasn't https://github.com/matrix-org/synapse/pull/16540#issuecomment-1775949595
@erikjohnston should #16540 be reverted?
should #16540 be reverted?
No, that fix is still correct, but it seems it is not enough to fix this.
Anything we can do to help move this along? It's quite problematic for us.
@thebalaa if you have an example request URL and response body of a /sync that would be useful
Here are two separate client responses from /sync:
From Element:
Request URL http://localhost:8008/_matrix/client/r0/sync?filter=0&timeout=30000&since=s385_307_0_11_70_1_1_19_0_1
Response
{
  "next_batch": "s385_307_0_11_70_1_1_19_0_1",
  "device_lists": {
    "changed": [
      "@admin:localhost"
    ]
  },
  "device_one_time_keys_count": {
    "signed_curve25519": 50
  },
  "org.matrix.msc2732.device_unused_fallback_key_types": [
    "signed_curve25519"
  ],
  "device_unused_fallback_key_types": [
    "signed_curve25519"
  ]
}
From a Python client (matrix-nio):
Request URL http://localhost:8008/_matrix/client/r0/sync?access_token=syt_YWRtaW4_KWzSqMuAsUEaRUsAQqcJ_1wY9BF&since=s385_312_0_11_70_1_1_19_0_1
Response
{
  "next_batch": "s385_312_0_11_70_1_1_19_0_1",
  "device_lists": {
    "changed": [
      "@admin:localhost"
    ]
  },
  "device_one_time_keys_count": {
    "signed_curve25519": 50
  },
  "org.matrix.msc2732.device_unused_fallback_key_types": [
    "signed_curve25519"
  ],
  "device_unused_fallback_key_types": [
    "signed_curve25519"
  ]
}
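From captured pairs like these, the wedged state can be confirmed mechanically by comparing the since query parameter against the returned next_batch. This is a small diagnostic sketch; `sync_is_wedged` is hypothetical and not part of Synapse or any client.

```python
from urllib.parse import parse_qs, urlparse

def sync_is_wedged(request_url, response_body):
    """Return True when next_batch equals the request's since token,
    i.e. the server reported no progress and an immediate retry with
    that token will produce an identical request."""
    since = parse_qs(urlparse(request_url).query).get("since", [None])[0]
    return since is not None and since == response_body.get("next_batch")

# The Element example above: since == next_batch, so the loop is wedged.
url = ("http://localhost:8008/_matrix/client/r0/sync"
       "?filter=0&timeout=30000&since=s385_307_0_11_70_1_1_19_0_1")
print(sync_is_wedged(url, {"next_batch": "s385_307_0_11_70_1_1_19_0_1"}))  # True
```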
I'm running Synapse version 1.96.0rc1 using Docker with the image matrixdotorg/synapse:develop, with an sqlite database.
Here's my homeserver.yaml; I generated it using the generate script in the container:
# Configuration file for Synapse.
#
# This is a YAML file: see [1] for a quick introduction. Note in particular
# that *indentation is important*: all the elements of a list or dictionary
# should have the same indentation.
#
# [1] https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html
#
# For more information on how to configure Synapse, including a complete accounting of
# each option, go to docs/usage/configuration/config_documentation.md or
# https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html
server_name: "localhost"
pid_file: /data/homeserver.pid
listeners:
  - port: 8008
    tls: false
    type: http
    x_forwarded: true
    resources:
      - names: [client, federation]
        compress: false
database:
  name: sqlite3
  args:
    database: /data/homeserver.db
log_config: "/data/localhost.log.config"
media_store_path: /data/media_store
registration_shared_secret: "NKeqBJ3@=aEpq@m6&tyg,7Mp.PHsS=SKH#0gJWVH*Nhig0H9nt"
report_stats: false
macaroon_secret_key: "mRL:x_0p*VuKb2ZAio,^h&YCY77ISYs.OKr*+PBM:E0XJz:ZzV"
form_secret: "gr1E3TptS,_v-UfE6vN+l*dgc6:eSk;V3S6ZYm6B:D30O.St-h"
signing_key_path: "/data/localhost.signing.key"
trusted_key_servers:
  - server_name: "matrix.org"
# vim:ft=yaml
I'm able to reproduce this issue by doing the following:
- Starting with a fresh Synapse container (no existing users; I simply generated the homeserver.yaml file and started the server), I created the @admin user using the register_new_matrix_user script inside the Synapse container.
- I also disabled rate limiting for that user (for something else I've been messing around with): INSERT INTO ratelimit_override VALUES ("@admin:localhost", 0, 0);. I found that command here: https://github.com/matrix-org/synapse/issues/6286#issuecomment-646944920
- I then logged in as the admin account in Element and typed a few messages in a room for a few seconds.
- I restarted the Synapse container.
- Once Synapse was back up, I refreshed my logged-in Element client. Usually the tight loop starts after the refresh; if it doesn't happen immediately, refresh a few more times. It usually happens pretty quickly.
I was messing around a little more and found that while my client's sync is in the tight loop, if I exec into the Synapse container and create another user with register_new_matrix_user, the tight loop immediately stops. Could it be something to do with me disabling rate limiting for that user? Or maybe something to do with only having a single user in the database? Not sure.
This is particularly frustrating in test environments, where it seems easily reproducible because traffic is low. I'm guessing actual deployments see this less because new events are somehow able to unwedge things. In my scenario, the foreign key constraint error will reliably reproduce this bug, but it's a convoluted setup as it involves MAS. I'll see if I can dig into this more this week.
I used the account and password from the Mjolnir configuration file to log in to Element Web, which caused a large number of never-ending _matrix/client/v3/sync requests in the console. 😂