Dendrite syncing broken after upgrade to 0.8.8
Background information
- Dendrite version or git SHA: 0.8.8 (though it happens on 0.8.9 too)
- Monolith or Polylith?: Monolith
- SQLite3 or Postgres?: Postgres
- Running in Docker?: No
- Go version: go 1.17.18
- Client used (if applicable): nheko, FluffyChat and Element Web
Description
- What is the problem: Rooms stopped updating and I haven't been able to send new events to any room
- Who is affected: everybody on my own server
- How is this bug manifesting: all rooms appear frozen since the update, all sent messages disappear after a while, there are no useful errors in the logs (everything seems to work fine)
- When did this first appear: after upgrading to 0.8.8
Steps to reproduce
- Update to 0.8.8
- ?????
- Dendrite syncing does not work anymore
I would write to the Dendrite support room, but with my homeserver not working it is very hard to do so. This is not a "pls fix" request but more of a "hey, how do I troubleshoot this?"
When rebooting the server, it looks like a lot of messages are stuck in DendriteOutputRoomEvent:
[23931] [INF]
[23931] [INF] ---------------- JETSTREAM ----------------
[23931] [INF] Max Memory: 23.00 GB
[23931] [INF] Max Storage: 7.24 GB
[23931] [INF] Store Directory: "jetstream"
[23931] [INF] -------------------------------------------
[23931] [INF] Restored 0 messages for stream '$G > DendriteInputDeviceListUpdate'
[23931] [INF] Restored 0 messages for stream '$G > DendriteInputRoomEvent'
[23931] [INF] Restored 0 messages for stream '$G > DendriteOutputClientData'
[23931] [INF] Restored 0 messages for stream '$G > DendriteOutputKeyChangeEvent'
[23931] [INF] Restored 0 messages for stream '$G > DendriteOutputNotificationData'
[23931] [INF] Restored 0 messages for stream '$G > DendriteOutputReadUpdate'
[23931] [INF] Restored 0 messages for stream '$G > DendriteOutputReceiptEvent'
[23931] [INF] Restored 14,609 messages for stream '$G > DendriteOutputRoomEvent'
[23931] [INF] Restored 0 messages for stream '$G > DendriteOutputSendToDeviceEvent'
[23931] [INF] Restored 0 messages for stream '$G > DendriteOutputStreamEvent'
[23931] [INF] Recovering 1 consumers for stream - '$G > DendriteInputDeviceListUpdate'
[23931] [INF] Recovering 21 consumers for stream - '$G > DendriteInputRoomEvent'
[23931] [INF] Recovering 1 consumers for stream - '$G > DendriteOutputClientData'
[23931] [INF] Recovering 2 consumers for stream - '$G > DendriteOutputKeyChangeEvent'
[23931] [INF] Recovering 1 consumers for stream - '$G > DendriteOutputNotificationData'
[23931] [INF] Recovering 1 consumers for stream - '$G > DendriteOutputReadUpdate'
[23931] [INF] Recovering 2 consumers for stream - '$G > DendriteOutputReceiptEvent'
[23931] [INF] Recovering 2 consumers for stream - '$G > DendriteOutputRoomEvent'
[23931] [INF] Recovering 2 consumers for stream - '$G > DendriteOutputSendToDeviceEvent'
[23931] [INF] Recovering 1 consumers for stream - '$G > DendriteOutputStreamEvent'
[23931] [INF] Server is ready
Have you tried moving the jetstream directory out of the way before a restart (jetstream/$G/streams/)? It sometimes seems to get stuck on an upgrade...
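For anyone following along, a minimal sketch of what "moving it out of the way" could look like, assuming a systemd-managed monolith and the relative "jetstream" storage directory shown in the log above (the unit name and paths are assumptions, adjust them to your setup):

```sh
# Sketch only: stop Dendrite, move the JetStream streams aside, restart.
# "dendrite" as the systemd unit name and the relative path are assumptions.
sudo systemctl stop dendrite
mv 'jetstream/$G/streams' "jetstream-streams.bak.$(date +%F)"   # $G is a literal directory name
sudo systemctl start dendrite
```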
Yes, I have tried that with no success. But doing that surfaces some more log entries which wouldn't otherwise appear:
- failed to updated presence for user error="sql: converting argument $4 type: uint64 values with high bit set are not supported" presence=offline user="@xxxx:xxxxx.xx"
- ERRO[2022-07-12T08:16:18.134230617Z] Failed to manually update device lists for user error="DeviceKeysForUser @xxxx:xxx.xx [] failed: context canceled" server=xxx.xx user_id="@xxxx:xxx.xx"
- 2022/07/12 08:17:04 Transport: unhandled response frame type *http.http2UnknownFrame
Can you please enable profiling and then take a goroutine profile after trying to send some messages into a room?
The profile URL will be like this: http://localhost:65432/debug/pprof/goroutine?debug=1
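For anyone unfamiliar with pprof, a rough sketch of how that capture could look, assuming profiling is enabled via the PPROFLISTEN environment variable and the monolith binary is launched by hand (binary name, config path and port are assumptions taken from the thread and may differ in your setup):

```sh
# Sketch: start the monolith with the pprof listener enabled
# (PPROFLISTEN as the profiling toggle is an assumption here).
PPROFLISTEN=localhost:65432 ./dendrite-monolith-server -config dendrite.yaml &

# ...send a few messages into a room from a client, then grab the profile:
curl -o goroutine.txt 'http://localhost:65432/debug/pprof/goroutine?debug=1'
```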
It seems I had the same problem after a power shutdown. Removing the jetstream folder fixed it (thanks @spaetz), but I lost all unbridged messages.
By the way, when using ls I don't have /jetstream/$G/streams but instead an endless list of subdirectories named $G/$G/$G...
du shows the right tree.
@neilalexander profile here profile.txt
Can you please also capture a http://localhost:65432/debug/pprof/goroutine?debug=2 (that's debug=2 instead of debug=1)?
Ideally without restarting the process if you can, since that'll nuke possibly valuable information from the trace.
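In other words, while the stuck process is still running, something along these lines (the output file name is arbitrary):

```sh
# Same endpoint as before, just with debug=2 for full goroutine stacks.
curl -o goroutine-debug2.txt 'http://localhost:65432/debug/pprof/goroutine?debug=2'
```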
Here it is: https://paste.debian.net/1246946/
Right, my suspicion is that there's a problem with the database connection pool. It looks like Dendrite is stuck waiting for database connections, which probably explains why it isn't progressing.
Can you please tell me whether you have configured database connections separately per component, or configured them once in the global.database section of the config?
What values do you have for max_open_conns, max_idle_conns and conn_max_lifetime?
I have configured database connections separately (though they all point to the same database).
The limits are these:
# cat dendrite.yaml | grep max_open
# The "max_open_conns" and "max_idle_conns" settings configure the maximum
max_open_conns: 10
max_open_conns: 10
max_open_conns: 10
max_open_conns: 10
max_open_conns: 10
max_open_conns: 10
max_open_conns: 10
max_open_conns: 10
root@dendrite:~/dendrite# cat dendrite.yaml | grep max_idle
# The "max_open_conns" and "max_idle_conns" settings configure the maximum
max_idle_conns: 2
max_idle_conns: 2
max_idle_conns: 2
max_idle_conns: 2
max_idle_conns: 2
max_idle_conns: 2
max_idle_conns: 2
max_idle_conns: 2
root@dendrite:~/dendrite# cat dendrite.yaml | grep conn_max_l
# "conn_max_lifetime" option controls the maximum length of time a database
conn_max_lifetime: -1
conn_max_lifetime: -1
conn_max_lifetime: -1
conn_max_lifetime: -1
conn_max_lifetime: -1
conn_max_lifetime: -1
conn_max_lifetime: -1
conn_max_lifetime: -1
Hmm, that all looks reasonable... on a second look, the database connections are probably waiting normally (I was looking at sql.go from the wrong Go version, so the line numbers differed).
Can I ask you to try the latest commit 55d81beab on the neilalexander/nats290 branch which updates the NATS Server to a newer version, just to rule that out?
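If it helps, a sketch of how that checkout and rebuild could look, assuming your working copy's 'origin' points at the main Dendrite repository and the monolith is built from cmd/dendrite-monolith-server (both assumptions):

```sh
# Sketch: fetch the suggested branch and build at the specific commit.
git fetch origin neilalexander/nats290
git checkout 55d81beab
go build -o dendrite-monolith-server ./cmd/dendrite-monolith-server
```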
Looks like nothing changed. After some time of running, I still have the exact same number of messages in the JetStream queue, and everything is stuck as before.
I occasionally run into this error too. The classic symptom is that messages seem to send correctly in Element, but they never reach their destination. When Element is closed and reopened, a "some recent messages were not sent" warning pops up.
It unstuck itself for some days, but since yesterday it is stuck again.
I am using Dendrite 0.9.1 now.
The log is full of "Failed to manually update device lists for user".