dendrite icon indicating copy to clipboard operation
dendrite copied to clipboard

Dendrite syncing broken after upgrade to 0.8.8

Open edocod1 opened this issue 3 years ago • 14 comments

Background information

  • Dendrite version or git SHA: 0.8.8 (though it happens on 0.8.9 too)
  • Monolith or Polylith?: Monolith
  • SQLite3 or Postgres?: Postgres
  • Running in Docker?: No
  • go version: go 1.17.18
  • Client used (if applicable): nheko, FluffyChat and element web

Description

  • What is the problem: Rooms stopped updating and i haven't been able to send new events to any room
  • Who is affected: everybody on my own server
  • How is this bug manifesting: all rooms appear frozen since the update, all sent messages disappear after a while, there are no useful errors in the logs (everything seems to work fine)
  • When did this first appear: after upgrading to 0.8.8

Steps to reproduce

  • Update to 0.8.8
  • ?????
  • Dendrite syncing does not work anymore

I would write to the dendrite support room, but with my homeserver not working it is very hard to do so. This is not a "pls fix" request but more of a "hey how do i troubleshoot this?"

edocod1 avatar Jul 12 '22 07:07 edocod1

When rebooting the server, it looks like a lot of messages are stuck in DendriteRoomOutputEvent


[23931] [INF] 
[23931] [INF] ---------------- JETSTREAM ----------------
[23931] [INF]   Max Memory:      23.00 GB
[23931] [INF]   Max Storage:     7.24 GB
[23931] [INF]   Store Directory: "jetstream"
[23931] [INF] -------------------------------------------
[23931] [INF]   Restored 0 messages for stream '$G > DendriteInputDeviceListUpdate'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteInputRoomEvent'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteOutputClientData'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteOutputKeyChangeEvent'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteOutputNotificationData'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteOutputReadUpdate'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteOutputReceiptEvent'
[23931] [INF]   Restored 14,609 messages for stream '$G > DendriteOutputRoomEvent'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteOutputSendToDeviceEvent'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteOutputStreamEvent'
[23931] [INF]   Recovering 1 consumers for stream - '$G > DendriteInputDeviceListUpdate'
[23931] [INF]   Recovering 21 consumers for stream - '$G > DendriteInputRoomEvent'
[23931] [INF]   Recovering 1 consumers for stream - '$G > DendriteOutputClientData'
[23931] [INF]   Recovering 2 consumers for stream - '$G > DendriteOutputKeyChangeEvent'
[23931] [INF]   Recovering 1 consumers for stream - '$G > DendriteOutputNotificationData'
[23931] [INF]   Recovering 1 consumers for stream - '$G > DendriteOutputReadUpdate'
[23931] [INF]   Recovering 2 consumers for stream - '$G > DendriteOutputReceiptEvent'
[23931] [INF]   Recovering 2 consumers for stream - '$G > DendriteOutputRoomEvent'
[23931] [INF]   Recovering 2 consumers for stream - '$G > DendriteOutputSendToDeviceEvent'
[23931] [INF]   Recovering 1 consumers for stream - '$G > DendriteOutputStreamEvent'
[23931] [INF] Server is ready

edocod1 avatar Jul 12 '22 07:07 edocod1

Have you tried moving the jetstream directory out of the way before a restart, (jetstream/$G/streams/)? It sometimes seems to get stuck on an upgrade...

spaetz avatar Jul 12 '22 08:07 spaetz

Yes, i have tried that with no success. But doing that unlocks some more log entries which wouldn't otherwise appear:

  • failed to updated presence for user error="sql: converting argument $4 type: uint64 values with high bit set are not supported" presence=offline user="@xxxx:xxxxx.xx"
  • ERRO[2022-07-12T08:16:18.134230617Z] Failed to manually update device lists for user error="DeviceKeysForUser @xxxx:xxx.xx [] failed: context canceled" server=xxx.xx user_id="@xxxx:xxx.xx"
  • 2022/07/12 08:17:04 Transport: unhandled response frame type *http.http2UnknownFrame

edocod1 avatar Jul 12 '22 08:07 edocod1

Can you please enable profiling and then take a goroutine profile after trying to send some messages into a room?

The profile URL will be like this: http://localhost:65432/debug/pprof/goroutine?debug=1

neilalexander avatar Jul 12 '22 08:07 neilalexander

It seems I had the same problem after a power shutdown. Removing the jetstream folder fixed it thanks @spaetz , but I lost all unbridged messages.

By the way, when using ls I don't have /jetstream/$G/streams but instead an endless list of subdirectories named $G/$G/$G... du shows the right tree.

Notmarrco avatar Jul 12 '22 08:07 Notmarrco

@neilalexander profile here profile.txt

edocod1 avatar Jul 12 '22 09:07 edocod1

Can you please also capture a http://localhost:65432/debug/pprof/goroutine?debug=2 (that's debug=2 instead of debug=1)?

Ideally without restarting the process if you can, since that'll nuke possibly valuable information from the trace.

neilalexander avatar Jul 12 '22 09:07 neilalexander

https://paste.debian.net/1246946/

Here it is

edocod1 avatar Jul 12 '22 09:07 edocod1

Right, my suspicion is that there's a problem with the database connection pool. It looks like Dendrite is stuck waiting for database connections, which probably explains why it isn't progressing.

Can you please tell me whether or not you have configured database connections separately per-component, or did you configure it once in the global.database section of the config?

What values do you have for max_open_conns, max_idle_conns and conn_max_lifetime?

neilalexander avatar Jul 12 '22 09:07 neilalexander

I have configured database connections separately (though they all point to the same db)

Limits are these

# cat dendrite.yaml | grep max_open
# The "max_open_conns" and "max_idle_conns" settings configure the maximum
    max_open_conns: 10
    max_open_conns: 10
    max_open_conns: 10
    max_open_conns: 10
    max_open_conns: 10
    max_open_conns: 10
    max_open_conns: 10
    max_open_conns: 10
root@dendrite:~/dendrite# cat dendrite.yaml | grep max_idle
# The "max_open_conns" and "max_idle_conns" settings configure the maximum
    max_idle_conns: 2
    max_idle_conns: 2
    max_idle_conns: 2
    max_idle_conns: 2
    max_idle_conns: 2
    max_idle_conns: 2
    max_idle_conns: 2
    max_idle_conns: 2
root@dendrite:~/dendrite# cat dendrite.yaml | grep conn_max_l
# "conn_max_lifetime" option controls the maximum length of time a database
    conn_max_lifetime: -1
    conn_max_lifetime: -1
    conn_max_lifetime: -1
    conn_max_lifetime: -1
    conn_max_lifetime: -1
    conn_max_lifetime: -1
    conn_max_lifetime: -1
    conn_max_lifetime: -1

edocod1 avatar Jul 12 '22 09:07 edocod1

Hmm, that all looks reasonable... on a second look, the database connections are probably waiting normally (was looking at sql.go from the wrong Go version and line numbers differed).

Can I ask you to try the latest commit 55d81beab on the neilalexander/nats290 branch which updates the NATS Server to a newer version, just to rule that out?

neilalexander avatar Jul 12 '22 09:07 neilalexander

Looks like nothing changed. After some time of running, i still have the exact number of messages in the jetstream queue, and everything is stuck as before.

edocod1 avatar Jul 12 '22 09:07 edocod1

I occasionally run into this error too. The classic symptom is that messages seem to send correctly in Element, but they never get to their destination. When Element is closed and reopened, "some recent messages were not sent" pops out.

nullchinchilla avatar Jul 25 '22 14:07 nullchinchilla

It unlocked for some days, since yesterday it is locked again.

Using dendrite 0.9.1 now

Log is full of "Failed do manually update device lists for user"

edocod1 avatar Aug 08 '22 22:08 edocod1