Test timeouts in ci
I've had backend tests timeout multiple times today in the Rafiki GitHub Actions workflows:
https://github.com/interledger/rafiki/runs/5486418533?check_suite_focus=true
Output includes:
notice(message_bus): message queue for peer tigerbeetle.src.message_bus.union:392:19{ .replica = 0 } full, dropping message
The workflows appear to succeed on a retry.
Thanks @wilsonianb is this log output from the TB client?
Scrolling through the CI log it looks like replica 0's connection was not stable and was being reconnected a few times before the message queue full errors started popping up.
Would it be possible to get the log output from all replicas in the cluster?
I've pushed an updated log message to main (Zig 0.9.1) that will include the type of message being dropped for more visibility: https://github.com/coilhq/tigerbeetle/commit/1f07119a3df3ab590217273d868ad620aba99791
Scrolling through the CI log it looks like replica 0's connection was not stable and was being reconnected a few times before the message queue full errors started popping up.
If you're referring to the info(message_bus): connected to replica 0 log lines:
The tests repeatedly start/stop (init/deinit) the client; I think that's what you're seeing. This is part of how each test file sets up & tears down its environment (appContainer = await createTestApp(deps) and await appContainer.shutdown() respectively).
I've been unable to reproduce the issue in a few dozen attempts today, some of which with the tigerbeetle logging change: https://github.com/wilsonianb/rafiki/pull/84 :confused:
I'm encountering the issue again today https://github.com/wilsonianb/rafiki/actions @jorangreef It looks like Github Actions are disabled in the tigerbeetle repo, so there isn't a recent Docker image for Rafiki to use to update to the latest tigerbeetle-node (and use the updated log message). Could you try enabling Github Actions and/or triggering the docker workflow on the latest commit to main?
The updated log message is merged :point_up_2:
It looks like there's a combination of ping and request messages being dropped:
https://github.com/interledger/rafiki/runs/5872664693?check_suite_focus=true
info(message_bus): message queue for peer tigerbeetle.src.message_bus.union:384:19{ .replica = 0 } full, dropping request message
info(message_bus): message queue for peer tigerbeetle.src.message_bus.union:384:19{ .replica = 0 } full, dropping ping message
with more logging: https://github.com/wilsonianb/rafiki/runs/5874848460?check_suite_focus=true
The queued: log lines are from https://github.com/coilhq/tigerbeetle/blob/1a2863d2cfc3568efae18ef1749a28c21a245985/src/message_bus.zig#L604-L616.
The message bus connection's send queue is full of retries for the register request.
info(message_bus): message queue for peer tigerbeetle.src.message_bus.union:384:19{ .replica = 0 } full, dropping request message operation=register client=243897649607243345099344348854093531920 op=0 checksum=144046189412347373692193060048604389285 parent=0 context=0 connection.state=connecting
info(message_bus): queued: command=request operation=register client=243897649607243345099344348854093531920 op=0 checksum=144046189412347373692193060048604389285 parent=0 context=0
info(message_bus): queued: command=request operation=register client=243897649607243345099344348854093531920 op=0 checksum=144046189412347373692193060048604389285 parent=0 context=0