matrix-appservice-irc
The Libera bridge is not monitored
In the last three months, the Libera bridge has experienced three outages affecting all channels (#1601, #1628, #1633).
There should probably be some automated monitoring to detect this kind of outage, rather than relying on users noticing and reporting them.
For example, a bot with its own private channel / portal room could send a message on both Matrix and IRC and check whether it was received on the other end within 5 seconds (or whatever delay is considered acceptable), then report failures to the appropriate channels, such as https://status.matrix.org/ (which currently shows 100.0% uptime for the Libera bridge over the last three months).
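To make the suggestion concrete, here is a rough sketch of the bookkeeping such a bot would need. It is only an illustration: `RoundTripProbe`, its method names, and the direction labels are hypothetical, and the actual sending and receiving on Matrix and IRC is left out entirely. The 5 second default is the delay mentioned above.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class RoundTripProbe:
    """Tracks probe messages sent on one side of the bridge and awaited on the other.

    Hypothetical helper: the caller is responsible for actually sending the
    token on Matrix or IRC and for calling received() when it shows up.
    """
    timeout_s: float = 5.0
    pending: dict = field(default_factory=dict)  # token -> (direction, sent_at)

    def sent(self, direction, now=None):
        """Record a probe sent in `direction`; return the token to embed in the message."""
        token = uuid.uuid4().hex
        sent_at = time.monotonic() if now is None else now
        self.pending[token] = (direction, sent_at)
        return token

    def received(self, token, now=None):
        """Mark a probe as seen on the far side; return its latency, or None if unknown."""
        entry = self.pending.pop(token, None)
        if entry is None:
            return None
        _, sent_at = entry
        return (time.monotonic() if now is None else now) - sent_at

    def overdue(self, now=None):
        """Return the directions that have at least one probe older than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            {d for d, sent_at in self.pending.values() if now - sent_at > self.timeout_s}
        )
```

A bot would call `sent("matrix->irc")` before posting the token in the portal room, call `received(token)` when the IRC side sees it, and periodically report whatever `overdue()` returns.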
We have automated monitoring, but the types of failures we have seen do not fit the models we expect. The status.matrix.org page is manually updated at the moment, and I think we could do better at updating it as fires happen.
I believe this situation has now improved. We've got monitoring on the bridge for:
- Total outages of the process
- Increased waves of connections (indicating spam or dangerous behaviours)
- An unexpected number of clients stuck in the "connecting" state
This doesn't cover everything. Our next objectives are to track the number of dropped messages and, ideally, to add some sort of end-to-end monitoring to see where messages are going missing.
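For readers wondering what the "connecting" check in the list above might look like, here is a minimal sketch. It is not the bridge's actual implementation: `ClientState`, `stuck_connecting`, `should_alert`, and both thresholds are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientState:
    """Hypothetical snapshot of one bridged IRC client."""
    nick: str
    state: str    # e.g. "connecting" or "connected"
    since: float  # timestamp (seconds) at which the client entered this state


def stuck_connecting(clients, now, max_connecting_s=60.0):
    """Return nicks of clients that have been "connecting" for longer than the limit."""
    return [
        c.nick
        for c in clients
        if c.state == "connecting" and now - c.since > max_connecting_s
    ]


def should_alert(clients, now, max_connecting_s=60.0, max_stuck=10):
    """Alert when more than `max_stuck` clients are stuck (thresholds are assumed)."""
    return len(stuck_connecting(clients, now, max_connecting_s)) > max_stuck
```

The thresholds would need tuning against normal reconnect behaviour, since a netsplit or restart legitimately puts many clients into "connecting" for a short while.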
There are currently a number of outages reported on https://github.com/matrix-org/libera-chat/issues, including clients stuck reconnecting (https://github.com/matrix-org/libera-chat/issues/12, https://github.com/matrix-org/libera-chat/issues/24), but status.matrix.org shows 100% green.
Lots of puppets can't currently join the IRC side… but status says "green".