alertmanager-irc-relay
alertmanager-irc-relay copied to clipboard
Fix Reconnecting Issues by Fixing Nick Ident and IRC PONGs
Previously, I have experienced reconnecting issues when running the alertmanager-irc-relay as an unauthorized IRC user against a ngIRCd as goirc's state tracking learns an invalid nick ident. This has resulted in an endless reconnection loop. Furthermore, it took round about fifteen minutes to detect a connection loss, as IRC PONGs were not counted. As my "production server" has a somewhat challenging uplink, the alertmanager-irc-relay's connection dies on a weekly basis.
This PR introduces both support to restore the original nick ident before attempting to reconnect and to establish an earlier reconnection based on the absence of PONGs. I had those changes running in "production" for round about a week and could now verify successfully reconnects after a short time. Furthermore, I have tested those changes on my own machine against the same IRC server with a simulated packet or connection loss.
More details are available in the commit messages:
Restore IRC Ident on Reconnect
After a connection loss on an IRC session with a ngIRCd, the
alertmanager-irc-relay was unable to reconnect. After some debugging,
the error's origin was the state tracking within the used goirc library.
When using an unidentified session, ngIRCd prefixes the user's ident
with a `~`. The state tracking registers this and keeps `~${NICK}` as
the current and the new ident for future reconnects. However, `~` is not
a valid char for the `<user>` part in the `USER` command, at least not
for ngIRCd.
To clarify this behaviour, take a look at the following log. First, the
initial connection is begin established correctly. Keep an eye on the
`USER` command being sent to the server.
> http.go:132: INFO Starting HTTP server
> irc.go:308: INFO Connected to IRC server, waiting to establish session
> connection.go:543: DEBUG -> NICK alertbot
> connection.go:543: DEBUG -> USER alertbot 12 * :Alertmanager IRC Relay
> connection.go:474: DEBUG <- :__SERVER__ 001 alertbot :Welcome to the Internet Relay Network alertbot!~alertbot@__IP__
Now, there was a network incident and the session needs to be recreated.
> connection.go:466: ERROR irc.recv(): read tcp __REDACTED__: read: connection timed out
> connection.go:577: INFO irc.Close(): Disconnected from server.
> irc.go:150: INFO Disconnected from IRC
> reconciler.go:129: INFO Channel #alerts monitor: context canceled while monitoring
> irc.go:300: INFO Connecting to IRC __SERVER__
> backoff.go:111: INFO Backoff for 0s starts
> backoff.go:114: INFO Backoff for 0s ends
> connection.go:390: INFO irc.Connect(): Connecting to __SERVER__.
> irc.go:308: INFO Connected to IRC server, waiting to establish session
> connection.go:543: DEBUG -> NICK alertbot
> connection.go:543: DEBUG -> USER ~alertbot 12 * :Alertmanager IRC Relay
> connection.go:474: DEBUG <- ERROR :Invalid user name
> connection.go:577: INFO irc.Close(): Disconnected from server.
> irc.go:150: INFO Disconnected from IRC
> irc.go:319: WARN Receiving a session down before the session is up, this is odd
This time, the used `user` part of the `USER` command has the prefixed
`~` and fails. However, without using `-debug` and taking a very close
look, this error can be missed very easy.
As the new ident is invalid, the alertmanager-irc-relay is now stuck in
an endless reconnection loop.
This fix is kind of straight forward and just checks if the ident has
changed before trying to reconnect. It might not be the prettiest
solution, but recreating the whole *irc.Config resulted in other bugs as
it was still referenced - even after being `Close`d.
Detect connection loss through IRC PONG
In the current state, the alertmanager-irc-relay already sends minutely
IRC PINGs. This allows to check the IRC connection's health in protocol
without having to deal with specific TCP settings. However, even when
we are sending those PINGs, we don't process the server's PONGs or their
absence.
On one of my alertmanager-irc-relay instances, the time between the last
received PONG and the TCP read to fail was round about fifteen minutes.
All this time, the connection was already dead, but there was no attempt
to reestablish it.
The introduces changes keep book on the last received PONG and fails if
there was no new PONG within twice the pingFrequencySecs time. When
establishing a new connection during the SetupPhase, the current time
will be set as the last PONG's time to reset the time comparison.
I have rebased my changes against the latest development in the master
branch and have successfully verified that they are still working and not breaking anything else.
Thank you very much for your contribution and for fixing the bugs.
I'd like to make sure we do not accidentally reintroduce these issues in the future.
In order to do so, may I ask you to write a couple of tests in irc_test.go ?
Thank you!
Thanks for your kind review.
I have tried created tests for both use cases and just force-pushed them:
-
TestReconnectNickIdentChange
sends back a changed nick ident and checks how the client reacts. -
TestReconnectMissingPong
runs with a minimal PING interval of a second and just waits for the client to reconnect. -
TestPingPong
also runs with a minimal PING interval but supports PING and let the client and server send three PING/PONGs back and forth.
To be able to alter the PING interval, I have refactored the previously constant pingFrequencySecs
into a field of the Config
struct, as otherwise more in-depth changes were necessary just for those two tests.
Please feel free to take another look. Thanks.
Just force-pushed the second PONG-commit again to fix the TestReconnectMissingPong
against a race between goirc's next PING and the PONG timeout when using an interval of only one second. Furthermore, replaced time.Now().Sub(…)
by time.Since(…)
.
I reworked the closing of the IRC connection and re-run the tests successfully within the GitHub Actions, https://github.com/oxzi/alertmanager-irc-relay/actions/runs/3403900480.
However, I did some bulk tests - running multiple go test -v -count=32
and also -count=128
both on Go 1.15 and Go 1.17 - and detected some rare random failures on different tests. For example, on TestConnectErrorRetry
, TestKickRejoin
, and TestSendAlertDisconnected
I experienced timeouts of ten minutes and on other tests sometimes the final QUIT :see ya
message was missing. For reference, I also experienced the same timeouts on the current master
branch without these changes.
@shammash: Would you like to take a look? Especially as, at least some of those test failures, are not related to my proposed changes.
On another note, I am running those changes on my systems since I opened the PR and it successfully survived outage on both sides, including one downtime and an OS upgrade on the IRC server's end.
This PR is also relevant to me. Thanks a lot to @oxzi; please consider merging.
@shammash is there something still missing? Read through the PR and it seems fine to me. Looks mergable.