alertmanager-irc-relay Fix Reconnecting Issues by Fixing Nick Ident and IRC PONGs

Previously, I have experienced reconnecting issues when running the alertmanager-irc-relay as an unauthorized IRC user against a ngIRCd as goirc's state tracking learns an invalid nick ident. This has resulted in an endless reconnection loop. Furthermore, it took round about fifteen minutes to detect a connection loss, as IRC PONGs were not counted. As my "production server" has a somewhat challenging uplink, the alertmanager-irc-relay's connection dies on a weekly basis.

This PR introduces both support to restore the original nick ident before attempting to reconnect and to establish an earlier reconnection based on the absence of PONGs. I had those changes running in "production" for round about a week and could now verify successfully reconnects after a short time. Furthermore, I have tested those changes on my own machine against the same IRC server with a simulated packet or connection loss.

More details are available in the commit messages:

   Restore IRC Ident on Reconnect

   After a connection loss on an IRC session with a ngIRCd, the
   alertmanager-irc-relay was unable to reconnect. After some debugging,
   the error's origin was the state tracking within the used goirc library.
   When using an unidentified session, ngIRCd prefixes the user's ident
   with a `~`. The state tracking registers this and keeps `~${NICK}` as
   the current and the new ident for future reconnects. However, `~` is not
   a valid char for the `<user>` part in the `USER` command, at least not
   for ngIRCd.

   To clarify this behaviour, take a look at the following log. First, the
   initial connection is begin established correctly. Keep an eye on the
   `USER` command being sent to the server.

   > http.go:132: INFO Starting HTTP server
   > irc.go:308: INFO Connected to IRC server, waiting to establish session
   > connection.go:543: DEBUG -> NICK alertbot
   > connection.go:543: DEBUG -> USER alertbot 12 * :Alertmanager IRC Relay
   > connection.go:474: DEBUG <- :__SERVER__ 001 alertbot :Welcome to the Internet Relay Network alertbot!~alertbot@__IP__

   Now, there was a network incident and the session needs to be recreated.

   > connection.go:466: ERROR irc.recv(): read tcp __REDACTED__: read: connection timed out
   > connection.go:577: INFO irc.Close(): Disconnected from server.
   > irc.go:150: INFO Disconnected from IRC
   > reconciler.go:129: INFO Channel #alerts monitor: context canceled while monitoring
   > irc.go:300: INFO Connecting to IRC __SERVER__
   > backoff.go:111: INFO Backoff for 0s starts
   > backoff.go:114: INFO Backoff for 0s ends
   > connection.go:390: INFO irc.Connect(): Connecting to __SERVER__.
   > irc.go:308: INFO Connected to IRC server, waiting to establish session
   > connection.go:543: DEBUG -> NICK alertbot
   > connection.go:543: DEBUG -> USER ~alertbot 12 * :Alertmanager IRC Relay
   > connection.go:474: DEBUG <- ERROR :Invalid user name
   > connection.go:577: INFO irc.Close(): Disconnected from server.
   > irc.go:150: INFO Disconnected from IRC
   > irc.go:319: WARN Receiving a session down before the session is up, this is odd

   This time, the used `user` part of the `USER` command has the prefixed
   `~` and fails. However, without using `-debug` and taking a very close
   look, this error can be missed very easy.

   As the new ident is invalid, the alertmanager-irc-relay is now stuck in
   an endless reconnection loop.

   This fix is kind of straight forward and just checks if the ident has
   changed before trying to reconnect. It might not be the prettiest
   solution, but recreating the whole *irc.Config resulted in other bugs as
   it was still referenced - even after being `Close`d.

    Detect connection loss through IRC PONG

    In the current state, the alertmanager-irc-relay already sends minutely
    IRC PINGs. This allows to check the IRC connection's health in protocol
    without having to deal with specific TCP settings. However, even when
    we are sending those PINGs, we don't process the server's PONGs or their
    absence.

    On one of my alertmanager-irc-relay instances, the time between the last
    received PONG and the TCP read to fail was round about fifteen minutes.
    All this time, the connection was already dead, but there was no attempt
    to reestablish it.

    The introduces changes keep book on the last received PONG and fails if
    there was no new PONG within twice the pingFrequencySecs time. When
    establishing a new connection during the SetupPhase, the current time
    will be set as the last PONG's time to reset the time comparison.

I have rebased my changes against the latest development in the master branch and have successfully verified that they are still working and not breaking anything else.

Oct 20 '22 16:10 oxzi

Thank you very much for your contribution and for fixing the bugs.

I'd like to make sure we do not accidentally reintroduce these issues in the future.

In order to do so, may I ask you to write a couple of tests in irc_test.go ?

Thank you!

Oct 24 '22 13:10 shammash

Thanks for your kind review.

I have tried created tests for both use cases and just force-pushed them:

TestReconnectNickIdentChange sends back a changed nick ident and checks how the client reacts.
TestReconnectMissingPong runs with a minimal PING interval of a second and just waits for the client to reconnect.
TestPingPong also runs with a minimal PING interval but supports PING and let the client and server send three PING/PONGs back and forth.

To be able to alter the PING interval, I have refactored the previously constant pingFrequencySecs into a field of the Config struct, as otherwise more in-depth changes were necessary just for those two tests.

Please feel free to take another look. Thanks.

Oct 24 '22 20:10 oxzi

Just force-pushed the second PONG-commit again to fix the TestReconnectMissingPong against a race between goirc's next PING and the PONG timeout when using an interval of only one second. Furthermore, replaced time.Now().Sub(…) by time.Since(…).

Oct 24 '22 21:10 oxzi

I reworked the closing of the IRC connection and re-run the tests successfully within the GitHub Actions, https://github.com/oxzi/alertmanager-irc-relay/actions/runs/3403900480.

However, I did some bulk tests - running multiple go test -v -count=32 and also -count=128 both on Go 1.15 and Go 1.17 - and detected some rare random failures on different tests. For example, on TestConnectErrorRetry, TestKickRejoin, and TestSendAlertDisconnected I experienced timeouts of ten minutes and on other tests sometimes the final QUIT :see ya message was missing. For reference, I also experienced the same timeouts on the current master branch without these changes.

@shammash: Would you like to take a look? Especially as, at least some of those test failures, are not related to my proposed changes.

On another note, I am running those changes on my systems since I opened the PR and it successfully survived outage on both sides, including one downtime and an OS upgrade on the IRC server's end.

Nov 06 '22 11:11 oxzi

This PR is also relevant to me. Thanks a lot to @oxzi; please consider merging.

Dec 22 '22 05:12 riotbib

@shammash is there something still missing? Read through the PR and it seems fine to me. Looks mergable.

Jan 24 '23 14:01 flxai

alertmanager-irc-relay alertmanager-irc-relay copied to clipboard

Fix Reconnecting Issues by Fixing Nick Ident and IRC PONGs

alertmanager-irc-relay
alertmanager-irc-relay copied to clipboard