alertmanager icon indicating copy to clipboard operation
alertmanager copied to clipboard

Alertmanager does not read telegram token file until there is an alert to send.

Open NorfairKing opened this issue 5 months ago • 7 comments

I had an outage last night and did not get an alert because:

Jul 17 08:15:03 nixos alertmanager[233766]: time=2025-07-17T08:15:03.711Z level=ERROR source=dispatch.go:360 msg="Notify for alerts failed" component=dispatcher num_alerts=2 err="telegram/telegram[0]: notify retry canceled after 11 attempts: could not read /run/agenix/telegram-bot-token: open /run/agenix/telegram-bot-token: permission denied"

Alertmanager could try to read that file when starting instead of when there is an alert to send.

NorfairKing avatar Jul 17 '25 08:07 NorfairKing

I agree, this is a valid concern. I was looking through the code and had an idea for how we might be able to address this. Perhaps we could add validation logic directly into the New constructor function for each notifier? This would allow Alertmanager to check the configuration on startup and fail immediately if a notifier is set up incorrectly, providing much faster feedback.

From what I can tell, the initialization flows through the following path:

  1. During startup, the main function calls a process that uses BuildReceiverIntegrations to set up notifiers: https://github.com/prometheus/alertmanager/blob/c653800ffe4af31f4c3f9d81708301f3dd5600fa/cmd/alertmanager/main.go#L440

  2. BuildReceiverIntegrations then calls the specific New constructor for each configured integration, like for Telegram here: https://github.com/prometheus/alertmanager/blob/c653800ffe4af31f4c3f9d81708301f3dd5600fa/config/receiver/receiver.go#L46

  3. Which leads to the constructor itself: https://github.com/prometheus/alertmanager/blob/c653800ffe4af31f4c3f9d81708301f3dd5600fa/notify/telegram/telegram.go#L46

Adding validation inside these New methods seems like a pattern that could be applied consistently across all notifiers.

What do the maintainers think of this approach? Happy to hear other thoughts or suggestions.

pehlicd avatar Jul 17 '25 12:07 pehlicd

@pehlicd We can go even further and validate that the credentials are valid according to telegram as well: https://telegram-bot-sdk.readme.io/reference/getme

NorfairKing avatar Jul 17 '25 12:07 NorfairKing

This to me feels like something that could go in the amtool check-config command?

grobinson-grafana avatar Jul 20 '25 10:07 grobinson-grafana

@grobinson-grafana Maybe ? I think that command is currently pure (i.e. no network or filesystem access) and maybe we want to keep it that way?

NorfairKing avatar Jul 20 '25 11:07 NorfairKing

Just one thing to keep in mind is just because file existence is checked at startup time doesn't mean it will still exist at time of use. For example configuration management might have deleted the file to rewrite it but failed. So the file was validated on startup, but the notification still fails because the file was deleted after startup (could be weeks after startup).

For this to truly be robust you will not only want to validate the file exists at startup time but also read it into memory at that time too?

grobinson-grafana avatar Jul 20 '25 14:07 grobinson-grafana

Alertmanager will also reload the configuration when asked. That documentation notes that it will validate the configuration and refuse to reload the configuration if it is not valid. I would assume that we want the same validation to span both startup and reload checks.

sysadmind avatar Jul 21 '25 14:07 sysadmind

So the file was validated on startup, but the notification still fails because the file was deleted after startup (could be weeks after startup).

I think if we read it at startup we should then keep the value in memory inside the configured object, not re-read it for each notification.

ultrotter avatar Nov 14 '25 13:11 ultrotter