
API to validate config before saving

Open agnello-noronha opened this issue 3 months ago • 15 comments

Provide an API that runs amtool check-config. This would help validate a config before saving or reloading it.

agnello-noronha avatar Oct 14 '25 09:10 agnello-noronha

You only mean to check the general syntax, correct? Does this program solve your problem?

package main

import (
	"fmt"
	"os"

	"github.com/alecthomas/kingpin/v2"
	"github.com/prometheus/alertmanager/config"
)

func main() {
	configFile := kingpin.Flag("config.file", "Alertmanager configuration file name.").Default("alertmanager.yml").String()

	kingpin.Parse()

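	// LoadFile parses the file and returns an error if the config is rejected.
	if _, err := config.LoadFile(*configFile); err != nil {
		fmt.Fprintf(os.Stderr, "ERROR: %s\n", err)
		os.Exit(78) // 78 is EX_CONFIG in sysexits.h
	}
	fmt.Println("OK.")
}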

SoloJacobs avatar Oct 24 '25 20:10 SoloJacobs

Hi, no, not just general syntax. We want exactly the same validations that amtool performs.

If an API were available where I pass a YAML string and it runs the equivalent of "amtool check-config" on it, that would be very helpful for us.

zeroEntropyy avatar Nov 06 '25 06:11 zeroEntropyy

Ok, so this means you want to take the existing amtool check-config command and expose it as an endpoint?

SoloJacobs avatar Nov 09 '25 21:11 SoloJacobs

@SoloJacobs Yes, that would be great. A check before committing changes to Alertmanager would be helpful.

agnello-noronha avatar Nov 10 '25 03:11 agnello-noronha

What's wrong with a pipeline step which runs amtool check-config in your CI/CD pipeline that ships the config change?

TheMeier avatar Nov 11 '25 18:11 TheMeier

We deploy Alertmanager with a Helm chart and update the ConfigMap. Alertmanager will check and reject the config, but upon restart it will go into a CrashLoopBackOff state because of the faulty config in the ConfigMap.

The API we requested would help validate this config before the ConfigMap is updated.

agnello-noronha avatar Nov 12 '25 07:11 agnello-noronha

I also use Helm and the Prometheus Operator. My deployment pipeline has a stage before applying the config which renders the config and runs amtool check-config. The result is then used in a later stage to deploy the k8s secret with that config, which is consumed by the Alertmanager resource.

In any case you need to run some command to do the validations. Why is posting your config to an HTTP API for validation better or easier than calling amtool?

TheMeier avatar Nov 12 '25 08:11 TheMeier

We have a separate service that manages the config. We have to fail early and show the error when a user tries to save the config. Since Alertmanager runs as a different service, we need an API to validate the config before pushing it to the ConfigMap.

agnello-noronha avatar Nov 12 '25 14:11 agnello-noronha

To be honest, the issue is still not so clear to me. If I read this correctly, the service you use to manage the ConfigMap would use an Alertmanager endpoint to verify the configuration. But what is special about this service that it could not use amtool?

Ideally, I would need some kind of minimal setup so I can play around with the problem on my machine. But before you invest the effort: I currently feel that I have to prioritize some other issues, and therefore can't promise that this feature will ever be completed.

SoloJacobs avatar Nov 12 '25 20:11 SoloJacobs

Say I have a Java or Python service: how do we run amtool? Are you suggesting we bundle amtool with the service and run the tool from there?

agnello-noronha avatar Nov 13 '25 04:11 agnello-noronha

We have exposed an API through which users can update the Alertmanager config.

We do this by directly updating the Kubernetes ConfigMap at runtime. Once the ConfigMap is updated, amtool picks up those changes and runs validation on them. If validation fails and Alertmanager restarts, the pod goes into CrashLoopBackOff.

To avoid this, we currently bundle the amtool binary and execute the validation commands using ProcessBuilder in Java.

If Alertmanager exposed an API for amtool validation, we could skip the ProcessBuilder step, which is susceptible to timeouts and other command-line issues.

zeroEntropyy avatar Nov 13 '25 06:11 zeroEntropyy

I'm sorry to say that this feature does not fit well with the project as a whole, and thus won't be implemented. This decision was made during the bug scrub and it was unanimous.

Here are some suggestions that I would explore if I were in your position:

  • Alertmanager offers a reload option
curl -X POST http://<alertmanager-host>:<port>/-/reload

This could make it easier to detect whether a faulty configuration was deployed.

  • Fork alertmanager or amtool and extend it with the functionality you need.
  • I still think packaging amtool with your python or java service is the correct way to approach this.

Kind regards

SoloJacobs avatar Nov 16 '25 18:11 SoloJacobs

@SoloJacobs We have tried all the available solutions! I understand your concern, but this use case would really help our project. Can I try raising a pull request? If you would be willing to accept it, I will try to solve this.

zeroEntropyy avatar Nov 16 '25 18:11 zeroEntropyy

As I already mentioned, I did not make this decision by myself. You would have to convince a number of people. The implementation is not the main concern:

You have to give an explanation of why the avenues I have provided are not viable. This explanation needs to make sense to someone outside your company (like me).

That being said, providing an implementation will certainly help your chances overall. Just be aware that it is still likely to be declined.

SoloJacobs avatar Nov 16 '25 18:11 SoloJacobs

Yes, I understand the concerns that come with a big project like Alertmanager. I will try my best to raise a pull request that addresses this use case as soon as possible, and I hope you will pick it up in one of your upcoming releases.

I will try to raise a pull request within a month, can you leave the issue open till then please?

You can close the issue if there is no progress even after a month.

zeroEntropyy avatar Nov 16 '25 18:11 zeroEntropyy