Draft Proposal: Write down specifications which will be used for acceptance tests
Alertmanager lacks any form of "official specification". Its existing behaviours fall into these categories:
- documented in the docs
- tested in unit tests
- tested in acceptance tests
- not tested
In some cases it is hard to understand the motivation behind a specific piece of logic, because it is not well documented and the commit messages lack context.
Such behaviours fall into a bucket where we are not sure whether they are incidental or intentional, and therefore whether they can be dropped or must be supported.
One example is that the Aggregation Group timer is reset to zero when an old alert arrives: https://github.com/prometheus/alertmanager/blob/80d0265e16874ab0faf7c4de83cd8e33ac03f23e/dispatch/dispatch.go#L499-L501
(This logic was introduced before clustering).
Should this logic be kept or removed?
Proposal
Start writing down specifications which can then be used to generate acceptance tests. Each component of Alertmanager will have a specification which it should satisfy. The application as a whole and the cluster will also have specifications. The specifications can evolve over time to support more features, or to deprecate and drop unused or incidental behaviour.
There are different solutions to achieve this, but one good example is https://cucumber.io/, which also supports Go: https://github.com/cucumber/godog
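To make the idea concrete, here is a rough sketch of what an executable specification for the timer example above could look like. It is only an illustration: it uses the Go standard library rather than godog's actual API, and the `group` type, the step wording, and `runScenario` are all hypothetical names that do not exist in Alertmanager. The toy `insert` method models the behaviour only as described in this proposal, not the real dispatcher code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

// group is a toy stand-in for an aggregation group (hypothetical, not Alertmanager's type).
type group struct {
	groupWait  time.Duration
	hasFlushed bool
	nextFlush  time.Duration // time remaining until the next flush
}

// insert models the behaviour in question: an alert older than groupWait
// resets the timer to zero if the group has not flushed yet.
func (g *group) insert(age time.Duration) {
	if !g.hasFlushed && age > g.groupWait {
		g.nextFlush = 0
	}
}

// step binds a text pattern to an action, in the spirit of Cucumber step definitions.
type step struct {
	pattern *regexp.Regexp
	run     func(g *group, d time.Duration) error
}

var steps = []step{
	{regexp.MustCompile(`^a group with group_wait (\S+)$`), func(g *group, d time.Duration) error {
		g.groupWait, g.nextFlush = d, d
		return nil
	}},
	{regexp.MustCompile(`^an alert that started (\S+) ago arrives$`), func(g *group, d time.Duration) error {
		g.insert(d)
		return nil
	}},
	{regexp.MustCompile(`^the next flush is in (\S+)$`), func(g *group, d time.Duration) error {
		if g.nextFlush != d {
			return fmt.Errorf("next flush in %s, want %s", g.nextFlush, d)
		}
		return nil
	}},
}

// runScenario runs every line of a scenario against the step table, in order.
func runScenario(scenario string) error {
	g := &group{}
	for _, line := range strings.Split(strings.TrimSpace(scenario), "\n") {
		// Drop the leading Given/When/Then keyword.
		_, text, ok := strings.Cut(strings.TrimSpace(line), " ")
		if !ok {
			return fmt.Errorf("malformed step %q", line)
		}
		matched := false
		for _, s := range steps {
			m := s.pattern.FindStringSubmatch(text)
			if m == nil {
				continue
			}
			d, err := time.ParseDuration(m[1])
			if err != nil {
				return err
			}
			if err := s.run(g, d); err != nil {
				return err
			}
			matched = true
			break
		}
		if !matched {
			return fmt.Errorf("no step matches %q", text)
		}
	}
	return nil
}

// oldAlert is the specification itself, written as plain text.
const oldAlert = `
Given a group with group_wait 30s
When an alert that started 5m ago arrives
Then the next flush is in 0s
`

func main() {
	if err := runScenario(oldAlert); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("PASS")
}
```

The point of this shape is that the scenario text is the specification: if we decide the reset is incidental, we delete or change the scenario first and the implementation follows. With godog the scenarios would live in `.feature` files and the steps would be registered through its own API instead of the hand-rolled table used here.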
Interesting, cucumber seems pretty neat.
I also wonder if we should actually try to write a TLA+ spec for the notification algorithm... But that's going to be a pretty big task.