
Testing alert rules in a pure Victoria environment/testing alerts using MetricsQL.

Open AeroNotix opened this issue 3 years ago • 5 comments

Is your feature request related to a problem? Please describe.

Prometheus comes with promtool, which has a test rules subcommand.

This is useful for testing alerting rules against synthetic time series that should trigger them. This mode of testing is valuable because it is all configurable in code, can be run in CI/CD systems, and only needs a handful of packages to work.

The issue is that promtool test rules doesn't understand MetricsQL and throws errors when writing and testing alerts that use any MetricsQL functionality.

While it is possible to just skip tests for anything that uses MetricsQL, that is less than ideal. The options for moving forward therefore appear to be:

  • Stop using MetricsQL in alerts/recording rules
  • Don't test anything which uses MetricsQL
  • Run vmalert in replay mode in CI/CD, which would also require running a Prometheus-compatible API that is already filled with metrics.
  • Create this ticket to start the conversation about creating equivalent functionality for testing alerts in vmalert.

Describe the solution you'd like

This would be ideal:

vmctl test rules

AeroNotix avatar Aug 05 '22 10:08 AeroNotix

Hi @AeroNotix, have you tried the vmalert -dryRun option? Does it cover your needs?

 -dryRun -rule
     Whether to check only config files without running vmalert. The rules file are validated. The -rule flag must be specified.

tenmozes avatar Aug 07 '22 10:08 tenmozes

No, I haven't. However, knowing that the rules are syntactically correct is not enough: I also want to unit test them with synthetic time series and assert that those series trigger the rules.

Please take a look at promtool test rules to see.

AeroNotix avatar Aug 07 '22 17:08 AeroNotix

vmalert doesn't provide the ability to create and run unit tests for alerting rules as promtool test rules does.

@hagen1778, could you look into this feature request? It can be useful as a general framework for MetricsQL unit testing.

valyala avatar Aug 07 '22 22:08 valyala

Appreciate it!

AeroNotix avatar Aug 09 '22 21:08 AeroNotix

@valyala @hagen1778 any updates on this one? promtool test rules is an extremely handy tool at my place and I would love to have some equivalent with VictoriaMetrics.

lmarszal avatar Jan 04 '23 14:01 lmarszal

Ability to have full debug mode of vmalert evaluating alert rules would be nice. You feed it a rule file and get: the data points retrieved, the result of the expression, and the alert status change, if any. As an option, test data points could be supplied via an external file.

codedumper1 avatar Feb 10 '23 14:02 codedumper1

Ability to have full debug mode of vmalert evaluating alert rules would be nice

vmalert supports debug mode, but it is for real-time processing - see https://docs.victoriametrics.com/vmalert.html#troubleshooting

hagen1778 avatar Feb 10 '23 15:02 hagen1778

Ability to have full debug mode of vmalert evaluating alert rules would be nice

vmalert supports debug mode, but it is for real-time processing - see https://docs.victoriametrics.com/vmalert.html#troubleshooting

Not sure what you mean. If you have the per-rule debug: true option in mind, it still has inadequate verbosity to understand what's going on and why a rule isn't triggering.

codedumper1 avatar Feb 10 '23 20:02 codedumper1

it still has inadequate verbosity to understand what's going on and why a rule isn't triggering.

Do you mean verbosity in log messages?

why a rule isn't triggering.

What do you think about state update history on rule's Details page?

hagen1778 avatar Feb 13 '23 08:02 hagen1778

Do you mean verbosity in log messages?

Yes. A full trace of the invocation, data gathering, and expression checking is needed. Please see this issue: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3802. It illustrates the exact situation: "last updates" provides a curl request that returns a result, but the debug message in the log complains "query returned 0 samples". It is confusing how both can be true.

What do you think about state update history on rule's Details page?

Good attempt, but currently confusing, because the cURL request provided does return an actual value, which contradicts the debug message "returned 0 samples" in the log. More information is needed. Please see the issue linked above, which illustrates the problem.

What I have in mind is a "trace" of every alert. Something like:

timestamp: evaluating alert "MyAlert"
timestamp: datasource.lookback=X, -datasource.queryStep=Y, for=0, retrieving data samples, got: 1 5 5 5 5
timestamp: running expression: sum(cloudwatch_aws_elastic_map_reduce_core_nodes_running_average) by (job_flow_id) >= bool 3, got result 1
timestamp: for is "0", triggering alert, sent notification to notifier.url
timestamp: saving state to xyz
... and so on.

Maybe such a trace log could be provided in the "last updates" WebUI, together with the cURL line, for those alerts that have 'debug: true' configured.

codedumper1 avatar Feb 13 '23 13:02 codedumper1

timestamp: evaluating alert "MyAlert"
timestamp: datasource.lookback=X, -datasource.queryStep=Y, for=0, retrieving data samples, got: 1 5 5 5 5
timestamp: running expression: sum(cloudwatch_aws_elastic_map_reduce_core_nodes_running_average) by (job_flow_id) >= bool 3, got result 1
timestamp: for is "0", triggering alert, sent notification to notifier.url
timestamp: saving state to xyz

vmalert already prints this in debug messages, except for the "timestamp: running expression: sum" part, because vmalert does not compare any values or conditions itself. The logic is simple: fire alerts for everything that was returned from the datasource, and if for>0, wait for <for> before firing.
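As a rough illustration of the logic described above, here is a minimal Python sketch. It is not vmalert's actual code: the names `Alert` and `evaluate`, the string states, and the use of plain numbers for time are all assumptions made for the example. It also drops an alert as soon as its series stops being returned, which is a simplification.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Alert:
    state: str        # "pending" or "firing" (hypothetical state names)
    active_at: float  # evaluation time at which the series first appeared


def evaluate(alerts: Dict[str, Alert], returned_series: Iterable[str],
             now: float, for_seconds: float) -> Dict[str, Alert]:
    """Run one evaluation round over the series returned by the datasource.

    Every returned series becomes an alert. No values are compared here:
    the comparison is assumed to be part of the query expression itself.
    """
    returned = set(returned_series)

    # Simplification: forget alerts whose series is no longer returned.
    for key in list(alerts):
        if key not in returned:
            del alerts[key]

    for key in returned:
        alert = alerts.get(key)
        if alert is None:
            # With for=0 the alert fires at once, otherwise it starts pending.
            state = "firing" if for_seconds == 0 else "pending"
            alerts[key] = Alert(state=state, active_at=now)
        elif alert.state == "pending" and now - alert.active_at >= for_seconds:
            # The series kept being returned for the whole <for> interval.
            alert.state = "firing"
    return alerts


if __name__ == "__main__":
    state: Dict[str, Alert] = {}
    for t in (0, 15, 30, 45, 60):
        evaluate(state, ['instance="localhost:8429"'], now=t, for_seconds=60)
        print(t, {k: a.state for k, a in state.items()})
```

The sketch shows why a query that returns 0 samples can never produce an alert, whatever the threshold in the expression looks like.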

See the example of debug messages from Troubleshooting section:

2022-09-15T13:35:41.155Z  DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:35:41+02:00: query returned 0 samples (elapsed: 5.896041ms)
2022-09-15T13:35:56.149Z  DEBUG datasource request: executing POST request with params "denyPartialResponse=true&query=sum%28vm_tcplistener_conns%7Binstance%3D%22localhost%3A8429%22%7D%29+by%28instance%29+%3E+0&step=15s&time=1663248945"
2022-09-15T13:35:56.178Z  DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:35:56+02:00: query returned 1 samples (elapsed: 28.368208ms)
2022-09-15T13:35:56.178Z  DEBUG datasource request: executing POST request with params "denyPartialResponse=true&query=sum%28vm_tcplistener_conns%7Binstance%3D%22localhost%3A8429%22%7D%29&step=15s&time=1663248945"
2022-09-15T13:35:56.179Z  DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:35:56+02:00: alert 10705778000901301787 {alertgroup="TestGroup",alertname="Conns",cluster="east-1",instance="localhost:8429",replica="a"} created in state PENDING
...
2022-09-15T13:36:56.153Z  DEBUG rule "TestGroup":"Conns" (2601299393013563564) at 2022-09-15T15:36:56+02:00: alert 10705778000901301787 {alertgroup="TestGroup",alertname="Conns",cluster="east-1",instance="localhost:8429",replica="a"} PENDING => FIRING: 1m0s since becoming active at 2022-09-15 15:35:56.126006 +0200 CEST m=+39.384575417

It shows what was executed and when, what was returned, and to which state the alert object was transferred. That is everything you've listed except the running expression part.
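The "datasource request" lines are URL-encoded, which makes them hard to read. A small Python sketch, using only the standard library, can decode the parameters from the first request shown above. The helper name `decode_datasource_params` is made up for this example and is not part of vmalert.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def decode_datasource_params(raw: str) -> dict:
    """Decode the URL-encoded params string copied from a vmalert debug line."""
    parsed = parse_qs(raw)
    # parse_qs returns a list per key; each key occurs once in these logs.
    return {name: values[0] for name, values in parsed.items()}


# Params copied verbatim from the first "datasource request" debug line.
RAW = (
    "denyPartialResponse=true"
    "&query=sum%28vm_tcplistener_conns%7Binstance%3D%22localhost%3A8429%22%7D%29"
    "+by%28instance%29+%3E+0"
    "&step=15s&time=1663248945"
)

params = decode_datasource_params(RAW)
print("query:", params["query"])
print("step: ", params["step"])
# The "time" param is a Unix timestamp in seconds; show it in UTC.
eval_time = datetime.fromtimestamp(int(params["time"]), tz=timezone.utc)
print("time: ", eval_time.isoformat())
```

Running the decoded query by hand at the decoded time is one way to compare what vmalert asked for with what the datasource returns.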

hagen1778 avatar Feb 13 '23 13:02 hagen1778

You see, the reasons why something isn't working are still hidden. For example, I was stuck with a problem caused by datasource.lookback being too short. Maybe a debug message that at least indicates "returned 0 samples within lookback period" would be useful, although I would prefer a trace that helps catch problems such as 'lookback' and the like.

codedumper1 avatar Feb 14 '23 15:02 codedumper1

vmalert gains the ability to unit-test alerting rules starting from v1.92.0. See these docs for details.

Closing the feature request as done.

valyala avatar Jul 28 '23 01:07 valyala

Unfortunately, this feature was reverted in https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4734. We should come up with a better idea for implementing it.

hagen1778 avatar Jul 28 '23 13:07 hagen1778

Unit testing for rules is now supported by vmalert-tool starting from v1.95.0.

hagen1778 avatar Nov 21 '23 09:11 hagen1778