mimir Help request to troubleshoot MimirContinuousTestFailed alert

Describe the bug

Hello, we are using continuous-test with the mixin to monitor our Mimir clusters. I’m unsure about how to troubleshoot MimirContinuousTestFailed alert for the write-read-series check. We have a couple cases during the last few days (see log messages below)

The alerts come from different cluster monitored by different continuous-test and apart from those there is nothing else in continuous-test logs nor mimir’s logs. If I had to guess I would say this could be related to sample not yet being visible by the queriers something like https://github.com/grafana/mimir/issues/3764

At this point I'm unsure to say it is a bug and I would like to request guidance about how can we troubleshoot such issue?

To Reproduce

At this time I've no succeed in reproducing the error Steps to reproduce the behavior:

Start Mimir 2.5.0
Monitor it with continuous-test

Expected behavior

I would expect to see more message in continuous-test or mimir to help the troubleshooting

Environment

Infrastructure: bare-metal
Deployment tool: debian package

Additional Context

ts=2023-01-06T21:25:42.87954582Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673015460000 end=1673030740000 step=20s results_cache=false msg="Range query result check failed" err="sample at timestamp 1673023640000 (2023-01-06 16:47:20 +0000 UTC) was expected to have timestamp 1673023680000 (2023-01-06 16:48:00 +0000 UTC) because next sample has timestamp 1673023700000 (2023-01-06 16:48:20 +0000 UTC)"

ts=2023-01-06T21:25:42.853913842Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673015460000 end=1673030740000 step=20s results_cache=true msg="Range query result check failed" err="sample at timestamp 1673023640000 (2023-01-06 16:47:20 +0000 UTC) was expected to have timestamp 1673023680000 (2023-01-06 16:48:00 +0000 UTC) because next sample has timestamp 1673023700000 (2023-01-06 16:48:20 +0000 UTC)"

ts=2023-01-08T08:37:17.203808569Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673163420000 end=1673167020000 step=20s results_cache=true msg="Range query result check failed" err="sample at timestamp 1673167020000 (2023-01-08 08:37:00 +0000 UTC) has value -9505.809885 while was expecting -9510.565167"

ts=2023-01-09T07:15:32.350447487Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673244920000 end=1673248520000 step=20s results_cache=true msg="Range query result check failed" err="sample at timestamp 1673248520000 (2023-01-09 07:15:20 +0000 UTC) has value -2076.414045 while was expecting -2079.116897"

Jan 11 '23 20:01 wilfriedroset

I see two different issues in the example logs.

sample at timestamp 1673248520000 (2023-01-09 07:15:20 +0000 UTC) has value -2076.414045 while was expecting -2079.116897

Both logs of this type fail the sample comparison for the last one (timestamp of the failed sample matches the query end parameter), but we also run the comparion from the last sample and then backward, so we don't know how many samples are actually different.

You suggested it could be something similar to https://github.com/grafana/mimir/issues/3764, but Mimir guarantees read-after-write. Since in the mimir-continuous-test tool first we write and then we read back, read-after-write should be guaranteed (unless bugs).

Does it always happen for range queries (runRangeQueryAndVerifyResult) or instant queries (runInstantQueryAndVerifyResult) as well?

sample at timestamp 1673023640000 (2023-01-06 16:47:20 +0000 UTC) was expected to have timestamp 1673023680000 (2023-01-06 16:48:00 +0000 UTC) because next sample has timestamp 1673023700000 (2023-01-06 16:48:20 +0000 UTC)

This log means there's 1 minute of gap in the output samples. Again, this shouldn't happen.

Are you running the query-frontend results cache with any custom config?

Jan 26 '23 17:01 pracucci

Does it always happen for range queries (runRangeQueryAndVerifyResult) or instant queries runInstantQueryAndVerifyResult) as well?

We also have occurences with runInstantQueryAndVerifyResult but not in the timerange listed above.

ts=2023-01-22T09:30:07.834706559Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runInstantQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) ts=1674217920000 results_cache=false msg="Instant query result check failed" err="expected 1 series in the result but got 0"

Are you running the query-frontend results cache with any custom config?

frontend:
    align_queries_with_step: true
    cache_results: true
    grpc_client_config:
        grpc_compression: snappy
        max_recv_msg_size: 209715200
        max_send_msg_size: 419430400
        tls_ca_path: /var/lib/puppet/ssl/certs/ca.pem
        tls_cert_path: [redacted]
        tls_enabled: true
        tls_insecure_skip_verify: true
        tls_key_path: [redacted]
        tls_min_version: VersionTLS13
    log_queries_longer_than: 5s
    parallelize_shardable_queries: true
    results_cache:
        backend: memcached
        compression: snappy
        memcached:
            addresses: dns+mimir-index.memcached.service.consul:11211
            max_get_multi_batch_size: 4096
            max_get_multi_concurrency: 10000
            max_item_size: 0
            timeout: 1s
    scheduler_address: query-scheduler.mimir-grpc.service.consul:9095
frontend_worker:
    grpc_client_config:
        grpc_compression: snappy
        max_recv_msg_size: 209715200
        max_send_msg_size: 419430400
        tls_ca_path: /var/lib/puppet/ssl/certs/ca.pem
        tls_cert_path: [redacted]
        tls_enabled: true
        tls_insecure_skip_verify: true
        tls_key_path: [redacted]
        tls_min_version: VersionTLS13
    scheduler_address: query-scheduler.mimir-grpc.service.consul:9095

Jan 31 '23 14:01 wilfriedroset

Commenting to express that we are facing the same issue for some time now on prod and dev-environments. Curious where this goes.

Jan 31 '23 15:01 henry-kuehl

Any updates on this one? Still failing for us intermittently.

Feb 10 '23 09:02 henry-kuehl

I currently don't have the bandwidth to follow up on this. To further debug it, I would try to add more details to the log when the query fail to better understand for how many samples the comparison fail (e.g. is it always the last one or not?).

Feb 10 '23 17:02 pracucci

@pracucci did you have a chance to increase logging output for this? Or shall I just increase the log-level on our end to see what gets printed for the failure-cases?

Mar 03 '23 08:03 henry-kuehl

@henry-kuehl what is the value of align_queries_with_step in your case? For us it is true and we have found that Grafana Cloud is running with false. We are testing it now with false but as the issue is quite sporadic for us, I'm not sure to have a feedback soon.

See: https://github.com/grafana/mimir/commit/7f0354c413

May 09 '23 09:05 wilfriedroset

@wilfriedroset we upgraded to Mimir 2.7.2 two weeks back and now also run with the default value for align_queries_with_step. I still see the alert firing intermittently and could not observe an immediate change in behaviour. How did you even come to the conclusion that this might have an effect?

May 09 '23 19:05 henry-kuehl

We have been trying many things around continuous test without any luck. The reasoning behind looking at align_queries_with_step was that the timestamp used by mimir-continuous-test might be slightly modified by mimir if they are not properly aligned. By reading the code we can see that mimir-continuous-test does align itself. We did not expect any changes regarding this setting, we just wanted to rule out this hypothesis.

May 09 '23 20:05 wilfriedroset

Any update from any parties on this one? Even in the current Mimir-version we see this one failing intermittently.

Aug 18 '23 07:08 henry-kuehl