mimir icon indicating copy to clipboard operation
mimir copied to clipboard

Upgrade upstream Prometheus: add keep_firing_for support to alerting rules and track 499 on canceled requests

Open pracucci opened this issue 2 years ago • 2 comments

What this PR does

This PR updates the upstream Prometheus. The user facing changes are:

  • Add keep_firing_for support to alerting rules
  • Track 499 in cortex_request_duration_seconds_count for canceled queries

Note to reviewers

  • I'm not a big fan of having to pass the sharding func to our TSDB fork. In particular, Labels.Hash() is used extensively in Prometheus and there's no easy way to block its usage from Mimir (e.g. can't add a faillint rule for that), or even detect when we should use our function and when Labels.Hash() is fine. I find it quite fragile. I suggest to keep it as is in this PR (to not block it for long) but I think we should come up with a less-fragile longer term solution.
  • I manually tested the keep_firing_for and looks working as expected. I've also manually tested the API endpoints /api/v1/rules and /api/v1/alerts: they both looks working as Prometheus ones.
  • I manually tested the status_code used to track canceled queries and this PR fixes https://github.com/grafana/mimir/issues/4071

What has changed in updated vendored libs

Alertmanager changes:

  • Migrated to github.com/hashicorp/golang-lru/v2
  • PushoverConfig introduced UserKeyFile and TokenFile
  • Refactored nflog.Log options and maintenance (commit)
-       github.com/prometheus/alertmanager v0.25.0
+       github.com/prometheus/alertmanager v0.25.1-0.20230119163903-f59460bfd4bf

Prometheus changes:

  • Added KeepFiringFor to Rule and AlertingRule, and KeepFiringSince to Alert. The 'keep_firing_for' field specifies the minimum amount of time that an alert should remain firing, even if the expression does not return any results.
  • 499 on canceled request
  • TSDB in our fork now requires ShardFunc
  • More changes to native histograms (not yet supported by Mimir main, so didn't review changes in depth)
-replace github.com/prometheus/prometheus => github.com/grafana/mimir-prometheus v0.0.0-20230125082610-38af345deab1
+replace github.com/prometheus/prometheus => github.com/grafana/mimir-prometheus v0.0.0-20230127083412-f1986bc584b6

Checked the diff and AWS SDK just added new endpoints:

-       github.com/aws/aws-sdk-go v1.44.159
+       github.com/aws/aws-sdk-go v1.44.187

Which issue(s) this PR fixes or relates to

Fixes #4071

Checklist

  • [ ] Tests updated
  • [ ] Documentation added
  • [x] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

pracucci avatar Jan 27 '23 13:01 pracucci

Is there any chance to get the shardfunc onto Prometheus main? I feel we need to review any update to tsdb from now to check if it hardcodes using labels.Hash

We should talk to @bboreham about it.

pracucci avatar Jan 27 '23 14:01 pracucci

Comment on Slack from @colega:

I think that we need a test in compactor and other features that rely on the stability of the labels hash that checks that given list of series falls into the expected list of shards.

I remember we have for distributors but I don't remember for compactor. I will check it.

pracucci avatar Jan 27 '23 15:01 pracucci

I think that we need a test in compactor and other features that rely on the stability of the labels hash that checks that given list of series falls into the expected list of shards.

I remember we have for distributors but I don't remember for compactor. I will check it.

I confirm we have one for distributors: https://github.com/grafana/mimir/blob/a9a30e00948f710eadcab376d896e0ffd45b3a28/pkg/distributor/distributor_test.go#L1306

pracucci avatar Jan 30 '23 11:01 pracucci