Upgrade upstream Prometheus: add keep_firing_for support to alerting rules and track 499 on canceled requests
What this PR does
This PR updates the upstream Prometheus. The user facing changes are:
- Add
keep_firing_forsupport to alerting rules - Track 499 in
cortex_request_duration_seconds_countfor canceled queries
Note to reviewers
- I'm not a big fan of having to pass the sharding func to our TSDB fork. In particular,
Labels.Hash()is used extensively in Prometheus and there's no easy way to block its usage from Mimir (e.g. can't add afaillintrule for that), or even detect when we should use our function and whenLabels.Hash()is fine. I find it quite fragile. I suggest to keep it as is in this PR (to not block it for long) but I think we should come up with a less-fragile longer term solution. - I manually tested the
keep_firing_forand looks working as expected. I've also manually tested the API endpoints/api/v1/rulesand/api/v1/alerts: they both looks working as Prometheus ones. - I manually tested the status_code used to track canceled queries and this PR fixes https://github.com/grafana/mimir/issues/4071
What has changed in updated vendored libs
Alertmanager changes:
- Migrated to
github.com/hashicorp/golang-lru/v2 - PushoverConfig introduced
UserKeyFileandTokenFile - Refactored
nflog.Logoptions and maintenance (commit)
- github.com/prometheus/alertmanager v0.25.0
+ github.com/prometheus/alertmanager v0.25.1-0.20230119163903-f59460bfd4bf
Prometheus changes:
- Added
KeepFiringFortoRuleandAlertingRule, andKeepFiringSincetoAlert. The 'keep_firing_for' field specifies the minimum amount of time that an alert should remain firing, even if the expression does not return any results. - 499 on canceled request
- TSDB in our fork now requires ShardFunc
- More changes to native histograms (not yet supported by Mimir
main, so didn't review changes in depth)
-replace github.com/prometheus/prometheus => github.com/grafana/mimir-prometheus v0.0.0-20230125082610-38af345deab1
+replace github.com/prometheus/prometheus => github.com/grafana/mimir-prometheus v0.0.0-20230127083412-f1986bc584b6
Checked the diff and AWS SDK just added new endpoints:
- github.com/aws/aws-sdk-go v1.44.159
+ github.com/aws/aws-sdk-go v1.44.187
Which issue(s) this PR fixes or relates to
Fixes #4071
Checklist
- [ ] Tests updated
- [ ] Documentation added
- [x]
CHANGELOG.mdupdated - the order of entries should be[CHANGE],[FEATURE],[ENHANCEMENT],[BUGFIX]
Is there any chance to get the shardfunc onto Prometheus main? I feel we need to review any update to tsdb from now to check if it hardcodes using labels.Hash
We should talk to @bboreham about it.
Comment on Slack from @colega:
I think that we need a test in compactor and other features that rely on the stability of the labels hash that checks that given list of series falls into the expected list of shards.
I remember we have for distributors but I don't remember for compactor. I will check it.
I think that we need a test in compactor and other features that rely on the stability of the labels hash that checks that given list of series falls into the expected list of shards.
I remember we have for distributors but I don't remember for compactor. I will check it.
I confirm we have one for distributors: https://github.com/grafana/mimir/blob/a9a30e00948f710eadcab376d896e0ffd45b3a28/pkg/distributor/distributor_test.go#L1306