Support remote write v2 by converting request
This PR supports Prometheus remote write 2.0 by converting the v2 request to v1 at the API.
Which issue(s) this PR fixes: Fixes #6324
Checklist
- [x] Tests updated
- [ ] Documentation added
- [x]
CHANGELOG.mdupdated - the order of entries should be[CHANGE],[FEATURE],[ENHANCEMENT],[BUGFIX]
Looks promising. Thanks!
FYI we have https://github.com/prometheus/client_golang/pull/1658 which exports remote write handler. Not a blocker for this PR but we should keep it on our radar to switch to use the library
@yeya24 Thanks for letting me know. Should we make the issue to track it?
Maybe we can open a issue for someone give a try to use the client_golang handler even before it get merged so we can give feedback on the open PR. Changing that handler after is merged probably will be more difficult as it could potentially break all projects that are already using it.
@alanprot I added a comment here: https://github.com/cortexproject/cortex/issues/6324
I took a breif look at https://github.com/prometheus/client_golang/pull/1658. Left some comments there and we have some changes Cortex specific that might not make sense for Prometheus. I think we are ok to proceed with this PR first.
@yeya24 Thanks. I read it, and it would be good if we could reuse its functions!
@yeya24
I added -distributor.remote-write2-enabled flags to configure whether the Distributor can accept PRW2.0.
I added the TestIngesterRollingUpdate e2e test, where the Distributor can accept PRW2.0 and the Ingester uses the v1.18.1 image.
The result is PRW2.0 push is a success, but the response header (X-Prometheus-Remote-Write-xxx) values are all "0".
Is it expecting behavior?
The result is PRW2.0 push is a success, but the response header (X-Prometheus-Remote-Write-xxx) values are all "0". Is it expecting behavior?
This doesn't sound like the right behavior. Is that what you got with this PR?
@yeya24 Yes, the test condition is that Ingester uses a v1.18.1 image and Distributor uses a PRW2.0-implemented one. The PRW2.0 push then gets that result. If the Ingester and Distributor use the same images (PRW 2.0 implemented), we can get the expected response header.
Yes, the test condition is that Ingester uses a v1.18.1 image and Distributor uses a PRW2.0-implemented one. The PRW2.0 push then gets that result.
I see what you meant. Then it is expected to get that result if you use Ingester of old version and Distributor of new version. That's why we introduce the PRW 2.0 feature flag in distributor to only enable PRW 2.0 request if backend Ingester is running the newer version.
@yeya24
Should I add comments to -distributor.remote-write2-enabled so that the user can do a rolling update of the Ingesters first and then update the Distributor afterward?
We can mention it in the flag description but I doubt users really look at it. I prefer to create a dedicated doc/guide for users to migrate to Prometheus 3.0
@yeya24 Yes, the guide docs would be more good.
Hello @SungJin1212, thank you for opening this PR.
There is a release in progress. As such, please rebase your CHANGELOG entry on top of the master branch and move the CHANGELOG entry to the top under ## master / unreleased.
Thanks, Charlie
I think it is time to revisit this now : )
Ping for review. @danielblando @alanprot @friedrichg
@yeya24
Should we automatically enable -blocks-storage.tsdb.enable-native-histograms if the -distributor.remote-write2-enabled is true?
If we set the protobuf_message to io.prometheus.write.v2.Request on the Prometheus, the Prometheus sends CT and NH by default. (remote write config)
Should we automatically enable -blocks-storage.tsdb.enable-native-histograms if the -distributor.remote-write2-enabled is true?
No, I think it is more of a user's responsibility to enable it. They should enable it on Ingester first for -blocks-storage.tsdb.enable-native-histograms and then enable -distributor.remote-write2-enabled on Distributor.
Would Prometheus fallback to send V1 protocol if Cortex doesn't ingest native histograms?
@yeya24
Would Prometheus fallback to send V1 protocol if Cortex doesn't ingest native histograms?
No, the cortex fails to ingest the native histograms. The Prometheus side error message is:
source=queue_manager.go:1672 msg="we got 2xx status code from the Receiver yet statistics indicate some dat was not written; investigation needed" component=remote remote_name=09e380 url=http://host.docker.internal:57765/api/v1/push failedSampleCount=0 failedHistogramCount=4 failedExemplarCount=0
Sorry for the HUGE late review. Thanks a lot for the work. Overall lgtm, just the header count i am worried about. We can discuss of having as approximation if replication is enabled. If replication is disabled it will be easier