cortex Support remote write v2 by converting request

This PR supports Prometheus remote write 2.0 by converting the v2 request to v1 at the API.

Which issue(s) this PR fixes: Fixes #6324

Checklist

[x] Tests updated
[ ] Documentation added
[x] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Nov 11 '24 06:11 SungJin1212

Looks promising. Thanks!

FYI we have https://github.com/prometheus/client_golang/pull/1658 which exports remote write handler. Not a blocker for this PR but we should keep it on our radar to switch to use the library

Nov 14 '24 21:11 yeya24

@yeya24 Thanks for letting me know. Should we make the issue to track it?

Nov 15 '24 00:11 SungJin1212

Maybe we can open a issue for someone give a try to use the client_golang handler even before it get merged so we can give feedback on the open PR. Changing that handler after is merged probably will be more difficult as it could potentially break all projects that are already using it.

Nov 15 '24 19:11 alanprot

@alanprot I added a comment here: https://github.com/cortexproject/cortex/issues/6324

Nov 15 '24 22:11 SungJin1212

I took a breif look at https://github.com/prometheus/client_golang/pull/1658. Left some comments there and we have some changes Cortex specific that might not make sense for Prometheus. I think we are ok to proceed with this PR first.

Nov 17 '24 20:11 yeya24

@yeya24 Thanks. I read it, and it would be good if we could reuse its functions!

Nov 18 '24 00:11 SungJin1212

@yeya24 I added -distributor.remote-write2-enabled flags to configure whether the Distributor can accept PRW2.0. I added the TestIngesterRollingUpdate e2e test, where the Distributor can accept PRW2.0 and the Ingester uses the v1.18.1 image. The result is PRW2.0 push is a success, but the response header (X-Prometheus-Remote-Write-xxx) values are all "0". Is it expecting behavior?

Nov 27 '24 02:11 SungJin1212

The result is PRW2.0 push is a success, but the response header (X-Prometheus-Remote-Write-xxx) values are all "0". Is it expecting behavior?

This doesn't sound like the right behavior. Is that what you got with this PR?

Dec 03 '24 02:12 yeya24

@yeya24 Yes, the test condition is that Ingester uses a v1.18.1 image and Distributor uses a PRW2.0-implemented one. The PRW2.0 push then gets that result. If the Ingester and Distributor use the same images (PRW 2.0 implemented), we can get the expected response header.

Dec 03 '24 02:12 SungJin1212

Yes, the test condition is that Ingester uses a v1.18.1 image and Distributor uses a PRW2.0-implemented one. The PRW2.0 push then gets that result.

I see what you meant. Then it is expected to get that result if you use Ingester of old version and Distributor of new version. That's why we introduce the PRW 2.0 feature flag in distributor to only enable PRW 2.0 request if backend Ingester is running the newer version.

Dec 03 '24 03:12 yeya24

@yeya24 Should I add comments to -distributor.remote-write2-enabled so that the user can do a rolling update of the Ingesters first and then update the Distributor afterward?

Dec 03 '24 06:12 SungJin1212

We can mention it in the flag description but I doubt users really look at it. I prefer to create a dedicated doc/guide for users to migrate to Prometheus 3.0

Dec 03 '24 17:12 yeya24

@yeya24 Yes, the guide docs would be more good.

Dec 04 '24 00:12 SungJin1212

Hello @SungJin1212, thank you for opening this PR.

There is a release in progress. As such, please rebase your CHANGELOG entry on top of the master branch and move the CHANGELOG entry to the top under ## master / unreleased.

Thanks, Charlie

Jan 23 '25 03:01 CharlieTLe

I think it is time to revisit this now : )

Mar 03 '25 23:03 yeya24

Ping for review. @danielblando @alanprot @friedrichg

Apr 08 '25 23:04 yeya24

@yeya24 Should we automatically enable -blocks-storage.tsdb.enable-native-histograms if the -distributor.remote-write2-enabled is true? If we set the protobuf_message to io.prometheus.write.v2.Request on the Prometheus, the Prometheus sends CT and NH by default. (remote write config)

Apr 14 '25 01:04 SungJin1212

Should we automatically enable -blocks-storage.tsdb.enable-native-histograms if the -distributor.remote-write2-enabled is true?

No, I think it is more of a user's responsibility to enable it. They should enable it on Ingester first for -blocks-storage.tsdb.enable-native-histograms and then enable -distributor.remote-write2-enabled on Distributor.

Would Prometheus fallback to send V1 protocol if Cortex doesn't ingest native histograms?

Apr 24 '25 17:04 yeya24

@yeya24

Would Prometheus fallback to send V1 protocol if Cortex doesn't ingest native histograms?

No, the cortex fails to ingest the native histograms. The Prometheus side error message is:

source=queue_manager.go:1672 msg="we got 2xx status code from the Receiver yet statistics indicate some dat was not written; investigation needed" component=remote remote_name=09e380 url=http://host.docker.internal:57765/api/v1/push failedSampleCount=0 failedHistogramCount=4 failedExemplarCount=0

Apr 24 '25 23:04 SungJin1212

Sorry for the HUGE late review. Thanks a lot for the work. Overall lgtm, just the header count i am worried about. We can discuss of having as approximation if replication is enabled. If replication is disabled it will be easier

Jul 15 '25 21:07 danielblando