cortex icon indicating copy to clipboard operation
cortex copied to clipboard

Support remote write v2 by converting request

Open SungJin1212 opened this issue 1 year ago • 19 comments

This PR supports Prometheus remote write 2.0 by converting the v2 request to v1 at the API.

Which issue(s) this PR fixes: Fixes #6324

Checklist

  • [x] Tests updated
  • [ ] Documentation added
  • [x] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

SungJin1212 avatar Nov 11 '24 06:11 SungJin1212

Looks promising. Thanks!

FYI we have https://github.com/prometheus/client_golang/pull/1658 which exports remote write handler. Not a blocker for this PR but we should keep it on our radar to switch to use the library

yeya24 avatar Nov 14 '24 21:11 yeya24

@yeya24 Thanks for letting me know. Should we make the issue to track it?

SungJin1212 avatar Nov 15 '24 00:11 SungJin1212

Maybe we can open a issue for someone give a try to use the client_golang handler even before it get merged so we can give feedback on the open PR. Changing that handler after is merged probably will be more difficult as it could potentially break all projects that are already using it.

alanprot avatar Nov 15 '24 19:11 alanprot

@alanprot I added a comment here: https://github.com/cortexproject/cortex/issues/6324

SungJin1212 avatar Nov 15 '24 22:11 SungJin1212

I took a breif look at https://github.com/prometheus/client_golang/pull/1658. Left some comments there and we have some changes Cortex specific that might not make sense for Prometheus. I think we are ok to proceed with this PR first.

yeya24 avatar Nov 17 '24 20:11 yeya24

@yeya24 Thanks. I read it, and it would be good if we could reuse its functions!

SungJin1212 avatar Nov 18 '24 00:11 SungJin1212

@yeya24 I added -distributor.remote-write2-enabled flags to configure whether the Distributor can accept PRW2.0. I added the TestIngesterRollingUpdate e2e test, where the Distributor can accept PRW2.0 and the Ingester uses the v1.18.1 image. The result is PRW2.0 push is a success, but the response header (X-Prometheus-Remote-Write-xxx) values are all "0". Is it expecting behavior?

SungJin1212 avatar Nov 27 '24 02:11 SungJin1212

The result is PRW2.0 push is a success, but the response header (X-Prometheus-Remote-Write-xxx) values are all "0". Is it expecting behavior?

This doesn't sound like the right behavior. Is that what you got with this PR?

yeya24 avatar Dec 03 '24 02:12 yeya24

@yeya24 Yes, the test condition is that Ingester uses a v1.18.1 image and Distributor uses a PRW2.0-implemented one. The PRW2.0 push then gets that result. If the Ingester and Distributor use the same images (PRW 2.0 implemented), we can get the expected response header.

SungJin1212 avatar Dec 03 '24 02:12 SungJin1212

Yes, the test condition is that Ingester uses a v1.18.1 image and Distributor uses a PRW2.0-implemented one. The PRW2.0 push then gets that result.

I see what you meant. Then it is expected to get that result if you use Ingester of old version and Distributor of new version. That's why we introduce the PRW 2.0 feature flag in distributor to only enable PRW 2.0 request if backend Ingester is running the newer version.

yeya24 avatar Dec 03 '24 03:12 yeya24

@yeya24 Should I add comments to -distributor.remote-write2-enabled so that the user can do a rolling update of the Ingesters first and then update the Distributor afterward?

SungJin1212 avatar Dec 03 '24 06:12 SungJin1212

We can mention it in the flag description but I doubt users really look at it. I prefer to create a dedicated doc/guide for users to migrate to Prometheus 3.0

yeya24 avatar Dec 03 '24 17:12 yeya24

@yeya24 Yes, the guide docs would be more good.

SungJin1212 avatar Dec 04 '24 00:12 SungJin1212

Hello @SungJin1212, thank you for opening this PR.

There is a release in progress. As such, please rebase your CHANGELOG entry on top of the master branch and move the CHANGELOG entry to the top under ## master / unreleased.

Thanks, Charlie

CharlieTLe avatar Jan 23 '25 03:01 CharlieTLe

I think it is time to revisit this now : )

yeya24 avatar Mar 03 '25 23:03 yeya24

Ping for review. @danielblando @alanprot @friedrichg

yeya24 avatar Apr 08 '25 23:04 yeya24

@yeya24 Should we automatically enable -blocks-storage.tsdb.enable-native-histograms if the -distributor.remote-write2-enabled is true? If we set the protobuf_message to io.prometheus.write.v2.Request on the Prometheus, the Prometheus sends CT and NH by default. (remote write config)

SungJin1212 avatar Apr 14 '25 01:04 SungJin1212

Should we automatically enable -blocks-storage.tsdb.enable-native-histograms if the -distributor.remote-write2-enabled is true?

No, I think it is more of a user's responsibility to enable it. They should enable it on Ingester first for -blocks-storage.tsdb.enable-native-histograms and then enable -distributor.remote-write2-enabled on Distributor.

Would Prometheus fallback to send V1 protocol if Cortex doesn't ingest native histograms?

yeya24 avatar Apr 24 '25 17:04 yeya24

@yeya24

Would Prometheus fallback to send V1 protocol if Cortex doesn't ingest native histograms?

No, the cortex fails to ingest the native histograms. The Prometheus side error message is:

source=queue_manager.go:1672 msg="we got 2xx status code from the Receiver yet statistics indicate some dat was not written; investigation needed" component=remote remote_name=09e380 url=http://host.docker.internal:57765/api/v1/push failedSampleCount=0 failedHistogramCount=4 failedExemplarCount=0

SungJin1212 avatar Apr 24 '25 23:04 SungJin1212

Sorry for the HUGE late review. Thanks a lot for the work. Overall lgtm, just the header count i am worried about. We can discuss of having as approximation if replication is enabled. If replication is disabled it will be easier

danielblando avatar Jul 15 '25 21:07 danielblando