opentelemetry-dotnet [prometheus] Fixed a race condition happening in simultaneous OpenMetrics and PlainText requests

Fixes #5515

Changes

Fixed a race condition which can cause returning an incorrect response while OpenMetrics requests and PlainText requests are being processed simultaneously.

Merge requirement checklist

[x] CONTRIBUTING guidelines followed (license requirements, nullable enabled, static analysis, etc.)

Apr 08 '24 05:04 hez2010

The committers listed above are authorized under a signed CLA.

:white_check_mark: login: hez2010 / name: Steve (7df9d9c742bf4fd0f3620692375ba31f8ecc8431, 224d1bdeace627d7984cb6c4feb70335d4b389d4, 1c843d0d8e1938f8c17136d48fcc2b3d877a7ec6, 2804a7dcc765dc849c0a7a5f447fc026a2f79878, 00149ef38cb0cb2a561cf7eef01742fcaab4ea9e, 0e2551a604a97cdf429a8d7257902d7bafb338e5, c1af8238cd65c4ec7e0eb13c13f1510a237c3fc8, fd01eb269a85d0494d440f876863bbfca1776a6b, 19ae57095501b0b8b4675743dd889acdd24ef8e6, c2e10df9f61d117a09f4732facd05992425a698f, 286d23d74a7efec90f6ed9d046d30a14bb6cfcf2, 3ee60675d2e01d94c056514bd728877aacaaf564, cb957662cb484beb4a4a598bce6f462196488803, 0269b936971cd724abe406dc035a91c352f1ad14, 45dadb239baa60a01896ed024c3706b5fc0c94ea, 580257e696bd30a9f1ba5cc1d878fffb278a6c33, 677b121316690a76caf396fe40bf23a0efa61196, 7ee37127193eb42f30a804ead21a2e6d66ef3a8b, 63f32d66e5db16063589d35d2d687255baaf13bf, 3e003f8144d556474d05d52fe7fe2997baa8e88f, fb54fc8f0807a1daf467b8d01791ff5500bce074, d930f9773d0c68ca805acfb3cd93f15285c93ef7, 3558e9f8b1dab0e09477fb999bae4db8d3fa542a, aaf7e4396107ffa9be585ac7f91aae3f2e0cf80e, 6e5549fcb49ad8d512e7c53632811546f047567e, aed502c389e712f21cf8a7577e9b68618c80137c)
:white_check_mark: login: CodeBlanch / name: Mikel Blanchard (017dd39e31a3a4176950f0ba668690ea72ba5d6f, 7b3b31e46d46ad6b5237a249dff1e763565fb50c)
:white_check_mark: login: vishweshbankwar / name: Vishwesh Bankwar (033148cbbdc5c021fe8c98deddeb19c2b532a62d, 90c348c8bf0944228eee685b5aa304b79fe45048)
:white_check_mark: login: reyang / name: Reiley Yang (665e39076a3abbd497f8deb26f3bdf9845a965da)

Apr 08 '24 05:04 linux-foundation-easycla[bot]

Codecov Report

Attention: Patch coverage is 71.64179% with 19 lines in your changes are missing coverage. Please review.

Project coverage is 85.49%. Comparing base (6250307) to head (665e390). Report is 222 commits behind head on main.

:exclamation: Current head 665e390 differs from pull request most recent head aed502c. Consider uploading reports for the commit aed502c to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5517      +/-   ##
==========================================
+ Coverage   83.38%   85.49%   +2.11%     
==========================================
  Files         297      289       -8     
  Lines       12531    12515      -16     
==========================================
+ Hits        10449    10700     +251     
+ Misses       2082     1815     -267

Flag	Coverage Δ
unittests	`?`
unittests-Solution-Experimental	`85.25% <71.64%> (?)`
unittests-Solution-Stable	`85.45% <71.64%> (?)`
unittests-Unstable-Core	`20.01% <71.64%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
...metheus.AspNetCore/PrometheusExporterMiddleware.cs	`70.00% <100.00%> (ø)`
....Prometheus.HttpListener/PrometheusHttpListener.cs	`81.92% <100.00%> (+0.22%)`	:arrow_up:
...tpListener/Internal/PrometheusCollectionManager.cs	`75.67% <67.79%> (-2.11%)`	:arrow_down:

... and 78 files with indirect coverage changes

Apr 08 '24 16:04 codecov[bot]

The race condition is still present and need more work. I'm converting it to draft for now.

Apr 09 '24 17:04 hez2010

Thank you for looking into this @hez2010! Adding @robertcoltheart as FYI.

I think we need some architecture change here. The original design caches the response with the assumption that the same response payload can be used for all scrapers. With the ability to support both Prometheus (text/plain; charset=utf-8; version=0.0.4) and OpenMetrics (application/openmetrics-text; version=1.0.0; charset=utf-8) format, I guess we need to have separate cache for each potential response body, or we don't cache at all.

Apr 09 '24 18:04 reyang

Now it's ready for review. /cc: @reyang @robertcoltheart @vishweshbankwar @CodeBlanch

Apr 11 '24 08:04 hez2010

Do we know when this will be released?

Apr 17 '24 15:04 dutts

@dutts We discussed your PR on our SIG meeting yesterday. IIRC @vishweshbankwar had some feedback he was going to provide. I'll ping him right now as a reminder.

edit: Sorry, @hez2010's PR 😄

Apr 17 '24 17:04 CodeBlanch

@dutts We discussed your PR

I'm confused - what is "your PR"? 👀

Apr 17 '24 17:04 reyang

@dutts We discussed your PR

I'm confused - what is "your PR"? 👀

Hah yes sorry not my PR, I just commented as an interested party. Stumbled across this issue in my deployment, lost half a day to tracking it down to the intermittent corrupt output having Prometheus over and then eventually found this issue.

Apr 17 '24 18:04 dutts

@hez2010 Sorry it looks like your TODO introduced another lint warning. Would you fix that please and then I'll merge?

Apr 18 '24 18:04 CodeBlanch

Done.

Apr 19 '24 00:04 hez2010

I pushed a new commit to simplify the code, and resolved another race condition due to the bad exchange of this.collectionTcs. The test PrometheusExporterMiddlewareIntegration_MultipleRequestsOfDifferentFormatsInParallel now passed with even 100,000,000 requests on my machine.

Apr 19 '24 13:04 hez2010

I pushed a new commit to simplify the code, and resolved another race condition due to the bad exchange of this.collectionTcs. The test PrometheusExporterMiddlewareIntegration_MultipleRequestsOfDifferentFormatsInParallel now passed with even 100,000,000 requests on my machine.

@hez2010 - Could you elaborate what was the race condition?

Apr 19 '24 16:04 vishweshbankwar

@hez2010 - Could you elaborate what was the race condition?

We are neither using lock (obj) nor inserting a memory barrier around this.collectionTcs = ..., so it can sometimes result in accessing an invalid collectionTcs by another thread due to memory order issues. Here I changed it to use CompareExchange instead.

You can observe the issue by running the test I added (and better to increase the request times). Without the change, the test sometimes never finishes.

Apr 19 '24 17:04 hez2010

@hez2010 - Could you elaborate what was the race condition?

We are neither using lock (obj) nor inserting a memory barrier around this.collectionTcs = ..., so it can sometimes result in accessing an invalid collectionTcs by another thread due to memory order issues. Here I changed it to use CompareExchange instead.

You can observe the issue by running the test I added (and better to increase the request times). Without the change, the test sometimes never finishes.

This issue would surface even without any of these changes? Meaning, if we just run the test with single value passed here

Apr 19 '24 18:04 vishweshbankwar

Seems that I accidentally broke the cache. Working on a fix.

Apr 20 '24 03:04 hez2010

It turns out to be an invalid assert in the test.

Apr 20 '24 04:04 hez2010

This issue would surface even without any of these changes? Meaning, if we just run the test with single value passed here

Yes. My newly added test sometimes never finishes against the existing code as well.

Apr 20 '24 04:04 hez2010

All things should work expected now. PTAL.

Apr 20 '24 04:04 hez2010

This issue would surface even without any of these changes? Meaning, if we just run the test with single value passed here

Yes. My newly added test sometimes never finishes against the existing code as well.

Thanks for confirming @hez2010. Looks like this PR is now trying to solve two issues.

Issue of supporting two formats simultaneously.
Concurrency issue.

Here is what I think we should do.

Keep this PR simple and solve the issue of not supporting two formats simultaneously. The original change you had by simply adding two buffers.
Propose a separate change for fixing concurrency issue. A detailed issue with a simple test first would be helpful.

Apr 22 '24 16:04 vishweshbankwar

I can separate this PR into two parts as you mentioned, but the tests will gonna fail without the fix for the concurrency issue (I think you may have noticed that some previous CI run timeouts because they failed to complete the test due to this) so that the PR will not be a good state that can be merged IMO.

Apr 23 '24 16:04 hez2010

I can separate this PR into two parts as you mentioned, but the tests will gonna fail without the fix for the concurrency issue (I think you may have noticed that some previous CI run timeouts because they failed to complete the test due to this) so that the PR will not be a good state that can be merged IMO.

For the test, you can change it to sequential execution for now. The test should validate that the exporter can serve both the formats. When working on concurrency issue, can add the parallel test back.

Apr 23 '24 17:04 vishweshbankwar

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

May 01 '24 03:05 github-actions[bot]

@hez2010 - Checking to see if you are planning to do the changes based on the discussion above?

May 02 '24 18:05 vishweshbankwar

@hez2010 - Checking to see if you are planning to do the changes based on the discussion above?

Thanks for pinging!

Sorry for the delay. I have been busy recently so may not be able to apply the changes right now. While feel free to pick up the PR if it's urgent.

May 06 '24 07:05 hez2010

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

May 14 '24 03:05 github-actions[bot]

I can probably pick this up tonight if others are busy.

May 15 '24 02:05 robertcoltheart

I've raised a PR for the first issue, which is splitting the buffers here: https://github.com/open-telemetry/opentelemetry-dotnet/pull/5623 It also optimizes the generation to only the type requested, instead of generating both open metrics and plain.

May 16 '24 11:05 robertcoltheart

closing in favor of #5623

May 17 '24 00:05 vishweshbankwar

opentelemetry-dotnet opentelemetry-dotnet copied to clipboard

[prometheus] Fixed a race condition happening in simultaneous OpenMetrics and PlainText requests

Changes

Merge requirement checklist

Codecov Report

opentelemetry-dotnet
opentelemetry-dotnet copied to clipboard