opentelemetry-dotnet icon indicating copy to clipboard operation
opentelemetry-dotnet copied to clipboard

[prometheus] Fixed a race condition happening in simultaneous OpenMetrics and PlainText requests

Open hez2010 opened this issue 1 year ago • 25 comments

Fixes #5515

Changes

Fixed a race condition which can cause returning an incorrect response while OpenMetrics requests and PlainText requests are being processed simultaneously.

Merge requirement checklist

  • [x] CONTRIBUTING guidelines followed (license requirements, nullable enabled, static analysis, etc.)

hez2010 avatar Apr 08 '24 05:04 hez2010

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: hez2010 / name: Steve (7df9d9c742bf4fd0f3620692375ba31f8ecc8431, 224d1bdeace627d7984cb6c4feb70335d4b389d4, 1c843d0d8e1938f8c17136d48fcc2b3d877a7ec6, 2804a7dcc765dc849c0a7a5f447fc026a2f79878, 00149ef38cb0cb2a561cf7eef01742fcaab4ea9e, 0e2551a604a97cdf429a8d7257902d7bafb338e5, c1af8238cd65c4ec7e0eb13c13f1510a237c3fc8, fd01eb269a85d0494d440f876863bbfca1776a6b, 19ae57095501b0b8b4675743dd889acdd24ef8e6, c2e10df9f61d117a09f4732facd05992425a698f, 286d23d74a7efec90f6ed9d046d30a14bb6cfcf2, 3ee60675d2e01d94c056514bd728877aacaaf564, cb957662cb484beb4a4a598bce6f462196488803, 0269b936971cd724abe406dc035a91c352f1ad14, 45dadb239baa60a01896ed024c3706b5fc0c94ea, 580257e696bd30a9f1ba5cc1d878fffb278a6c33, 677b121316690a76caf396fe40bf23a0efa61196, 7ee37127193eb42f30a804ead21a2e6d66ef3a8b, 63f32d66e5db16063589d35d2d687255baaf13bf, 3e003f8144d556474d05d52fe7fe2997baa8e88f, fb54fc8f0807a1daf467b8d01791ff5500bce074, d930f9773d0c68ca805acfb3cd93f15285c93ef7, 3558e9f8b1dab0e09477fb999bae4db8d3fa542a, aaf7e4396107ffa9be585ac7f91aae3f2e0cf80e, 6e5549fcb49ad8d512e7c53632811546f047567e, aed502c389e712f21cf8a7577e9b68618c80137c)
  • :white_check_mark: login: CodeBlanch / name: Mikel Blanchard (017dd39e31a3a4176950f0ba668690ea72ba5d6f, 7b3b31e46d46ad6b5237a249dff1e763565fb50c)
  • :white_check_mark: login: vishweshbankwar / name: Vishwesh Bankwar (033148cbbdc5c021fe8c98deddeb19c2b532a62d, 90c348c8bf0944228eee685b5aa304b79fe45048)
  • :white_check_mark: login: reyang / name: Reiley Yang (665e39076a3abbd497f8deb26f3bdf9845a965da)

Codecov Report

Attention: Patch coverage is 71.64179% with 19 lines in your changes are missing coverage. Please review.

Project coverage is 85.49%. Comparing base (6250307) to head (665e390). Report is 222 commits behind head on main.

:exclamation: Current head 665e390 differs from pull request most recent head aed502c. Consider uploading reports for the commit aed502c to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5517      +/-   ##
==========================================
+ Coverage   83.38%   85.49%   +2.11%     
==========================================
  Files         297      289       -8     
  Lines       12531    12515      -16     
==========================================
+ Hits        10449    10700     +251     
+ Misses       2082     1815     -267     
Flag Coverage Δ
unittests ?
unittests-Solution-Experimental 85.25% <71.64%> (?)
unittests-Solution-Stable 85.45% <71.64%> (?)
unittests-Unstable-Core 20.01% <71.64%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...metheus.AspNetCore/PrometheusExporterMiddleware.cs 70.00% <100.00%> (ø)
....Prometheus.HttpListener/PrometheusHttpListener.cs 81.92% <100.00%> (+0.22%) :arrow_up:
...tpListener/Internal/PrometheusCollectionManager.cs 75.67% <67.79%> (-2.11%) :arrow_down:

... and 78 files with indirect coverage changes

codecov[bot] avatar Apr 08 '24 16:04 codecov[bot]

The race condition is still present and need more work. I'm converting it to draft for now.

hez2010 avatar Apr 09 '24 17:04 hez2010

Thank you for looking into this @hez2010! Adding @robertcoltheart as FYI.

I think we need some architecture change here. The original design caches the response with the assumption that the same response payload can be used for all scrapers. With the ability to support both Prometheus (text/plain; charset=utf-8; version=0.0.4) and OpenMetrics (application/openmetrics-text; version=1.0.0; charset=utf-8) format, I guess we need to have separate cache for each potential response body, or we don't cache at all.

reyang avatar Apr 09 '24 18:04 reyang

Now it's ready for review. /cc: @reyang @robertcoltheart @vishweshbankwar @CodeBlanch

hez2010 avatar Apr 11 '24 08:04 hez2010

Do we know when this will be released?

dutts avatar Apr 17 '24 15:04 dutts

@dutts We discussed your PR on our SIG meeting yesterday. IIRC @vishweshbankwar had some feedback he was going to provide. I'll ping him right now as a reminder.

edit: Sorry, @hez2010's PR 😄

CodeBlanch avatar Apr 17 '24 17:04 CodeBlanch

@dutts We discussed your PR

I'm confused - what is "your PR"? 👀

reyang avatar Apr 17 '24 17:04 reyang

@dutts We discussed your PR

I'm confused - what is "your PR"? 👀

Hah yes sorry not my PR, I just commented as an interested party. Stumbled across this issue in my deployment, lost half a day to tracking it down to the intermittent corrupt output having Prometheus over and then eventually found this issue.

dutts avatar Apr 17 '24 18:04 dutts

@hez2010 Sorry it looks like your TODO introduced another lint warning. Would you fix that please and then I'll merge?

CodeBlanch avatar Apr 18 '24 18:04 CodeBlanch

Done.

hez2010 avatar Apr 19 '24 00:04 hez2010

I pushed a new commit to simplify the code, and resolved another race condition due to the bad exchange of this.collectionTcs. The test PrometheusExporterMiddlewareIntegration_MultipleRequestsOfDifferentFormatsInParallel now passed with even 100,000,000 requests on my machine.

hez2010 avatar Apr 19 '24 13:04 hez2010

I pushed a new commit to simplify the code, and resolved another race condition due to the bad exchange of this.collectionTcs. The test PrometheusExporterMiddlewareIntegration_MultipleRequestsOfDifferentFormatsInParallel now passed with even 100,000,000 requests on my machine.

@hez2010 - Could you elaborate what was the race condition?

vishweshbankwar avatar Apr 19 '24 16:04 vishweshbankwar

@hez2010 - Could you elaborate what was the race condition?

We are neither using lock (obj) nor inserting a memory barrier around this.collectionTcs = ..., so it can sometimes result in accessing an invalid collectionTcs by another thread due to memory order issues. Here I changed it to use CompareExchange instead.

You can observe the issue by running the test I added (and better to increase the request times). Without the change, the test sometimes never finishes.

hez2010 avatar Apr 19 '24 17:04 hez2010

@hez2010 - Could you elaborate what was the race condition?

We are neither using lock (obj) nor inserting a memory barrier around this.collectionTcs = ..., so it can sometimes result in accessing an invalid collectionTcs by another thread due to memory order issues. Here I changed it to use CompareExchange instead.

You can observe the issue by running the test I added (and better to increase the request times). Without the change, the test sometimes never finishes.

This issue would surface even without any of these changes? Meaning, if we just run the test with single value passed here

vishweshbankwar avatar Apr 19 '24 18:04 vishweshbankwar

Seems that I accidentally broke the cache. Working on a fix.

hez2010 avatar Apr 20 '24 03:04 hez2010

It turns out to be an invalid assert in the test.

hez2010 avatar Apr 20 '24 04:04 hez2010

This issue would surface even without any of these changes? Meaning, if we just run the test with single value passed here

Yes. My newly added test sometimes never finishes against the existing code as well.

hez2010 avatar Apr 20 '24 04:04 hez2010

All things should work expected now. PTAL.

hez2010 avatar Apr 20 '24 04:04 hez2010

This issue would surface even without any of these changes? Meaning, if we just run the test with single value passed here

Yes. My newly added test sometimes never finishes against the existing code as well.

Thanks for confirming @hez2010. Looks like this PR is now trying to solve two issues.

  1. Issue of supporting two formats simultaneously.
  2. Concurrency issue.

Here is what I think we should do.

  1. Keep this PR simple and solve the issue of not supporting two formats simultaneously. The original change you had by simply adding two buffers.

  2. Propose a separate change for fixing concurrency issue. A detailed issue with a simple test first would be helpful.

vishweshbankwar avatar Apr 22 '24 16:04 vishweshbankwar

I can separate this PR into two parts as you mentioned, but the tests will gonna fail without the fix for the concurrency issue (I think you may have noticed that some previous CI run timeouts because they failed to complete the test due to this) so that the PR will not be a good state that can be merged IMO.

hez2010 avatar Apr 23 '24 16:04 hez2010

I can separate this PR into two parts as you mentioned, but the tests will gonna fail without the fix for the concurrency issue (I think you may have noticed that some previous CI run timeouts because they failed to complete the test due to this) so that the PR will not be a good state that can be merged IMO.

For the test, you can change it to sequential execution for now. The test should validate that the exporter can serve both the formats. When working on concurrency issue, can add the parallel test back.

vishweshbankwar avatar Apr 23 '24 17:04 vishweshbankwar

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

github-actions[bot] avatar May 01 '24 03:05 github-actions[bot]

@hez2010 - Checking to see if you are planning to do the changes based on the discussion above?

vishweshbankwar avatar May 02 '24 18:05 vishweshbankwar

@hez2010 - Checking to see if you are planning to do the changes based on the discussion above?

Thanks for pinging!

Sorry for the delay. I have been busy recently so may not be able to apply the changes right now. While feel free to pick up the PR if it's urgent.

hez2010 avatar May 06 '24 07:05 hez2010

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

github-actions[bot] avatar May 14 '24 03:05 github-actions[bot]

I can probably pick this up tonight if others are busy.

robertcoltheart avatar May 15 '24 02:05 robertcoltheart

I've raised a PR for the first issue, which is splitting the buffers here: https://github.com/open-telemetry/opentelemetry-dotnet/pull/5623 It also optimizes the generation to only the type requested, instead of generating both open metrics and plain.

robertcoltheart avatar May 16 '24 11:05 robertcoltheart

closing in favor of #5623

vishweshbankwar avatar May 17 '24 00:05 vishweshbankwar