opentelemetry-dotnet
opentelemetry-dotnet copied to clipboard
[prometheus] Fixed a race condition happening in simultaneous OpenMetrics and PlainText requests
Fixes #5515
Changes
Fixed a race condition which can cause returning an incorrect response while OpenMetrics requests and PlainText requests are being processed simultaneously.
Merge requirement checklist
- [x] CONTRIBUTING guidelines followed (license requirements, nullable enabled, static analysis, etc.)
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: hez2010 / name: Steve (7df9d9c742bf4fd0f3620692375ba31f8ecc8431, 224d1bdeace627d7984cb6c4feb70335d4b389d4, 1c843d0d8e1938f8c17136d48fcc2b3d877a7ec6, 2804a7dcc765dc849c0a7a5f447fc026a2f79878, 00149ef38cb0cb2a561cf7eef01742fcaab4ea9e, 0e2551a604a97cdf429a8d7257902d7bafb338e5, c1af8238cd65c4ec7e0eb13c13f1510a237c3fc8, fd01eb269a85d0494d440f876863bbfca1776a6b, 19ae57095501b0b8b4675743dd889acdd24ef8e6, c2e10df9f61d117a09f4732facd05992425a698f, 286d23d74a7efec90f6ed9d046d30a14bb6cfcf2, 3ee60675d2e01d94c056514bd728877aacaaf564, cb957662cb484beb4a4a598bce6f462196488803, 0269b936971cd724abe406dc035a91c352f1ad14, 45dadb239baa60a01896ed024c3706b5fc0c94ea, 580257e696bd30a9f1ba5cc1d878fffb278a6c33, 677b121316690a76caf396fe40bf23a0efa61196, 7ee37127193eb42f30a804ead21a2e6d66ef3a8b, 63f32d66e5db16063589d35d2d687255baaf13bf, 3e003f8144d556474d05d52fe7fe2997baa8e88f, fb54fc8f0807a1daf467b8d01791ff5500bce074, d930f9773d0c68ca805acfb3cd93f15285c93ef7, 3558e9f8b1dab0e09477fb999bae4db8d3fa542a, aaf7e4396107ffa9be585ac7f91aae3f2e0cf80e, 6e5549fcb49ad8d512e7c53632811546f047567e, aed502c389e712f21cf8a7577e9b68618c80137c)
- :white_check_mark: login: CodeBlanch / name: Mikel Blanchard (017dd39e31a3a4176950f0ba668690ea72ba5d6f, 7b3b31e46d46ad6b5237a249dff1e763565fb50c)
- :white_check_mark: login: vishweshbankwar / name: Vishwesh Bankwar (033148cbbdc5c021fe8c98deddeb19c2b532a62d, 90c348c8bf0944228eee685b5aa304b79fe45048)
- :white_check_mark: login: reyang / name: Reiley Yang (665e39076a3abbd497f8deb26f3bdf9845a965da)
Codecov Report
Attention: Patch coverage is 71.64179% with 19 lines in your changes are missing coverage. Please review.
Project coverage is 85.49%. Comparing base (
6250307) to head (665e390). Report is 222 commits behind head on main.
:exclamation: Current head 665e390 differs from pull request most recent head aed502c. Consider uploading reports for the commit aed502c to get more accurate results
Additional details and impacted files
@@ Coverage Diff @@
## main #5517 +/- ##
==========================================
+ Coverage 83.38% 85.49% +2.11%
==========================================
Files 297 289 -8
Lines 12531 12515 -16
==========================================
+ Hits 10449 10700 +251
+ Misses 2082 1815 -267
| Flag | Coverage Δ | |
|---|---|---|
| unittests | ? |
|
| unittests-Solution-Experimental | 85.25% <71.64%> (?) |
|
| unittests-Solution-Stable | 85.45% <71.64%> (?) |
|
| unittests-Unstable-Core | 20.01% <71.64%> (?) |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Files | Coverage Δ | |
|---|---|---|
| ...metheus.AspNetCore/PrometheusExporterMiddleware.cs | 70.00% <100.00%> (ø) |
|
| ....Prometheus.HttpListener/PrometheusHttpListener.cs | 81.92% <100.00%> (+0.22%) |
:arrow_up: |
| ...tpListener/Internal/PrometheusCollectionManager.cs | 75.67% <67.79%> (-2.11%) |
:arrow_down: |
The race condition is still present and need more work. I'm converting it to draft for now.
Thank you for looking into this @hez2010! Adding @robertcoltheart as FYI.
I think we need some architecture change here. The original design caches the response with the assumption that the same response payload can be used for all scrapers. With the ability to support both Prometheus (text/plain; charset=utf-8; version=0.0.4) and OpenMetrics (application/openmetrics-text; version=1.0.0; charset=utf-8) format, I guess we need to have separate cache for each potential response body, or we don't cache at all.
Now it's ready for review. /cc: @reyang @robertcoltheart @vishweshbankwar @CodeBlanch
Do we know when this will be released?
@dutts We discussed your PR on our SIG meeting yesterday. IIRC @vishweshbankwar had some feedback he was going to provide. I'll ping him right now as a reminder.
edit: Sorry, @hez2010's PR 😄
@dutts We discussed your PR
I'm confused - what is "your PR"? 👀
@dutts We discussed your PR
I'm confused - what is "your PR"? 👀
Hah yes sorry not my PR, I just commented as an interested party. Stumbled across this issue in my deployment, lost half a day to tracking it down to the intermittent corrupt output having Prometheus over and then eventually found this issue.
@hez2010 Sorry it looks like your TODO introduced another lint warning. Would you fix that please and then I'll merge?
Done.
I pushed a new commit to simplify the code, and resolved another race condition due to the bad exchange of this.collectionTcs.
The test PrometheusExporterMiddlewareIntegration_MultipleRequestsOfDifferentFormatsInParallel now passed with even 100,000,000 requests on my machine.
I pushed a new commit to simplify the code, and resolved another race condition due to the bad exchange of
this.collectionTcs. The testPrometheusExporterMiddlewareIntegration_MultipleRequestsOfDifferentFormatsInParallelnow passed with even 100,000,000 requests on my machine.
@hez2010 - Could you elaborate what was the race condition?
@hez2010 - Could you elaborate what was the race condition?
We are neither using lock (obj) nor inserting a memory barrier around this.collectionTcs = ..., so it can sometimes result in accessing an invalid collectionTcs by another thread due to memory order issues.
Here I changed it to use CompareExchange instead.
You can observe the issue by running the test I added (and better to increase the request times). Without the change, the test sometimes never finishes.
@hez2010 - Could you elaborate what was the race condition?
We are neither using
lock (obj)nor inserting a memory barrier aroundthis.collectionTcs = ..., so it can sometimes result in accessing an invalid collectionTcs by another thread due to memory order issues. Here I changed it to useCompareExchangeinstead.You can observe the issue by running the test I added (and better to increase the request times). Without the change, the test sometimes never finishes.
This issue would surface even without any of these changes? Meaning, if we just run the test with single value passed here
Seems that I accidentally broke the cache. Working on a fix.
It turns out to be an invalid assert in the test.
This issue would surface even without any of these changes? Meaning, if we just run the test with single value passed here
Yes. My newly added test sometimes never finishes against the existing code as well.
All things should work expected now. PTAL.
This issue would surface even without any of these changes? Meaning, if we just run the test with single value passed here
Yes. My newly added test sometimes never finishes against the existing code as well.
Thanks for confirming @hez2010. Looks like this PR is now trying to solve two issues.
- Issue of supporting two formats simultaneously.
- Concurrency issue.
Here is what I think we should do.
-
Keep this PR simple and solve the issue of not supporting two formats simultaneously. The original change you had by simply adding two buffers.
-
Propose a separate change for fixing concurrency issue. A detailed issue with a simple test first would be helpful.
I can separate this PR into two parts as you mentioned, but the tests will gonna fail without the fix for the concurrency issue (I think you may have noticed that some previous CI run timeouts because they failed to complete the test due to this) so that the PR will not be a good state that can be merged IMO.
I can separate this PR into two parts as you mentioned, but the tests will gonna fail without the fix for the concurrency issue (I think you may have noticed that some previous CI run timeouts because they failed to complete the test due to this) so that the PR will not be a good state that can be merged IMO.
For the test, you can change it to sequential execution for now. The test should validate that the exporter can serve both the formats. When working on concurrency issue, can add the parallel test back.
This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.
@hez2010 - Checking to see if you are planning to do the changes based on the discussion above?
@hez2010 - Checking to see if you are planning to do the changes based on the discussion above?
Thanks for pinging!
Sorry for the delay. I have been busy recently so may not be able to apply the changes right now. While feel free to pick up the PR if it's urgent.
This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.
I can probably pick this up tonight if others are busy.
I've raised a PR for the first issue, which is splitting the buffers here: https://github.com/open-telemetry/opentelemetry-dotnet/pull/5623 It also optimizes the generation to only the type requested, instead of generating both open metrics and plain.
closing in favor of #5623