opentelemetry-cpp icon indicating copy to clipboard operation
opentelemetry-cpp copied to clipboard

Avoid missing conditional variable update and simplify atomic bool

Open arekay opened this issue 1 year ago • 12 comments

Addresses two issues -

  1. Fix the use of a conditional variable where a wait on the variable might not be in flight when a notify is called. This is fixed by ensuring that an associated lock is aquired before calling the notify.
  2. Instead of relying on a lock an a boolean, replace the use wit a single atomic boolean.

Fixes # (issue)

Changes

  • Adds a scoped lock around notify call to conditional var in periodic exporter
  • Replaces a spinlock+bool with an atomic bool in metrics reader

For significant contributions please make sure you have completed the following items:

  • [ ] CHANGELOG.md updated for non-trivial changes
  • [ ] Unit tests have been added
  • [ ] Changes in public API reviewed

arekay avatar Feb 23 '24 23:02 arekay

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: esigo / name: Ehsan Saei (84a146361a0646f62ee7ba25a1963b7ade72d364)
  • :white_check_mark: login: arekay / name: Rashid Kaleem (5475943b8693059d2406f60270d88b6103caccb8, 549bc411fefb80b6552ef867d9fa470fb9f612db, 1c4b7f395f86b8d7448db79d641741710a146cb2, 2b357fbe027afb0c65a6ec8c938841a0f75e97f4, 41443bfb42e2ee9e3562d430caeb37c6053bf113, d80f3e2c88fff7d101430917d7a9a0a05f6e0bb2, b7b3675ead01f5adf66e1e95476bfe3b21b3c456, 32e470d5d2dfa98370f26d284ee89f7210d6e685, 950db861a8ca4eddc41b9efcd8807af5a088b9fa, f0d784fa0719df186ead10e1e81c5d3679f2d1f2, 8111dffa54d289a7b4e498e786ba8c7658c497f0)
  • :white_check_mark: login: marcalff / name: Marc Alff (808c3c4fb007adf6703dd9e5da99e141ad5f27eb, d7ab5af8946c8584319d438ff0701f8fb8a2883d, 21b04f47862a542eba181f0c77c34f921ae19b9a)

Thanks for the PR.

Not a full review, but this looks good based on a first read.

Please sign the EasyCLA to pass the CI checks.

marcalff avatar Feb 23 '24 23:02 marcalff

Is it possible to add a test case to reproduce the issue and verify the fix?

ThomsonTan avatar Mar 28 '24 05:03 ThomsonTan

Is it possible to add a test case to reproduce the issue and verify the fix?

Good point. But this problem is hard to reproduce. I have no idea about how to add test case by now. Maybe someone else can help?

owent avatar Mar 29 '24 15:03 owent

Is it possible to add a test case to reproduce the issue and verify the fix?

Good point. But this problem is hard to reproduce. I have no idea about how to add test case by now. Maybe someone else can help?

It is very unlikely a test case can be written for this.

From other projects using mutex and condition variables, testing requires some serious code injection in debug, to control exactly how threads executes, to block a thread at a given point. opentelemetry-cpp does not have the tooling for this.

marcalff avatar Apr 02 '24 20:04 marcalff

LGTM and approved, waiting for @owent to confirm since he had comments earlier.

marcalff avatar Apr 05 '24 10:04 marcalff

LGTM and approved, waiting for @owent to confirm since he had comments earlier.

Sorry, I'm on holiday these days.

#2584 may also fixes the problems this PR try to slove, and it also solve other deadlock problems in traces and metrics. And it has conflicts with this PR. Do you think we can just use #2584 and drop this one?

owent avatar Apr 06 '24 02:04 owent

@owent If #2584 is coming in soon, we can wait. Otherwise, we'd appreciate it if you can merge this and then follow up with #2584 since we have been using #2553 as a patch and it has fixed the frequent crashes we were seeing. This will give you more time to test #2584 as well.

arekay avatar Apr 09 '24 17:04 arekay

@owent If #2584 is coming in soon, we can wait. Otherwise, we'd appreciate it if you can merge this and then follow up with #2584 since we have been using #2553 as a patch and it has fixed the frequent crashes we were seeing. This will give you more time to test #2584 as well.

I have the last question about the deadlock problem above. Other codes looks good to me.

owent avatar Apr 10 '24 01:04 owent

@owent If #2584 is coming in soon, we can wait. Otherwise, we'd appreciate it if you can merge this and then follow up with #2584 since we have been using #2553 as a patch and it has fixed the frequent crashes we were seeing. This will give you more time to test #2584 as well.

I have the last question about the deadlock problem above. Other codes looks good to me.

I have not had time to look into the deadlock problem, but anecdotally, we haven't had any deadlocks and eliminated crashes due to OTEL in our deployments. I believe this would be beneficial to all, even if short-lived and updated by #2584 .

arekay avatar Apr 18 '24 22:04 arekay

I have not had time to look into the deadlock problem, but anecdotally, we haven't had any deadlocks and eliminated crashes due to OTEL in our deployments. I believe this would be beneficial to all, even if short-lived and updated by #2584 .

@owent Just checking if there is any update on this. Thanks.

arekay avatar Apr 30 '24 02:04 arekay

I have not had time to look into the deadlock problem, but anecdotally, we haven't had any deadlocks and eliminated crashes due to OTEL in our deployments. I believe this would be beneficial to all, even if short-lived and updated by #2584 .

@owent Just checking if there is any update on this. Thanks.

@marcalff could you please check the lock problem again when you have time, or could #2584 be merged now?

owent avatar Apr 30 '24 10:04 owent

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 87.57%. Comparing base (497eaf4) to head (808c3c4). Report is 58 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2553      +/-   ##
==========================================
+ Coverage   87.12%   87.57%   +0.46%     
==========================================
  Files         200      188      -12     
  Lines        6109     5848     -261     
==========================================
- Hits         5322     5121     -201     
+ Misses        787      727      -60     
Files Coverage Δ
.../include/opentelemetry/sdk/metrics/metric_reader.h 50.00% <ø> (ø)
...metrics/export/periodic_exporting_metric_reader.cc 79.79% <100.00%> (+0.44%) :arrow_up:
sdk/src/metrics/metric_reader.cc 77.42% <100.00%> (-1.36%) :arrow_down:

... and 55 files with indirect coverage changes

codecov[bot] avatar May 08 '24 10:05 codecov[bot]

@owent If #2584 is coming in soon, we can wait. Otherwise, we'd appreciate it if you can merge this and then follow up with #2584 since we have been using #2553 as a patch and it has fixed the frequent crashes we were seeing. This will give you more time to test #2584 as well.

I have the last question about the deadlock problem above. Other codes looks good to me.

I looked again in details about the last remaining concern from @owent about a possible deadlock.

After code analysis, I don't think the deadlock is possible, as the code uses a well known pattern of locking a mutex before signalling a condition variable, and the condition wait_for() internally releases the mutex when the wait starts, to re acquire the lock once the wait ends.

Now merging.

@arekay

Thanks for the fix, and also for your patience, sorry this review took such a long time.

marcalff avatar May 08 '24 12:05 marcalff

Thanks for the fix, and also for your patience, sorry this review took such a long time.

@marcalff @owent Thanks for all the feedback and diligence - and of course, thanks for all the work you folks put in to maintaining something all of us benefit tremendously from!

arekay avatar May 08 '24 15:05 arekay