eckit icon indicating copy to clipboard operation
eckit copied to clipboard

Make construction and access to of Log::debug<Lib> safe in multiple t…

Open simondsmart opened this issue 1 year ago • 2 comments

…hreads

simondsmart avatar Jan 24 '24 10:01 simondsmart

CLA assistant check
All committers have signed the CLA.

FussyDuck avatar Jan 24 '24 10:01 FussyDuck

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 62.11%. Comparing base (149b9f7) to head (1f072b3).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #104      +/-   ##
===========================================
- Coverage    62.11%   62.11%   -0.01%     
===========================================
  Files          901      901              
  Lines        49853    49852       -1     
  Branches      3749     3748       -1     
===========================================
- Hits         30968    30966       -2     
- Misses       18885    18886       +1     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Jan 24 '24 10:01 codecov-commenter

Multio has failing CI runs due SEGFAULTS we could trace back to this race condition (although it's not happening very often). However, as multio is downstream for a lot of packages it would be preferable to fix the root throug this PR instead of removing the test.

I will address the changes in my comment and then hope to be able to merge this branch

pgeier avatar Jun 10 '24 12:06 pgeier

Okay, we can some new SEGFAULTS in multio now. That's bad ... need to investigate. Downstream ci helped in this case

pgeier avatar Jun 10 '24 13:06 pgeier

Private downstream CI failed. Workflow name: private-downstream-ci-hpc View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/9478640308.

github-actions[bot] avatar Jun 12 '24 07:06 github-actions[bot]

This feels like it is ... overly complicated. And not totally clear that it will solve the issue (as opposed to likely solves the issue).

If the segfaults are constrained to the destructors of static objects can we not:

i) Enforce that any debug statements in those destructors use Log::debug() rather than Log::debug<LIB>() -- i.e. use the global object. This should make the construction/destruction order more straightforward to reason

ii) alternatively, in the same way that we have "if (!Main::ready())" in the debug() function, we could add a special post-main flag that gets called (presumably by the destructor of one one of these ThreadSingleton or Channel objects) which we use to activate a different output pathway.

simondsmart avatar Jun 12 '24 19:06 simondsmart

This feels like it is ... overly complicated. And not totally clear that it will solve the issue (as opposed to likely solves the issue).

Compared with the initial solution with thread local map with pointers to static library object, this approach is reversed and stores channels directly in the static library object and maps thread ids - up to this point I would not consider it as more complex. However making sure that exiting threads remove their entry in the maps is where dependency & life time problem of static / thread_local objects hook in... but to resolve these kind of problems, the safest option will always be making use of shared obects both statics can access safely.

If the segfaults are constrained to the destructors of static objects can we not:

i) Enforce that any debug statements in those destructors use Log::debug() rather than Log::debug() -- i.e. use the global object. This should make the construction/destruction order more straightforward to reason

This is a lot of effort and very error prone. It involves a lot of changes downstream, it's likely to forget something and we have no usage to check/enforce this "policy". IMO this will result in recurring occurrences of this phenomenon.

ii) alternatively, in the same way that we have "if (!Main::ready())" in the debug() function, we could add a special post-main flag that gets called (presumably by the destructor of one one of these ThreadSingleton or Channel objects) which we use to activate a different output pathway.

Also in this case, we would need to to "sync" between two static/thread_local obects of different lifetime. To do so we eventually need to make use of a shared_ptr again....

However, a valid option to bring in a flag may be to use a share_ptr and a weak_ptr on the other side ... locking the weak_ptr would be similar to a check ?

pgeier avatar Jun 13 '24 07:06 pgeier

Private downstream CI succeeded. Workflow name: private-downstream-ci View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/9499270776.

github-actions[bot] avatar Jun 13 '24 12:06 github-actions[bot]

Private downstream CI failed. Workflow name: private-downstream-ci-hpc View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/9499868549.

github-actions[bot] avatar Jun 13 '24 12:06 github-actions[bot]

This causes crash when Log::debug() called in a class dtor PANIC: ::pthread_setspecific(key_, value) in (src/eckit/thread/ThreadSingleton.h +97 instance)

mcakircali avatar Jun 14 '24 08:06 mcakircali

Private downstream CI succeeded. Workflow name: private-downstream-ci View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/9517498284.

github-actions[bot] avatar Jun 14 '24 14:06 github-actions[bot]

Private downstream CI succeeded. Workflow name: private-downstream-ci View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/9517498284.

github-actions[bot] avatar Jun 14 '24 14:06 github-actions[bot]