eckit
eckit copied to clipboard
Make construction and access to of Log::debug<Lib> safe in multiple t…
…hreads
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 62.11%. Comparing base (
149b9f7) to head (1f072b3).
Additional details and impacted files
@@ Coverage Diff @@
## develop #104 +/- ##
===========================================
- Coverage 62.11% 62.11% -0.01%
===========================================
Files 901 901
Lines 49853 49852 -1
Branches 3749 3748 -1
===========================================
- Hits 30968 30966 -2
- Misses 18885 18886 +1
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Multio has failing CI runs due SEGFAULTS we could trace back to this race condition (although it's not happening very often). However, as multio is downstream for a lot of packages it would be preferable to fix the root throug this PR instead of removing the test.
I will address the changes in my comment and then hope to be able to merge this branch
Okay, we can some new SEGFAULTS in multio now. That's bad ... need to investigate. Downstream ci helped in this case
Private downstream CI failed. Workflow name: private-downstream-ci-hpc View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/9478640308.
This feels like it is ... overly complicated. And not totally clear that it will solve the issue (as opposed to likely solves the issue).
If the segfaults are constrained to the destructors of static objects can we not:
i) Enforce that any debug statements in those destructors use Log::debug() rather than Log::debug<LIB>() -- i.e. use the global object. This should make the construction/destruction order more straightforward to reason
ii) alternatively, in the same way that we have "if (!Main::ready())" in the debug() function, we could add a special post-main flag that gets called (presumably by the destructor of one one of these ThreadSingleton or Channel objects) which we use to activate a different output pathway.
This feels like it is ... overly complicated. And not totally clear that it will solve the issue (as opposed to likely solves the issue).
Compared with the initial solution with thread local map with pointers to static library object, this approach is reversed and stores channels directly in the static library object and maps thread ids - up to this point I would not consider it as more complex. However making sure that exiting threads remove their entry in the maps is where dependency & life time problem of static / thread_local objects hook in... but to resolve these kind of problems, the safest option will always be making use of shared obects both statics can access safely.
If the segfaults are constrained to the destructors of static objects can we not:
i) Enforce that any debug statements in those destructors use Log::debug() rather than Log::debug() -- i.e. use the global object. This should make the construction/destruction order more straightforward to reason
This is a lot of effort and very error prone. It involves a lot of changes downstream, it's likely to forget something and we have no usage to check/enforce this "policy". IMO this will result in recurring occurrences of this phenomenon.
ii) alternatively, in the same way that we have "if (!Main::ready())" in the debug() function, we could add a special post-main flag that gets called (presumably by the destructor of one one of these ThreadSingleton or Channel objects) which we use to activate a different output pathway.
Also in this case, we would need to to "sync" between two static/thread_local obects of different lifetime. To do so we eventually need to make use of a shared_ptr again....
However, a valid option to bring in a flag may be to use a share_ptr and a weak_ptr on the other side ... locking the weak_ptr would be similar to a check ?
Private downstream CI succeeded. Workflow name: private-downstream-ci View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/9499270776.
Private downstream CI failed. Workflow name: private-downstream-ci-hpc View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/9499868549.
This causes crash when Log::debug() called in a class dtor PANIC: ::pthread_setspecific(key_, value) in (src/eckit/thread/ThreadSingleton.h +97 instance)
Private downstream CI succeeded. Workflow name: private-downstream-ci View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/9517498284.
Private downstream CI succeeded. Workflow name: private-downstream-ci View the logs at https://github.com/ecmwf-actions/private-downstream-ci/actions/runs/9517498284.