root
root copied to clipboard
Fix race condition when loading dictionary shared libraries
Earlier, ATLAS was seeing a nasty race condition involving ROOT dictionary information (see ATR-25049).
The ROOT internal class TClassTable records information about all classes that could potentially be created. This is effectively a singleton, but TClassTable itself does no locking. Rather, it depends on callers already having acquired the root internal mutex. When a shared library is loaded that contains dictionary information, TClassTable gets calls to register information for classes defined in that file. However, the ROOT lock is not acquired in that case. So a shared library load could race with TClass::GetClass and result in corruption of TClassTable.
This change modifies TGenericClassInfo::Init so that we will take the lock when a shared library is loaded.
(ATLAS is currently working around this by hacking shared library loading in order to acquire the lock, but this hack is not possible with newer versions of glibc, so this will eventually become a blocker for moving to centos9.)
Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu18.04/nortcxxmod, ROOT-ubuntu2004/python3, mac1015/cxx17, mac11/cxx14, windows10/cxx14
How to customize builds
One concern is a possible dead-lock between the ROOT lock and the dlopen lock .... We may need a more fine grain solution.
Thanks Philippe.
After further consideration, i think you're right.
I had indeed been concerned about a possible deadlock, but i had thought that it was probably ok because a similar change was working fine in production. But (a), i had forgotten about the dlopen lock and (b), they weren't exactly the same change. As i alluded to above, we didn't want to have a custom patch to ROOT, so for production we address this race by wrapping dlopen. But in that case, the core lock is acquired before the call to dlopen, not after.
So, what is to be done?
There is probably only one code path in ATLAS that was giving us problems. The Gaudi plugin manager was being used to load one of the POOL libraries that happened to include ROOT dictionary information. I could acquire the ROOT core lock around this (and one or two other places where i known that Gaudi components may be loaded after event processing starts). However, there are many other places where shared libraries could potentially be loaded. While those probably don't matter... the original bug here was quite opaque, taking several months to completely track down. So i'm loathe to leave open any possibility that it could bite us again.
We could add locking within TClassTable. This is perhaps an attractive way forward, as it should make it much easier to ensure that usages of TClassTable are correct. A drawback is that is hard to then ensure that we don't have something else that is also supposed to be protected by the core mutex. Nevertheless, i could try to prepare a change to do this if it sounds good.
However, that's not the end of the problems.
If this analysis is correct, then we already have potential deadlock problems, as there is nothing to prevent the initialization code of a shared library (which can be arbitrary used code) from calling into ROOT in a way that would acquire the core mutex. So code which would look innocent to the user would have a chance of deadlocking. I'm not sure that there's any good way of addressing this short of having ROOT release the core mutex before calling dlopen. But this would probably require introducing another mutex for use in at least TClass::GetClass to serialize dictionary loading.
Will change this to a draft for the time being.
If this analysis is correct, then we already have potential deadlock problems, ...
Indeed but the only we can really affect is the code that is intentionally (and necessarily) called during library loading. The rest (for better and worse) has to be left to the user (to avoid).
I'm not sure that there's any good way of addressing this short of having ROOT release the core mutex before calling dlopen.
I don't think so either. The only way where we could have support this would be we could (and of course we can not :) ) enforce that all dlopen have to go through the ROOT wrapper for it.
We could add locking within TClassTable.
So I think this is the only choice ....
I am able to reproduce the failure with a simple standalone reproducer (2 threads one loading and unloading a library, the other interogating TClassTable). I will prepare a PR with the fine grained locking (or whatever is needed to make the crash go away :) )
For the record Helgrind does report the issue we have been discussed when run on my reproducer (i.e. this is a praise for helgrind :) ).
==1413== Possible data race during write of size 8 at 0x56DD6F0 by thread #2
==1413== Locks held: 1, at address 0x403AA48
==1413== at 0x4CEDDBD: TClassTable::Remove(char const*) (TClassTable.cxx:500)
==1413== by 0x4CEEDD3: ROOT::RemoveClass(char const*) (TClassTable.cxx:859)
==1413== by 0x4882C25: ROOT::Internal::TDefaultInitBehavior::Unregister(char const*) const (Rtypes.h:172)
==1413== by 0x4D8180D: ROOT::TGenericClassInfo::~TGenericClassInfo() (TGenericClassInfo.cxx:221)
==1413== by 0x51F0A55: __cxa_finalize (cxa_finalize.c:83)
==1413== by 0x14FA76D6: ??? (in /home/pcanal/root_working/test/2023-thread/tclasstable/libUser.so)
==1413== by 0x40021A1: call_destructors (dl-close.c:129)
==1413== by 0x531FC84: _dl_catch_exception (dl-error-skeleton.c:182)
==1413== by 0x4002635: _dl_close_worker.part.0.isra.0 (dl-close.c:292)
==1413== by 0x40032A1: _dl_close_worker (dl-close.c:150)
==1413== by 0x40032A1: _dl_close (dl-close.c:818)
==1413== by 0x531FC27: _dl_catch_exception (dl-error-skeleton.c:208)
==1413== by 0x531FCF2: _dl_catch_error (dl-error-skeleton.c:227)
==1413==
==1413== This conflicts with a previous read of size 8 by thread #3
==1413== Locks held: none
==1413== at 0x4CEDF22: TClassTable::FindElementImpl(char const*, bool) (TClassTable.cxx:521)
==1413== by 0x4CEE388: TClassTable::GetDictNorm(char const*) (TClassTable.cxx:618)
==1413== by 0x12FA9DE1: ??? (in /home/pcanal/root_working/test/2023-thread/tclasstable/dlopenrace_C.so)
==1413== by 0x12FAAD97: ??? (in /home/pcanal/root_working/test/2023-thread/tclasstable/dlopenrace_C.so)
==1413== by 0x12FAAD0C: ??? (in /home/pcanal/root_working/test/2023-thread/tclasstable/dlopenrace_C.so)
==1413== by 0x12FAAC6C: ??? (in /home/pcanal/root_working/test/2023-thread/tclasstable/dlopenrace_C.so)
==1413== by 0x12FAAC21: ??? (in /home/pcanal/root_working/test/2023-thread/tclasstable/dlopenrace_C.so)
==1413== by 0x12FAAC01: ??? (in /home/pcanal/root_working/test/2023-thread/tclasstable/dlopenrace_C.so)
==1413== Address 0x56dd6f0 is 7,312 bytes inside a block of size 8,072 alloc'd
==1413== at 0x484C373: operator new[](unsigned long) (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==1413== by 0x4CECE13: TClassTable::TClassTable() (TClassTable.cxx:223)
==1413== by 0x4CEF6F2: TClassTable::CheckClassTableInit() (TClassTable.cxx:266)
==1413== by 0x4CEE015: TClassTable::FindElement(char const*, bool) (TClassTable.cxx:541)
==1413== by 0x4CEEBF3: ROOT::ResetClassVersion(TClass*, char const*, short) (TClassTable.cxx:813)
==1413== by 0x4D82480: ROOT::TGenericClassInfo::SetVersion(short) (TGenericClassInfo.cxx:419)
==1413== by 0x4C22B41: __static_initialization_and_destruction_0(int, int) (String.cxx:39)
==1413== by 0x4C22B62: _GLOBAL__sub_I_String.cxx (String.cxx:39)
==1413== by 0x400647D: call_init.part.0 (dl-init.c:70)
==1413== by 0x4006567: call_init (dl-init.c:33)
==1413== by 0x4006567: _dl_init (dl-init.c:117)
==1413== by 0x40202E9: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
==1413== by 0x4: ???
==1413== Block was alloc'd by thread #1
I am closing this PR. The resolution can be followed through the issue #12552. A new PR implementing the fine grained locking in TClassTable will be uploaded shortly.
Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu18.04/nortcxxmod, ROOT-ubuntu2004/python3, mac12/noimt, mac11/cxx14, windows10/cxx14
How to customize builds
Build failed on mac12/noimt. Running on macphsft17.dyndns.cern.ch:/Users/sftnight/build/jenkins/workspace/root-pullrequests-build See console output.
Warnings:
- [2023-05-04T16:30:20.780Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/clib/src/mmemalign.c:37:13: warning: performing pointer subtraction with a null pointer has undefined behavior [-Wnull-pointer-subtraction]
- [2023-05-04T16:30:20.780Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/clib/src/mfree.c:194:13: warning: performing pointer subtraction with a null pointer has undefined behavior [-Wnull-pointer-subtraction]
- [2023-05-04T16:30:20.780Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/clib/src/mmalloc.c:51:9: warning: performing pointer subtraction with a null pointer has undefined behavior [-Wnull-pointer-subtraction]
- [2023-05-04T16:30:20.780Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/clib/src/mmalloc.c:200:13: warning: performing pointer subtraction with a null pointer has undefined behavior [-Wnull-pointer-subtraction]
- [2023-05-04T16:30:38.652Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/base/src/TDirectory.cxx:1292:7: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only. Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead. [-Wdeprecated-declarations]
- [2023-05-04T16:30:43.474Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/base/src/TTimeStamp.cxx:289:7: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only. Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead. [-Wdeprecated-declarations]
- [2023-05-04T16:30:43.474Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/base/src/TTimeStamp.cxx:335:4: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only. Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead. [-Wdeprecated-declarations]
- [2023-05-04T16:30:43.731Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/base/src/TUri.cxx:827:10: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only. Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead. [-Wdeprecated-declarations]
- [2023-05-04T16:30:43.993Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/base/src/TUrl.cxx:199:10: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only. Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead. [-Wdeprecated-declarations]
- [2023-05-04T16:30:43.993Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/base/src/TUrl.cxx:449:10: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only. Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead. [-Wdeprecated-declarations]
And 71 more
Failing tests:
- projectroot.roottest.cling.const.roottest_cling_const_run2
- projectroot.roottest.cling.dict.roottest_cling_dict_rundefaultargs_compiled
- projectroot.roottest.cling.dict.roottest_cling_dict_exectemplatetemplateTest
- projectroot.roottest.cling.dict.roottest_cling_dict_runalgebra
- projectroot.roottest.cling.functionTemplate.roottest_cling_functionTemplate_testcint
- projectroot.roottest.cling.parsing.roottest_cling_parsing_ptrconst
- projectroot.roottest.cling.template.roottest_cling_template_defaultTemplateParam
- projectroot.roottest.cling.typedef.roottest_cling_typedef_assertFuncArray
- projectroot.roottest.python.JupyROOT.roottest_python_JupyROOT_simpleCppMagic_notebook
- projectroot.roottest.python.JupyROOT.roottest_python_JupyROOT_thread_local_notebook
And 53 more