Random segmentation violation at tear down of FairRoot

Open BSali opened this issue 1 year ago • 5 comments

Describe the bug Sometimes FairRoot ends with a segmentation violation during the "tear down" phase (that is, after the FairRunAna instance has been destroyed). However, this segmentation violation only occurs randomly (roughly 1 out of 6-8 runs). I have only seen this happen with FairSoft apr22_patches and FairRoot v18.6_patches, using T. Stockmanns' Docker image. It is really scary when a significant number of the PandaROOT tests "suddenly" fail (always different tests), and sometimes none do ^^

To Reproduce Steps to reproduce the behavior:

  1. Use the Docker image with FairSoft apr22_patches and FairRoot v18.6_patches, available at: https://hub.docker.com/repository/docker/tstockmanns/fairroot_v18_6_fairsoft_apr22_ubuntu_22
  2. Start a Container: docker run -it tstockmanns/fairroot_v18_6_fairsoft_apr22_ubuntu_22
  3. Load the FairRoot Env: source /mnt/work/FairRoot-Install/bin/FairRootConfig.sh -a
  4. Create a minimum example script to reproduce the issue:
echo 'void TestBugScript()
{
    FairLogger::GetLogger()->SetLogScreenLevel("debug");
    FairLogger::GetLogger()->SetColoredLog(true);
    std::unique_ptr<FairRunAna> fRun{new FairRunAna()};
    TString testname = "TestBugScript";
    FairRootFileSink *dummySink = new FairRootFileSink(testname + ".root");
    fRun->SetSink(dummySink);
    fRun->Init(); 
    fRun->TerminateRun();
}'>TestBugScript.C

Important: The problem only occurs if FairRunAna::Init() was called.

  5. Run the script several times (as the issue only occurs randomly): root -l -q TestBugScript.C

Expected behavior I would expect the script to either never crash (preferred scenario) or to always crash (I would at least understand that :)

Logs / Screenshots The segmentation violation is

[DEBUG] Enter Destructor of FairRun
[DEBUG] Leave Destructor of FairRun
[DEBUG] Enter Destructor of FairRootManager
[DEBUG] Leave Destructor of FairRootManager
[DEBUG] FairRootManager::~FairRootManager: going to lock 0 0x55abd0d802b0
[DEBUG] Released lock and done FairRootManager::~FairRootManager in 0 0x55abd0d802b0
[DEBUG] removed RTDB container factory FairBaseContFact

 *** Break *** segmentation violation
 Generating stack trace...
 0x00007fbf3e1d8f55 in TList::~TList() at /usr/include/c++/11/bits/shared_ptr_base.h:797 from /mnt/work/FairSoft-Install//lib/libCore.so.6.26
 0x00007fbf3e1d90fd in TList::~TList() at /mnt/work/FairSoft-Build/Source/root/core/cont/src/TList.cxx:95 from /mnt/work/FairSoft-Install//lib/libCore.so.6.26
 0x00007fbf3e11da09 in TROOT::~TROOT() at /mnt/work/FairSoft-Build/Source/root/core/base/src/TROOT.cxx:883 (discriminator 3) from /mnt/work/FairSoft-Install//lib/libCore.so.6.26
 0x00007fbf3db3f495 in <unknown> from /lib/x86_64-linux-gnu/libc.so.6
 0x00007fbf3db3f610 in on_exit + 0x0 from /lib/x86_64-linux-gnu/libc.so.6
 0x00007fbf3e27155e in TUnixSystem::Exit(int, bool) at /mnt/work/FairSoft-Build/Source/root/core/unix/src/TUnixSystem.cxx:2147 from /mnt/work/FairSoft-Install//lib/libCore.so.6.26
 0x00007fbf3e12c49f in TApplication::Terminate(int) at /mnt/work/FairSoft-Build/Source/root/core/base/src/TApplication.cxx:1679 from /mnt/work/FairSoft-Install//lib/libCore.so.6.26
 0x00007fbf3e50bbf6 in TRint::Run(bool) at /mnt/work/FairSoft-Build/Source/root/core/rint/src/TRint.cxx:488 from /mnt/work/FairSoft-Install//lib/libRint.so.6.26
 0x000055abcdeee2f3 in main + 0x53 from /mnt/work/FairSoft-Install/bin/root.exe
 0x00007fbf3db23d90 in <unknown> from /lib/x86_64-linux-gnu/libc.so.6
 0x00007fbf3db23e40 in __libc_start_main + 0x80 from /lib/x86_64-linux-gnu/libc.so.6
 0x000055abcdeee345 in _start + 0x25 from /mnt/work/FairSoft-Install/bin/root.exe

You will also randomly encounter ROOT logs as described in Issue #1108. I believe that the error is again related to the deletion of the fBrowsable TList in the global TROOT instance gROOT. However, I have not found the time to experiment and check whether anything changes if, for example, I remove the line gROOT->GetListOfBrowsables()->Add(fTask); from FairRunAna::Init().

I also could not check whether this is specific to the environment/version setup. I can only say that I have not encountered this issue on my Debian 10 system with FairSoft apr21_patches, FairRoot v18.6.6, g++ (Debian 8.3.0-6) 8.3.0 and cmake version 3.18.4.

System information (please complete the following information): The Docker-Image uses:

  • OS: Ubuntu 22.04
  • Compiler: g++ (Ubuntu 11.2.0-19ubuntu1) 11.2.0, cmake version 3.22.1
  • Environment: FairSoft apr22_patches, FairRoot v18.6_patches

BSali avatar Sep 08 '22 10:09 BSali

Probably related: I found and temp-fixed something similar in Cbmroot recently, after one of our post-docs tried to improve our CI by introducing a segmentation fault detector and got confused by the CDASH logs.

It was linked to a double release of the FairTaskList entry in one of ROOT's global lists, the List of Browsables (as remarked above by @BSali): it is released first by FairRoot and then by ROOT. I think this happens because the entry is registered by FairRunAna but not removed after local cleanup, while recent versions of ROOT clear all pointers left in their global lists at session exit. Unfortunately, ROOT's recent efforts to do better cleanup at session exit hurt us here.


Luckily for our side the crash happened in a class where we encapsulate the FairRun pointer, so our temp-patch was only to introduce the following 2 lines before deleting the FairRunAna pointer explicitly:

  TList* badlist = gROOT->GetListOfBrowsables();
  badlist->Remove(badlist->FindObject("FairTaskList"));

=> Adding these two lines at the end of the test script in the initial post of the issue removes the crash on my computer (OS Ubuntu 20.04.5, Fairsoft apr21p2, Fairroot v18.6.7)
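
The suspected mechanism and the workaround can be illustrated with a small standalone toy model. This is only a sketch under stated assumptions: ToyObject, ToyRegistry and StaleAfterTeardown are hypothetical names invented for illustration, not ROOT or FairRoot classes, and std::weak_ptr stands in for the raw pointers held by the real list so that a stale entry can be detected safely instead of crashing. The idea it shows is that an object registered in a global, non-owning list leaves a stale entry behind if its owner destroys it without unregistering it first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a registered object such as the task list.
struct ToyObject {
    explicit ToyObject(std::string n) : name(std::move(n)) {}
    std::string name;
};

// Hypothetical stand-in for a global list that does not own its entries.
// A weak_ptr makes a stale entry observable; a raw pointer would dangle.
class ToyRegistry {
public:
    void Add(const std::shared_ptr<ToyObject>& obj) {
        assert(obj != nullptr);
        entries_.push_back(obj);
    }

    // Removes all live entries with the given name; returns how many were removed.
    std::size_t RemoveByName(const std::string& name) {
        const std::size_t before = entries_.size();
        entries_.erase(std::remove_if(entries_.begin(), entries_.end(),
                                      [&name](const std::weak_ptr<ToyObject>& w) {
                                          const auto p = w.lock();
                                          return p && p->name == name;
                                      }),
                       entries_.end());
        return before - entries_.size();
    }

    // Entries whose object is already gone: what a cleanup at exit would trip over.
    std::size_t StaleEntries() const {
        return static_cast<std::size_t>(
            std::count_if(entries_.begin(), entries_.end(),
                          [](const std::weak_ptr<ToyObject>& w) { return w.expired(); }));
    }

private:
    std::vector<std::weak_ptr<ToyObject>> entries_;
};

// Registers an object, destroys it, and reports how many stale entries remain.
std::size_t StaleAfterTeardown(bool unregister_first) {
    ToyRegistry registry;
    auto task = std::make_shared<ToyObject>("FairTaskList");
    registry.Add(task);
    if (unregister_first) {
        registry.RemoveByName("FairTaskList");  // analogue of the two-line workaround
    }
    task.reset();  // the owner destroys the object
    return registry.StaleEntries();
}
```

In this model, StaleAfterTeardown(false) leaves one stale entry behind, while StaleAfterTeardown(true) leaves none, which mirrors why removing the entry before deleting the run object avoids the second release.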


I forgot a bit how I reached this conclusion, but at the time (6 weeks ago) I wrote

I suspect this is due to the explicit addition at line 195 of base/steer/FairRunAna.cxx in the FairRoot sources, which seems to be needed only for fairtools/FairMonitor.cxx


For more reference on what we did on our side:

  • Problem description + some thoughts: https://redmine.cbm.gsi.de/issues/2573
  • Cbmroot temp patch: https://git.cbm.gsi.de/computing/cbmroot/-/merge_requests/911

PALoizeau avatar Sep 13 '22 10:09 PALoizeau

Brilliant! Thank you very much for that temp-fix! This does remove the observed crashes!

BSali avatar Sep 13 '22 11:09 BSali

I think this might be a duplicate of #1108?

I added a note there.

ChristianTackeGSI avatar Oct 07 '22 12:10 ChristianTackeGSI

Looks indeed like two consequences of the same problem.

I will try in the coming days to cherry-pick the commit from the other issue into a local version of our default Fairroot v18.6.7 + Cbmroot master and see whether, with it, I can get our CI through without the Cbmroot temp-fix. I will probably report the result in the other issue to keep things in a single place.

PALoizeau avatar Oct 17 '22 09:10 PALoizeau

We fixed this in dev: ac9dba598eb911b7bc344b16a0231dd7266990bd
It was backported to 18.8: f9f1648108e58ca998cd0b8a0d67601440934ad0
And to 18.6_patches: f6928ee732e1a52527ca30ea4f442c44699c5523

Does this all fix this issue?

ChristianTackeGSI avatar Jan 23 '23 12:01 ChristianTackeGSI