
Compatibility issue: File written with ROOT 6.32/02 cannot be read back with ROOT 6.10/06

Open wlampl opened this issue 1 year ago • 14 comments

Check duplicate issues.

  • [ ] Checked for duplicates

Description

While trying to update to LCG_106_ATLAS_3 (ROOT 6.32/02), we encountered a test failure. An intermediate file produced with this release could not be read back with an older release (6.10/06, 6.08.06): we encounter a segfault when the file is closed.

Background: ATLAS Trigger simulation of run 2 uses the release that was used for data-taking during run 2.

Reproducer

I copied the intermediate file and the reproducer script to /afs/cern.ch/work/w/wlampl/public/ATEAM-1001. The script is quite simple:

from ROOT import TFile
f=TFile.Open("tmp.RDO")
f.ls()
t=f.Get("CollectionTree")
n=t.GetEntries()
for i in range(n):
    s=t.GetEntry(i)
    print(s)
f.Close()

For ROOT versions back to about 6.16.00 it works as expected. Running with 6.08.06 and 6.10.06 (in a centos7 container), I encounter a segfault at the end. A log can be found in /afs/cern.ch/work/w/wlampl/public/ATEAM-1001/log.22.0.0

ROOT version

Writing: 6.32/02 Reading: 6.10/06 or 6.08.06

Installation method

SFT/LCG

Operating system

CentOS7

Additional context

No response

wlampl avatar Jul 02 '24 08:07 wlampl

Let me add a reproducer where you only need to open the file and try to exit:

% setupATLAS -c centos7 --pwd /afs/cern.ch/work/w/wlampl/public/ATEAM-1001
% asetup Athena,21.0,latest
% root -b tmp.RDO

| Welcome to ROOT 6.08/06 http://root.cern.ch |
Attaching file tmp.RDO as _file0...
Warning in TClass::Init: no dictionary for class ROOT::TIOFeatures is available
(TFile *) 0x29cf190
root [1] .q

*** Break *** segmentation violation
This is the entire stack trace of all threads:

#0 0x00007f6cdd6c560c in waitpid () from /lib64/libc.so.6
#1 0x00007f6cdd642f62 in do_system () from /lib64/libc.so.6
#2 0x00007f6cdecce102 in TUnixSystem::StackTrace() () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so
#3 0x00007f6cdecd061c in TUnixSystem::DispatchSignals(ESignals) () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so
#4
#5 0x0000000001209080 in ?? ()
#6 0x00007f6cdec52005 in TList::FindObject(TObject const*) const () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so
#7 0x00007f6cdec5237c in TList::Clear(char const*) () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so
#8 0x00007f6cdec50a01 in THashTable::Clear(char const*) () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so
#9 0x00007f6cdec504dd in THashList::Clear(char const*) () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so
#10 0x00007f6cdec9d1a7 in TListOfDataMembers::Unload() () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so
#11 0x00007f6cdec7f2d0 in TClass::SetUnloaded() () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so
#12 0x00007f6cdec4a574 in ROOT::RemoveClass(char const*) () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so
#13 0x00007f6cdec9926e in ROOT::TGenericClassInfo::~TGenericClassInfo() () from /cvmfs/atlas-nightlies.cern.ch/repo/sw/21.0_Athena_x86_64-centos7-gcc62-opt/sw/lcg/releases/ROOT/6.08.06-d7e12/x86_64-centos7-gcc62-opt/lib/libCore.so
#14 0x00007f6cdd639ce9 in __run_exit_handlers () from /lib64/libc.so.6

Nowakus avatar Jul 02 '24 09:07 Nowakus

Hi @martamaja10 ,

thanks for looking at this. We see you've assigned @dpiparo but we understand that he's away for a couple of weeks, and ideally we'd like this to be addressed sooner if possible. Is there someone else in the team who could look at this before?

The problem is, this issue prevents us from using LCG106 and so it holds up several developments.

Thanks!

James

jcatmore avatar Jul 02 '24 11:07 jcatmore

Hi @jcatmore,

sure, I'll find another person in the team to take a look at this ASAP.

Cheers, Marta

martamaja10 avatar Jul 02 '24 11:07 martamaja10

Most likely backporting this commit: https://github.com/root-project/root/commit/08b34d72a800bd48ea4655f17075de0ef3ca72cb will fix the problem.

pcanal avatar Jul 02 '24 11:07 pcanal

See https://github.com/root-project/root/pull/15968 and https://github.com/root-project/root/pull/15969

pcanal avatar Jul 02 '24 11:07 pcanal

This issue is most likely due to a change that inadvertently broke forward compatibility: https://github.com/root-project/root/issues/14793

You should have seen this already with 6.30 though. Is there an explanation why 6.30 did not trigger the error?

There are two ways to proceed (if the issue is what we think it is):

  • Backport the fix to 6.10 and 6.08 (as Philippe suggested/submitted)
  • Set the compatibility flag file->SetBit(TFile::k630forwardCompatibility) (see #15006) when you produce the file with 6.32.

The second option would be useful to run at least once to confirm that we identified the right cause.
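A minimal PyROOT sketch of the second option, assuming the writing release defines `TFile::k630forwardCompatibility`. The helper name `open_forward_compatible` and its `root` parameter are hypothetical and exist only for this illustration. Setting the bit right after opening, before anything is written, is an assumption rather than a documented requirement.

```python
def open_forward_compatible(path, mode="RECREATE", root=None):
    """Open a ROOT file for writing with the 6.30 forward-compatibility bit set.

    `root` lets a caller inject the ROOT module (or a stand-in for tests);
    by default PyROOT is imported lazily.
    """
    if root is None:
        import ROOT as root  # PyROOT from a release that defines the flag

    f = root.TFile.Open(path, mode)
    if not f or f.IsZombie():
        raise OSError(f"could not open {path!r} in mode {mode!r}")

    # Assumption: the bit is set before any object is written to the file.
    f.SetBit(root.TFile.k630forwardCompatibility)
    return f
```

A call such as `out = open_forward_compatible("out.root")` (file name hypothetical) would then replace a plain `TFile.Open` in the writing job.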

jblomer avatar Jul 02 '24 12:07 jblomer

Is there any drawback in doing SetBit(TFile::k630forwardCompatibility) for every file we produce now?

Nowakus avatar Jul 02 '24 12:07 Nowakus

> Is there any drawback in doing SetBit(TFile::k630forwardCompatibility) for every file we produce now?

The main drawback is forgetting to eventually remove it :). The technical drawback is slightly worse and unstable compression (see, for example, https://github.com/root-project/root/issues/12438).

pcanal avatar Jul 02 '24 12:07 pcanal

> You should have seen this already with 6.30 though. Is there an explanation why 6.30 did not trigger the error?

Just to comment on 6.30: we didn't look at this release apart from doing a compilation test, so the issue is most likely there as well, as you expect.

jcatmore avatar Jul 02 '24 13:07 jcatmore

Hi. I just wanted to understand whether the issue was investigated further on the ATLAS side.

dpiparo avatar Jul 13 '24 04:07 dpiparo

We have added a call to SetBit(TFile::k630forwardCompatibility) when writing files that will need to be read by old release branches as part of our standard workflows for earlier LHC runs. This allowed the jobs using older releases to run successfully. This is necessary as the ability to simulate our Trigger is tied to the releases that were being used for data-taking at that time. We would, of course, rather not have to do this.
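As a rough illustration of deciding when the bit is needed, the sketch below normalises the two version spellings used in this thread ("6.32/02" and "6.08.06") and compares the intended readers against a cut-off. The function names and the cut-off are assumptions: 6.16.00 comes only from the report that readers "back to about 6.16.00" work, and the exact boundary is not established here. The sketch also assumes the writer is 6.30 or newer and ignores patched builds of old releases.

```python
import re

# Assumption: based on the report that readers "back to about 6.16.00" work;
# the exact cut-off is not established in this thread.
OLDEST_UNAFFECTED_READER = (6, 16, 0)


def parse_root_version(text):
    """Parse '6.32/02' or '6.08.06' into a tuple such as (6, 32, 2)."""
    match = re.fullmatch(r"\s*(\d+)\.(\d+)[./](\d+)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised ROOT version: {text!r}")
    return tuple(int(part) for part in match.groups())


def needs_compat_bit(reader_versions, oldest_unaffected=OLDEST_UNAFFECTED_READER):
    """Return True if any intended reader is older than the assumed cut-off."""
    return any(parse_root_version(v) < oldest_unaffected for v in reader_versions)
```

For the readers named in this issue, `needs_compat_bit(["6.10/06", "6.08.06"])` evaluates to `True` under the assumed cut-off.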

jchapman-hep avatar Jul 15 '24 14:07 jchapman-hep

I am sorry ROOT did not work out of the box in this case. We are really working hard to provide not only backward but also forward compatibility. In this particular situation, it was not possible.

dpiparo avatar Jul 26 '24 09:07 dpiparo

Hi @dpiparo,

We understand why a fix on your side was not possible in this case, but can you confirm that the workaround of reading files in older releases (6.10/06, 6.08.06) will be part of your tests going forward please? ATLAS will need this feature to be supported for new ROOT versions until such time as we decide to change our support policy for legacy data. (This currently requires Trigger Simulation to be run in the data-taking release from the year in question.)

jchapman-hep avatar Aug 13 '24 14:08 jchapman-hep

On a side note, we back-ported the ability to read the files without the forward compatibility bit to the patch branch for v6.10 and v6.08.

pcanal avatar Aug 13 '24 19:08 pcanal

@dpiparo @pcanal This issue is Open but marked as "Fixed in not applicable". Should it be closed given that there is a workaround?

ferdymercury avatar Aug 12 '25 09:08 ferdymercury

@ferdymercury We still need to put in place a way to test forward compatibility more systematically.
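One possible building block for such a test, sketched here and not taken from ROOT's test suite: run a reader command from an old release as a subprocess and treat a signal, a non-zero exit code, or the "*** Break ***" marker seen in the log above as a failure. The function name `run_reader` and the timeout are hypothetical. Scanning the output is a precaution, on the assumption that a crash caught by ROOT's own signal handler may not always yield a non-zero exit code.

```python
import subprocess

# Marker printed by ROOT in the crash log quoted earlier in this thread.
CRASH_MARKER = "*** Break ***"


def run_reader(command, timeout=600):
    """Run a reader command and return (ok, detail).

    On POSIX, a negative return code means the process was killed by a signal.
    """
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    if proc.returncode < 0:
        return False, f"killed by signal {-proc.returncode}"
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}"
    if CRASH_MARKER in proc.stdout or CRASH_MARKER in proc.stderr:
        return False, "crash marker found in output"
    return True, "ok"
```

A forward-compatibility job could write a file with the new release and then call `run_reader` with the old release's read command, failing the test whenever the first element of the result is `False`.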

pcanal avatar Sep 10 '25 17:09 pcanal