root icon indicating copy to clipboard operation
root copied to clipboard

Invalid RDataFrames returned from functions in pyROOT when constructing from a TChain due to garbage collection

Open KeanuGh opened this issue 3 years ago • 5 comments

Describe the bug

RDataFrame constructed within a function with a TChain becomes inaccessible when returned, unless TChain is returned with it (segfault occurs).

Expected behavior

RDataFrame constructed with a TChain should be able to be returned from a function

To Reproduce

import ROOT
import glob

def build_rdataframe(filepath, tree):
    chain = ROOT.TChain(tree)
    for file in glob.glob(filepath):
        chain.Add(file)
    Rdf = ROOT.RDataFrame(chain)

    return chain, Rdf

chain, df = build_rdataframe("path/to/files", "tree")
print(df.Count().GetValue())
# works

def build_rdataframe_2(filepath, tree):                                                                                                                                                     
    chain = ROOT.TChain(tree)
    for file in glob.glob(filepath):
        chain.Add(file)
    Rdf = ROOT.RDataFrame(chain)

    return Rdf

df = build_rdataframe_2("path/to/files", "tree")
print(df.Count().GetValue())
# segfault

Setup

  1. ROOT version 6.24.00 / Python 3.10.4 | ROOT version 6.20.06 / Python 2.7.5
  2. Arch Linux / Centos7
  3. ROOT downloaded from conda-forge / Arch linux repository / lxplus (tested on multiple OSs/environments)

KeanuGh avatar Jul 14 '22 18:07 KeanuGh

Remark: If all trees of the TChain has the same name, and the TChain does not have friends, you can instead use this ctor: https://root.cern/doc/master/classROOT_1_1RDataFrame.html#adcebb79c00694365296d054626e87742 Further note that this ctor accepts filename globs.

ikabadzhov avatar Jul 16 '22 06:07 ikabadzhov

Hi @KeanuGh ,

thank you very much for your report. This is a lifetime problem that's related to the behavior of the underlying C++, and that makes it extremely hard to solve for this particular constructor.

However, we are introducing what we plan to be a superior alternative to this problematic pattern, RDatasetSpec, used for example as:

def make_df():
   ds = ROOT.RDF.Experimental.RDatasetSpec(treename, paths)
   return ROOT.RDataFrame(ds)

The constructor taking a RDatasetSpec will cover all cases covered by TChain (including friend trees, but probably not indexed friend trees) and it does not have the lifetime issue you describe. RDatasetSpec is available in the master branch since last week.

eguiraud avatar Jul 18 '22 15:07 eguiraud

Thanks for the replies @eguiraud and @ikabadzhov, RDatasetSpec sounds very useful! The only reason that I had been using TChains thus far is that RDataFrames don't seem to be able to handle wildcards in directories (ie path/to/sample*/*.root as opposed to path/to/sample/*.root). I didn't realise that a list of globs could be passed directly to the constructor so that is a solution I am implementing for now.

KeanuGh avatar Jul 18 '22 15:07 KeanuGh

Perhaps it would be possible to throw up an error or warning when this type of issue is encountered? as it took me an embarrassingly long time to find the source of the segfault!

KeanuGh avatar Jul 18 '22 15:07 KeanuGh

RDataFrames don't seem to be able to handle wildcards in directories

can you provide an example of what does not work? RDF forwards the globs directly to TChain so what works for TChain should work for RDF.

Perhaps it would be possible to throw up an error or warning when this type of issue is encountered?

a use-after-delete is undefined behavior in C++ which means (among other things) that in general it cannot be detected. I'll try to discuss with a few people the options we have.

eguiraud avatar Aug 01 '22 20:08 eguiraud

Hello again, we discussed this internally, brainstormed solutions etc. and the outcome is that this cannot be fixed in RDF (or possibly at all): as I mentioned above this is a quirk of managing C++ objects from Python: because of C++ lifetimes and undefined behavior with use-after-deletes, there is no surefire way in which RDF can detect that the TChain object it's referring to went out of scope.

Please do this instead:

def build_rdataframe_2(filepath, tree):                                                                                                                                                     
    return ROOT.RDataFrame(tree, filepath)

or if you need friend trees (from ROOT v6.28):

def build_rdataframe_2(filepath, tree):    
    spec = ROOT.RDF.Experimental.RDatasetSpec()\
       .AddGroup(
            (
                "samples",
                ["subTree1", "subTree2"],
                ["PYspecTestFile5.root", "PYspecTestFile6.root"],
             )
        )\
        .WithGlobalFriends("anotherTree", "PYspecTestFile7.root", "friendTree")
                                                                                                                                                
    return ROOT.RDataFrame(spec)

eguiraud avatar Jan 26 '23 17:01 eguiraud

I've come back to this recently and have found a little workaround. If the TChain is instead declared in C++ via the gInterpreter, it will not be garbage collected. eg:

def fill_rdataframe(paths, trees):
    """create TChain in c++ in order for it not to be garbage collected by python"""
    ROOT.gInterpreter.Declare(
        f"""
            TChain chain;
            void fill_chain() {{
                std::vector<std::string> paths = {{\"{'","'.join(paths)}\"}};
                std::vector<std::string> trees = {{\"{'","'.join(trees)}\"}};
                for (const auto& path : paths) {{
                    for (const auto& tree : trees) {{
                        chain.Add((path + "?#" + tree).c_str());
                    }}
                }}
            }}
        """
    )
    ROOT.fill_chain()

    return ROOT.RDataFrame(ROOT.chain)

Rdf = fill_rdataframe(["file1", "file2"], ["tree1", "tree2"])

the conversion from python lists into c++ vectors is a little messy but it works

KeanuGh avatar Feb 13 '23 13:02 KeanuGh

That is safe: the TChain is declared at global scope, so it will be destroyed only at program exit. Declaring it as a Python global variable (i.e. at module scope) should have the same effect.

eguiraud avatar Feb 13 '23 13:02 eguiraud