ak.from_rdataframe segmentation fault

Open pviscone opened this issue 1 year ago • 12 comments

Version of Awkward Array

2.3.1

Description and code to reproduce

Hi, a segmentation fault occurs when from_rdataframe is called on a very large RDataFrame. In my case, if I create the RDataFrame from 11 NanoAOD files (13,994,000 events), everything works fine, but with 12 NanoAOD files (15,274,000 events) or more, a segmentation fault occurs.

I have already verified that:

  • the crash is independent of EnableImplicitMT;
  • choosing other files changes nothing.

import ROOT
import awkward as ak
import glob

folder=glob.glob("/scratchnvme/pviscone/Preselection_Skim/powheg/root_files/predict/*.root")

df=ROOT.RDataFrame("Events",folder[:12])
print("Events: ",df.Count().GetValue())

columns=["Jet_pt",
        "Jet_phi",
        "Jet_eta",
        "Jet_btagDeepFlavB",
        "Jet_btagDeepFlavCvB",
        "Jet_btagDeepFlavCvL",
        "Muon_pt",
        "Muon_phi",
        "Muon_eta",
        "Electron_pt",
        "Electron_phi",
        "Electron_eta",
        "LHEPart_pdgId",
        "MET_pt",
        "MET_phi",
        "Photon_pt",
        "Photon_phi",
        "Photon_eta",]
        
ak_array=ak.from_rdataframe(df,columns=columns)
Warning in <TClass::Init>: no dictionary for class edm::Hash<1> is available
Warning in <TClass::Init>: no dictionary for class edm::ProcessHistory is available
Warning in <TClass::Init>: no dictionary for class edm::ProcessConfiguration is available
Warning in <TClass::Init>: no dictionary for class edm::ParameterSetBlob is available
Warning in <TClass::Init>: no dictionary for class pair<edm::Hash<1>,edm::ParameterSetBlob> is available
Events:  15274000
[1]    1441 segmentation fault (core dumped)  python -i try.py

This is the output of the analysis of the core dump with pystack:

❯ pystack core core.14211 -v --exhaustive --locals
Using executable found in the core file: /scratchnvme/pviscone/env/bin/python
INFO(process_core): Using library search path: 

Core file information:
state: R zombie: True niceness: 0
pid: 14211 ppid: 187369 sid: 187369
uid: 4354 gid: 1144 pgrp: 14211
executable: python arguments: python -i try.py 

The process died due receiving signal SIGSEGV
INFO(process_core): Analyzing core file core.14211 with executable /scratchnvme/pviscone/env/bin/python using stack method StackMethod.ALL with native mode NativeReportingMode.OFF
INFO(parse_maps_file_for_binary): python binary first map found: VirtualMap(start=0x000056356cb06000, end=0x000056356cb60000, filesize=0x1000, offset=0x3a000, device='', flags='r', inode=0, path='/scratchnvme/pviscone/env/bin/python')
INFO(parse_maps_file_for_binary): Process does not have a libpython.so, reading from binary
INFO(_get_bss): Determined exact addr of .bss section: 0x56356ce9eac0 (0x56356cb06000 + 0x398ac0)
INFO(parse_maps_file_for_binary): bss map found: VirtualMap(start=0x000056356ce9eac0, end=0x000056356cea4690, filesize=0x5bd0, offset=0x71ac0, device='', flags='', inode=0, path='None')
INFO(process_core): Attempting to find symbol 'Py_Version' in python
INFO(_get_python_version_from_map_information): Trying to extract version from filename: python
INFO(_get_python_version_from_map_information): Could not find version by looking at library or binary path: Trying to get it from running python --version
INFO(_get_python_version_from_map_information): Python version determined: 3.10
INFO(process_core): Attempting to find symbol '_PyRuntime' in python
INFO(process_core): Symbol '_PyRuntime' found at address 0x56356ce9e820
INFO(process_core): An active interpreter has been detected
INFO(process_core): Trying to resolve PyInterpreterState from Elf data
INFO(process_core): Searching for PyInterpreterState based on PyRuntime address 0x56356ce9e820
INFO(process_core): PyInterpreterState found by using ELF data at address 0x56356DBDB7D0
INFO(process_core): Fetching Python threads
INFO(process_core): Constructing new Python thread with tid 14211
Traceback for thread 14211 [Has the GIL] (most recent call last):
    (Python) File "/scratchnvme/pviscone/Vcb_ditopDecay/tasks/combine/systematics/try.py", line 33, in <module>
        a=ak.from_rdataframe(df,columns=columns)
    (Python) File "/scratchnvme/pviscone/env/lib/python3.10/site-packages/awkward/_dispatch.py", line 20, in dispatch
        gen_or_result = func(*args, **kwargs)
      Locals:
        kwargs: {"columns": ["Jet_pt", "Jet_phi", "Jet_eta", "Jet_btagDeepFlavB", ...]}
        args: (<RDataFrame at 0x7f9a7e6182c0>)
    (Python) File "/scratchnvme/pviscone/env/lib/python3.10/site-packages/awkward/operations/ak_from_rdataframe.py", line 49, in from_rdataframe
        return _impl(rdf, columns, highlevel, behavior, with_name, offsets_type, keep_order)
      Arguments:
        columns: ["Jet_pt", "Jet_phi", "Jet_eta", "Jet_btagDeepFlavB", ...]
        rdf: <RDataFrame at 0x7f9a7e6182c0>
      Locals:
        behavior: None
        highlevel: True
        offsets_type: "int64"
        with_name: None
        keep_order: False
    (Python) File "/scratchnvme/pviscone/env/lib/python3.10/site-packages/awkward/operations/ak_from_rdataframe.py", line 81, in _impl
        out = ak._connect.rdataframe.from_rdataframe.from_rdataframe(
      Arguments:
        offsets_type: "int64_t"
        behavior: None
        highlevel: True
        columns: ("Jet_pt", "Jet_phi", "Jet_eta", "Jet_btagDeepFlavB", ...)
        keep_order: False
        with_name: None
        data_frame: <RDataFrame at 0x7f9a7e6182c0>
      Locals:
        project: False
        awkward: <module at 0x7f9a7e6360c0>
    (Python) File "/scratchnvme/pviscone/env/lib/python3.10/site-packages/awkward/_connect/rdataframe/from_rdataframe.py", line 225, in from_rdataframe
        length = cpp_buffers_self.to_char_buffers[builder_type](builder)
      Arguments:
        offsets_type: "int64_t"
        behavior: None
        highlevel: True
        columns: ("Jet_pt", "Jet_phi", "Jet_eta", "Jet_btagDeepFlavB", ...)
        keep_order: False
        with_name: None
        data_frame: <RDataFrame at 0x7f9a7e6182c0>
      Locals:
        names_nbytes: <map<string,unsigned long> at 0x7f99db96c300>
        done: True
        fill_from_func: <cppyy.TemplateProxy at 0x7f99dac5d850>
        builder: <ListOffset<long,awkward::LayoutBuilder::Numpy<int> > at 0x7f9a7e618840>
        cpp_buffers_self: <CppBuffers at 0x7f99dae93400>
        buffers: {"node0-offsets": <numpy.ndarray at 0x7f99dac68630>, ...}
        CppBuffers: <CppBuffers_meta at 0x563588186e00>
        ListOffsetBuilder: <ListOffset<long,awkward::LayoutBuilder::Numpy<int> >_meta at 0x5635d2e62820>
        builder_type: "awkward::LayoutBuilder::ListOffset<long,awkward::LayoutBuilder::Numpy<int> >"
        cpp_fill_function: <function at 0x7f99daf4b490>
        column_types: {"Jet_pt": "ROOT::VecOps::RVec<Float_t>", ...}
        result_ptrs: {...}
        length: 15274000
        index: {}
        awkward_type_cols: {}
        form: <ListOffsetForm at 0x7f99dac595d0>
        contents: {"Jet_pt": <Array at 0x7f99dac59330>, ...}
        form_str: "{"class": "ListOffsetArray", "offsets": "i64", "content":{"class": "NumpyArr...
        list_depth: 2
        maybe_indexed: False
        col: "LHEPart_pdgId"
        form_dtype_name: "int32"
        item: <pair<const string,unsigned long> at 0x7f99dac42ec0>
        data_type: "int32_t"

ROOT version: 6.28
awkward version: 2.3.1
python version: 3.10.10

pviscone avatar Jul 19 '23 20:07 pviscone

I'm currently AFK, so I can't check this myself. How much memory does the result occupy if you load the 11-NanoAOD dataframe, i.e., what is the nbytes of the result? And how much memory does the system have?

My hunch is that there is insufficient free memory to allocate, thus the program segfaults upon writing to the (non)allocated result.

agoose77 avatar Jul 20 '23 09:07 agoose77

The nbytes of the 11 files case is 5750300292 (~5.7GB).

The system has 2TB of RAM, so I can rule out a lack of free memory as the cause.

Furthermore, I have tried building 2 separate RDataFrames, converting each into an awkward array with from_rdataframe, and then merging the two into a single awkward array, and everything works fine.

pviscone avatar Jul 20 '23 09:07 pviscone

I have tried with all 93 files that I have, creating RDataFrames in batches of 10 files, converting each to an awkward array, and then concatenating them, and it works.

The final ak_array nbytes is 44377449808 (~44GB) and it contains ~110,000,000 events.

import ROOT
import awkward as ak
import glob
import numpy as np

folder=glob.glob("/scratchnvme/pviscone/Preselection_Skim/powheg/root_files/predict/*.root")

rdf_list=[]
for i in range(np.ceil(len(folder)/10).astype(int)):
        rdf_list.append(ROOT.RDataFrame("Events",folder[10*i:10*(i+1)]))

columns=["Jet_pt",
        "Jet_phi",
        "Jet_eta",
        "Jet_btagDeepFlavB",
        "Jet_btagDeepFlavCvB",
        "Jet_btagDeepFlavCvL",
        "Muon_pt",
        "Muon_phi",
        "Muon_eta",
        "Electron_pt",
        "Electron_phi",
        "Electron_eta",
        "LHEPart_pdgId",
        "MET_pt",
        "MET_phi",
        "Photon_pt",
        "Photon_phi",
        "Photon_eta",]

ak_arr=ak.from_rdataframe(rdf_list[0],columns=columns)
for rdf in rdf_list[1:]:
        ak_arr=ak.concatenate([ak_arr,ak.from_rdataframe(rdf,columns=columns)],axis=0)
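
The batching workaround above can be factored into a small helper. This is only a sketch: `batches` and `convert_in_batches` are hypothetical names, and the `convert`/`combine` callables are placeholders so the logic can be checked without ROOT. It collects all parts first and combines them once at the end instead of concatenating inside the loop.

```python
from typing import Callable, Iterator, Sequence


def batches(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def convert_in_batches(files: Sequence, size: int,
                       convert: Callable, combine: Callable):
    """Convert each batch separately, then combine all parts in one call."""
    parts = [convert(batch) for batch in batches(files, size)]
    return combine(parts)


# With ROOT and awkward available, the callables could look like this
# (untested here; `folder` and `columns` are the names used in the script above):
#   convert = lambda b: ak.from_rdataframe(
#       ROOT.RDataFrame("Events", list(b)), columns=columns)
#   combine = lambda parts: ak.concatenate(parts, axis=0)
```

Combining once avoids re-copying the growing array on every iteration, which matters at the ~44GB scale reported above.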

pviscone avatar Jul 20 '23 12:07 pviscone

I can't reproduce this locally with an arbitrary ROOT file ~5GB (in RAM), and I'm doubtful I'll be able to make much headway like that. Would you be able to perform e.g. a bisect search to figure out which column(s) trigger this, so that we can make a minimum reproducer? Ideally a single column.
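
A bisect search over the column list could be sketched as below. `find_crashing_column` is a hypothetical helper, and it assumes a single column is responsible. The `crashes` predicate is supplied by the caller; because a segfault kills the interpreter, a real predicate would have to run each trial in a separate process.

```python
from typing import Callable, Optional, Sequence


def find_crashing_column(columns: Sequence[str],
                         crashes: Callable[[Sequence[str]], bool]) -> Optional[str]:
    """Narrow `columns` down to one column for which `crashes` is True.

    Assumes exactly one column triggers the crash. Returns None if the
    full list does not crash at all.
    """
    candidates = list(columns)
    if not crashes(candidates):
        return None
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Keep whichever half still reproduces the crash.
        candidates = first if crashes(first) else second
    return candidates[0]


# Stand-in predicate for illustration only: pretend one column is the culprit.
cols = ["Jet_pt", "Jet_phi", "Muon_pt", "LHEPart_pdgId", "MET_pt"]
culprit = find_crashing_column(cols, lambda subset: "LHEPart_pdgId" in subset)
print(culprit)
```

If the crash depends on total size rather than on a specific column, the search still terminates, but the column it returns is not meaningful.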

agoose77 avatar Jul 20 '23 17:07 agoose77

The number of columns (and which column I choose) seems to be irrelevant.


I have tried to do something more general without loading files, simply creating an RDataFrame in place. I instantiated RDataFrames of different lengths in a loop to see when it crashes.

import ROOT
import awkward as ak
import numpy as np

def rdf2ak(n):
       rdf=ROOT.RDataFrame(int(n)).Define("Dummy","1.")
       ak_arr=ak.from_rdataframe(rdf,columns=["Dummy"])
       return ak_arr

values=np.linspace(133e6,135e6,20,dtype=int)

for value in values:
       print(value)
       rdf2ak(value)
❯ python -i try.py 
133000000
133105263
133210526
133315789
133421052
133526315
133631578
133736842
133842105
133947368
[1]    90645 segmentation fault (core dumped)  python -i try.py

I have also tried other values. The number of entries that makes it crash fluctuates, but is always around 134,000,000.
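
Since the threshold fluctuates and the first segfault ends the scan, each trial could be run in its own interpreter. The following is a sketch using only the standard library; `run_isolated`, `segfaulted`, `first_crashing_size` and the `TRIAL` template are hypothetical names, and the template itself needs ROOT and awkward installed to do anything useful.

```python
import signal
import subprocess
import sys


def run_isolated(code: str) -> int:
    """Run `code` in a fresh interpreter and return its exit status."""
    # -X faulthandler makes Python dump a traceback to stderr on SIGSEGV.
    result = subprocess.run([sys.executable, "-X", "faulthandler", "-c", code])
    return result.returncode


def segfaulted(returncode: int) -> bool:
    """On POSIX, death by signal N is reported as returncode -N."""
    return returncode == -signal.SIGSEGV


# Hypothetical trial script mirroring the reproducer above.
TRIAL = """
import ROOT, awkward as ak
rdf = ROOT.RDataFrame({n}).Define("Dummy", "1.")
ak.from_rdataframe(rdf, columns=["Dummy"])
"""


def first_crashing_size(sizes):
    """Return the first size whose trial segfaults, or None if none does."""
    for n in sizes:
        if segfaulted(run_isolated(TRIAL.format(n=n))):
            return n
    return None
```

Calling `first_crashing_size` repeatedly over the same range would show how much the crashing value moves between runs.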


If I fill the RDataFrame with RVec<float>(10,1.), the number of entries that makes it crash is exactly 1/10th of the previous one.

import ROOT
import awkward as ak
import numpy as np

def rdf2ak(n):
       rdf=ROOT.RDataFrame(int(n)).Define("Dummy","ROOT::RVec<float>(10,1.)")
       ak_arr=ak.from_rdataframe(rdf,columns=["Dummy"])
       return ak_arr

values=np.linspace(133e5,135e5,20,dtype=int)

for value in values:
       print(value)
       rdf2ak(value)

❯ python -i try_rvec.py 
13300000
13310526
13321052
13331578
13342105
13352631
13363157
13373684
13384210
[1]    93274 segmentation fault (core dumped)  python -i try_rvec.py

Repeating the same trial but defining 2 columns changes nothing.

import ROOT
import awkward as ak
import numpy as np

def rdf2ak(n):
       rdf=(ROOT.RDataFrame(int(n)).Define("Dummy","ROOT::RVec<float>(10,1.)")
                                   .Define("Dummy2","ROOT::RVec<float>(10,2.)"))
       ak_arr=ak.from_rdataframe(rdf,columns=["Dummy","Dummy2"])
       return ak_arr

values=np.linspace(133e5,135e5,20,dtype=int)

for value in values:
       print(value)
       rdf2ak(value)
❯ python -i try_rvec2col.py 
13300000
13310526
13321052
13331578
13342105
13352631
13363157
13373684
13384210
[1]    95454 segmentation fault (core dumped)  python -i try_rvec2col.py

So it seems that it crashes if a column contains more than ~134 × 10^6 floats (~500MB).
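
As a rough check of the sizes involved, the buffer size per column is entries × items per entry × item size. The sketch below assumes 4-byte floats for `RVec<float>`, and assumes the scalar `Define("Dummy","1.")` column is stored as an 8-byte double (in C++ the literal `1.` is a double). `column_nbytes` is a hypothetical helper.

```python
def column_nbytes(n_entries: int, items_per_entry: int, itemsize: int) -> int:
    """Size in bytes of a flat data buffer holding every item of a column."""
    return n_entries * items_per_entry * itemsize


# Scalar trial: ~134e6 entries of one double each (assumed 8 bytes).
scalar_double = column_nbytes(134_000_000, 1, 8)

# RVec trial: ~13.4e6 entries of ten floats each (4 bytes).
rvec_float = column_nbytes(13_400_000, 10, 4)

print(scalar_double)
print(rvec_float)
```

Under these assumptions the two trials fail at about the same number of items (~134 × 10^6) but at different byte sizes, roughly 1.07 GB versus 536 MB, which would point to the item count rather than the byte size as the common factor.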

pviscone avatar Jul 20 '23 19:07 pviscone

This is sufficient for me to reproduce locally. It seems that we crash in the writing of the buffers from our growable buffer implementation. I will take a look and find the cause.
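
For readers unfamiliar with the term, a growable buffer can be pictured as a chain of fixed-size panels that is copied into one contiguous block at the end. The following toy class illustrates that general idea only; it is not Awkward Array's actual C++ implementation, and the class and method names are hypothetical.

```python
from array import array


class GrowableBuffer:
    """Toy growable buffer: append into fixed-size panels, copy out once."""

    def __init__(self, typecode: str = "f", panel_size: int = 1024):
        self.typecode = typecode
        self.panel_size = panel_size
        self.panels = [array(typecode)]

    def append(self, value) -> None:
        # Start a new panel when the current one is full, so existing
        # data never has to be moved while filling.
        if len(self.panels[-1]) == self.panel_size:
            self.panels.append(array(self.typecode))
        self.panels[-1].append(value)

    def __len__(self) -> int:
        return sum(len(panel) for panel in self.panels)

    def concatenate(self) -> array:
        """Copy all panels, in order, into one contiguous array."""
        out = array(self.typecode)
        for panel in self.panels:
            out.extend(panel)
        return out


buf = GrowableBuffer(typecode="i", panel_size=4)
for i in range(10):
    buf.append(i)
flat = buf.concatenate()
```

The final copy is the step that writes into the destination buffer, so it is the step that needs the destination to be sized for the full length.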

agoose77 avatar Jul 21 '23 15:07 agoose77

> This is sufficient for me to reproduce locally. It seems that we crash in the writing of the buffers from our growable buffer implementation. I will take a look and find the cause.

I cannot reproduce it:

>>> import ROOT
>>> import awkward as ak
>>> import numpy as np
>>> 
>>> def rdf2ak(n):
...        rdf=ROOT.RDataFrame(int(n)).Define("Dummy","ROOT::RVec<float>(10,1.)")
...        ak_arr=ak.from_rdataframe(rdf,columns=["Dummy"])
...        return ak_arr
... 
>>> values=np.linspace(133e5,135e5,20,dtype=int)
>>> for value in values:
...        print(value)
...        rdf2ak(value)
... 
13300000
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13300000 * {Dummy: va...'>
13310526
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13310526 * {Dummy: va...'>
13321052
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13321052 * {Dummy: va...'>
13331578
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13331578 * {Dummy: va...'>
13342105
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13342105 * {Dummy: va...'>
13352631
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13352631 * {Dummy: va...'>
13363157
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13363157 * {Dummy: va...'>
13373684
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13373684 * {Dummy: va...'>
13384210
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13384210 * {Dummy: va...'>
13394736
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13394736 * {Dummy: va...'>
13405263
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13405263 * {Dummy: va...'>
13415789
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13415789 * {Dummy: va...'>
13426315
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13426315 * {Dummy: va...'>
13436842
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13436842 * {Dummy: va...'>
13447368
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13447368 * {Dummy: va...'>
13457894
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13457894 * {Dummy: va...'>
13468421
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13468421 * {Dummy: va...'>
13478947
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13478947 * {Dummy: va...'>
13489473
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13489473 * {Dummy: va...'>
13500000
<Array [{Dummy: [1, 1, ..., 1, 1]}, ..., {...}] type='13500000 * {Dummy: va...'>
>>> 

ianna avatar Jul 21 '23 17:07 ianna

@ianna 

I have tried to run the same code on my laptop, and it crashes at a slightly higher value than before (I was using another machine).
There isn't a precise value that makes it crash. Have you also tried higher values?

If you can't reproduce the error, could this problem be system-dependent? 

I really don't know how from_rdataframe is implemented, but there are some differences across systems. For example, mutex has a different implementation in Linux and macOS, and you can see the effect easily, for example, using fork() in the ROOT shell. 

I have run it only on Linux machines (Arch and CentOS). Do you have a different system?
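
To compare systems, each machine could print a short environment report. This is a sketch using only the standard library; `environment_report` is a hypothetical helper. It covers pip-visible packages only, so the ROOT version would still be read from `ROOT.__version__`.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("awkward", "numpy")) -> dict:
    """Collect OS, Python and package versions for a bug report."""
    report = {
        "platform": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```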

pviscone avatar Jul 21 '23 17:07 pviscone

On my system, the value 134052631 triggers the segfault @ianna (not exclusively). Not yet sure why.

agoose77 avatar Jul 21 '23 20:07 agoose77

@pviscone - I'm running on darwin macOS 11.6 Clang 14.0.6, Python 3.10.8, and

>>> ROOT.__version__
'6.26/10'
>>> ak.__version__
'2.3.1'
>>> np.__version__
'1.23.5'

I have tried higher values: I observe high CPU consumption, but no segfault. I'll try to check with later versions of ROOT and Python.

ianna avatar Jul 22 '23 01:07 ianna

Well, this is strange ... I was able to reproduce this but suddenly am no longer able. When it was segfaulting, it didn't seem to be the memcpy, which was what I'd expected.

agoose77 avatar Jul 22 '23 08:07 agoose77

> Well, this is strange ... I was able to reproduce this but suddenly am no longer able. When it was segfaulting, it didn't seem to be the memcpy, which was what I'd expected.

I'll try to run it with a later version of ROOT and also valgrind it on CentOS.

ianna avatar Jul 23 '23 14:07 ianna

Sorry for the huge delay, but today I was able to test the script on a Mac and it works, so I really suspect it could be system-dependent.

pviscone avatar Apr 02 '24 12:04 pviscone

Thanks for checking again! Enough things have changed since last summer that we'd want to see this bug reproduced before digging into it, so I'll close the issue now. If you or anyone else sees what seems to be this same bug again, add a comment here and I can reopen the issue.

jpivarski avatar Apr 02 '24 14:04 jpivarski