root
root copied to clipboard
[hadd] add option to specify which objects should (not) be merged
Hi all! This is my first PR, please be gentle 🙂
The PR is meant to trigger a discussion on the feature described in the title, rather than being a request that it should go in as-is.
Description
Add -l and -e command line options for passing a list of objects to TFileMerger::AddObjectNames and setting the TFileMerger::kSkipListed or TFileMerger::kOnlyListed flags.
Purpose
Allows to merge only certain objects from the list of input files.
Use case
Merging single directories/trees into separate root files. More concrete: when producing nTuples (on the WLCG) with common LHCb tools, one sometimes wants to run different selections on the same input stream and write out a tree for each of the selections. All these trees end up in the same root file.
Shortcomings
- All directories will be copied in allowlist mode, but the ones not on the allowlist will be empty. I didn't look further into this, but it could be that this would have to be changed in
TFileMergerrather thanhadd. - The code is not rigorously tested. If the feature should be added, I am available to do that.
Can one of the admins verify this patch?
Hello! :) Non-functional issue but before actual review may I suggest blocklist / allow list terminology?
Hello! :) Non-functional issue but before actual review may I suggest blocklist / allow list terminology?
Oh, sorry! I've changed the terms
This looks like a nice feature thanks.
I wonder if rather than a single list (that is either an inclusion list or an exclusion list), we should allow both ... humm ... right .. this is not yet supported by TFileMerger :( ...
Thank you for taking a look @pcanal ! I have removed the Stopwatch and the multiprocessing part.
I have looked into multiprocessing further, but even the current version of hadd does not allow multiprocessing when giving more than 6 input files (xrootd) via the command line (as opposed to using @<inputfile_list> which cannot work in the current implementation). TFileMerger stalls after the first input file of each process has been added.
Also, the 2 items listed under shortcomings in the description remain. Let me know if I can do something about them.
(as opposed to using @<inputfile_list> which cannot work in the current implementation).
Why can't it work? Having the @<inputfile_list> should end up being equivalent to listing the files explicitly on the command line? What am I missing?
Adding the @<inputfile_list> is a good idea that (should be) independent on whether the multiprocessing works (I assume you mean the feature where hadd spawns multiple processes ; this feature is helpful for the case where the file contains (many) histograms and is a slight pessimissation if the file contains TTrees).
TFileMerger stalls after the first input file of each process has been added.
Do you have a stack trace of the where the thread/processes stall?
All directories will be copied in allowlist mode,
TFileMerger indeeds creates the directory in the output file before it checks whether they have any content. I suspect we could remove it if its ends up empty.
Why can't it work? Having the @<inputfile_list> should end up being equivalent to listing the files explicitly on the command line? What am I missing?
Sorry for not directly referring to the code here. It counts the number of input arguments rather than the number of input files to merge. If you only give it a single @<inputfile_list>, nProcesses will always be set back to 1.
Adding the @<inputfile_list> is a good idea that (should be) independent on whether the multiprocessing works (I assume you mean the feature where hadd spawns multiple processes ; this feature is helpful for the case where the file contains (many) histograms and is a slight pessimissation if the file contains TTrees).
True, there is probably no speed gain at all for TTrees, since the partial fill will have to be merged again. In conjunction with -dbg it might be still be useful with large input file lists, in case one of the processes fails unexpectedly. But then the question would be if the partial file list is persisted somewhere, so one can run the failed partial merge again...
Do you have a stack trace of the where the thread/processes stall?
With hadd "From tags/v6-20-04@v6-20-04" I got
$ gdb hadd
...
run -j 2 -ff test.root root://[email protected]//eos/lhcb/grid/user/lhcb/user/m/mstahl/ ... (about 20 files)
...
hadd Source file 1: root://[email protected]//eos/lhcb/grid/user/lhcb/user/m/mstahl/2020_07/383694/383694794/Xipih.root
hadd Source file 1: root://[email protected]//eos/lhcb/grid/user/lhcb/user/m/mstahl/2020_07/383694/383694839/Xipih.root
^C
Program received signal SIGINT, Interrupt.
0x00007ffff5cda9a3 in select () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff5cda9a3 in select () from /lib64/libc.so.6
#1 0x00007ffff6ca7c1a in TUnixSystem::UnixSelect(int, TFdSet*, TFdSet*, long) () from /usr/lib64/root/libCore.so.6.20
#2 0x00007ffff6ca8500 in TUnixSystem::DispatchOneEvent(bool) () from /usr/lib64/root/libCore.so.6.20
#3 0x00007ffff6bd2ce6 in TSystem::InnerLoop() () from /usr/lib64/root/libCore.so.6.20
#4 0x00007ffff760bfb3 in TMonitor::Select() () from /usr/lib64/root/libNet.so.6.20
#5 0x0000000000408ec7 in void ROOT::TProcessExecutor::Collect<bool>(std::vector<bool, std::allocator<bool> >&) ()
#6 0x00000000004057d4 in main ()
So hadd stalls after TMerger added the first input file for each process.
In the meantime I ran into another issue that concerns empty directories. In the files I'm trying to merge, it rarely happens that a directory/tree is empty because no events have been selected (for that specific selection). In such a case TFileMerger raises a seg fault (vanilla hadd From tags/v6-20-04@v6-20-04)
#6 0x00007faa3fd0e43e in TFileMerger::MergeRecursive(TDirectory*, TList*, int) () from /cvmfs/sft.cern.ch/lcg/views/LCG_97python3/x86_64-centos7-gcc9-opt/lib/libRIO.so
#7 0x00007faa3fd0d29c in TFileMerger::PartialMerge(int) () from /cvmfs/sft.cern.ch/lcg/views/LCG_97python3/x86_64-centos7-gcc9-opt/lib/libRIO.so
#8 0x0000000000405e9a in main ()
Trying my local version built with debug symbols and running gdb didn't yield further info.
I was a bit puzzled to see this, since I could swear that I successfully merged files with empty directories in the past. And in fact, it works with root 6.18.00 from /cvmfs/sft.cern.ch/lcg/views/LCG_96/x86_64-centos7-gcc8-opt (without complaining - maybe there should be a message with default verbosity settings).
I diff'ed hadd.cxx as well as the TFileMerger source and header files between master and v6-18-00-patches, but did not see anything that would lead to these seg-faults.
Do you have an idea where this could come from?
Since this is only loosely related to the actual PR, it might not be the right place to discuss this. I can post it elsewhere if that would make sense.
Recent use case: https://root-forum.cern.ch/t/excluding-custom-class-from-merging/63637