
Working with LazyTree as a DataFrame

Open peremato opened this issue 4 years ago • 23 comments

I'm new to Julia and want to see its potential for HEP computing. For this, I am playing with a TTree in UnROOT to see how it compares with an equivalent job using ROOT/RDataFrame. I understand that a LazyTree is not a DataFrame, since in general the full TTree cannot fit in memory; however, it provides the basic functionality of an AbstractDataFrame. Naively I thought that I could start with a LazyTree and then, once it is heavily reduced, convert it to a DataFrame. But this does not seem to work. Perhaps you can give me some hints. The code I am trying to execute is as follows:

using UnROOT
f = ROOTFile("/eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root")
events = LazyTree(f, "Events");
df = filter([:nMuon,:Muon_charge] => (n,c) -> n == 2 && c[1] != c[2], events[1:1000000])

and the error I get is:

MethodError: no method matching getindex(::LazyTree{TypedTables.Table{NamedTuple{(:Muon_phi, :nMuon, :Muon_pt, :Muon_eta, :Muon_charge, :Muon_mass), Tuple{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, UInt32, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, SubArray{Int32, 1, Vector{Int32}, Tuple{UnitRange{Int64}}, true}, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}}, 1, NamedTuple{(:Muon_phi, :nMuon, :Muon_pt, :Muon_eta, :Muon_charge, :Muon_mass), Tuple{Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{UInt32}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Int32, 1, Vector{Int32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}}}}}, ::BitVector, ::Colon)
Closest candidates are:
  getindex(::AbstractDataFrame, ::Integer, ::Colon) at /eos/user/m/mato/.julia/packages/DataFrames/vuMM8/src/dataframerow/dataframerow.jl:210
  getindex(::AbstractDataFrame, ::Integer, ::Union{Colon, Regex, AbstractVector{T} where T, All, Between, Cols, InvertedIndex}) at /eos/user/m/mato/.julia/packages/DataFrames/vuMM8/src/dataframerow/dataframerow.jl:208
  getindex(::AbstractDataFrame, ::CartesianIndex{2}) at /eos/user/m/mato/.julia/packages/DataFrames/vuMM8/src/other/broadcasting.jl:3
  ...

Stacktrace:
 [1] #filter#88
   @ ~.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:1040 [inlined]
 [2] filter(::Pair{Vector{Symbol}, var"#5#6"}, df::LazyTree{TypedTables.Table{NamedTuple{(:Muon_phi, :nMuon, :Muon_pt, :Muon_eta, :Muon_charge, :Muon_mass), Tuple{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, UInt32, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, SubArray{Int32, 1, Vector{Int32}, Tuple{UnitRange{Int64}}, true}, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}}, 1, NamedTuple{(:Muon_phi, :nMuon, :Muon_pt, :Muon_eta, :Muon_charge, :Muon_mass), Tuple{Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{UInt32}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Int32, 1, Vector{Int32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}}}}})
   @ DataFrames ~.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:1035
 [3] macro expansion
   @ In[5]:3 [inlined]
 [4] top-level scope
   @ timing.jl:210
 [5] eval
   @ ./boot.jl:360 [inlined]
 [6] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094

If instead of passing a LazyTree I construct a DataFrame, then it works nicely. However, I think this means the full TTree has been copied into memory.

df = filter([:nMuon,:Muon_charge] => (n,c) -> n == 2 && c[1] != c[2], DataFrame(events[1:1000000]))

peremato avatar Aug 27 '21 08:08 peremato

actually, if you do copycols=false, it's a lazy DataFrame, I believe, but notice that the [1:1000000] is making a concrete table first. @aminnj has shown me that packages in the ecosystem just work, such as https://github.com/queryverse/Query.jl

julia> @time @from evt in t2 begin
         @where length(evt.Jet_pt) > 6
         @let njets=length(evt.Jet_pt)
         @let njets40=sum(evt.Jet_pt.>40)
         @select {njets=njets, njets40, evt.MET_pt}
         @collect DataFrame
       end
  0.322372 seconds (1.02 M allocations: 57.209 MiB, 99.82% compilation time)
831×3 DataFrame
 Row │ njets  njets40  MET_pt
     │ Int64  Int64    Float32
─────┼──────────────────────────
   1 │     7        1   28.2933
   2 │     7        4   38.6646
   3 │     7        6   75.2459
   4 │     8        4   49.9447
   5 │     9        3   80.1362

Moelf avatar Aug 27 '21 10:08 Moelf

to show that DataFrame can be lazy just fine:


julia> @time const mytree = LazyTree(ROOTFile("./test/samples/NanoAODv5_sample.root"), "Events", r"nMuon");
  0.067813 seconds (296.65 k allocations: 25.716 MiB)

julia> @time DataFrame(mytree; copycols=false);
  0.000018 seconds (20 allocations: 1.375 KiB)

julia> @time df = filter(evt->evt.nMuon==2, DataFrame(mytree; copycols=false));
  0.108670 seconds (383.77 k allocations: 19.429 MiB, 17.67% gc time, 98.96% compilation time)

we could probably delegate filter() to the inner table:

julia> inner = getfield(t, :treetable);

julia> filter(evt->evt.nMuon==2, inner)
Table with 2 columns and 175 rows:
      Jet_nMuons                                             nMuon
    ┌─────────────────────────────────────────────────────────────
 1  │ Int32[0, 1, 0, 0, 1, 0, 0]                             2
 2  │ Int32[0, 0, 1, 0, 0, 0, 0, 0, 0]                       2
 3  │ Int32[0, 0, 0, 1, 0, 1, 0]                             2

the second one simply goes through Julia's Base filter https://github.com/JuliaLang/julia/blob/1bbba21aa258a99d1ecf1168d72d64cb402fd054/base/array.jl#L2523: a first pass collects an index mask, then the original table is indexed with that mask. Not the most efficient way, but it works.
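The two-pass scheme can be sketched in a few self-contained lines. This is an illustration, not UnROOT internals: a plain vector of NamedTuples stands in for the inner TypedTables.Table, and the cut is the one from the original filter example.

```julia
# Sketch of the two-pass filter described above. `rows` is a stand-in
# for the inner table; each element is one "event".
rows = [
    (nMuon = 2, Muon_charge = Int32[1, -1]),
    (nMuon = 1, Muon_charge = Int32[1]),
    (nMuon = 2, Muon_charge = Int32[1, 1]),
]

# first pass: collect the index mask
mask = [r.nMuon == 2 && r.Muon_charge[1] != r.Muon_charge[2] for r in rows]

# second pass: index the original collection with the mask
selected = rows[mask]
```

Note that the short-circuiting `&&` matters: for the single-muon event, `Muon_charge[2]` is never accessed.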

Moelf avatar Aug 27 '21 11:08 Moelf

however it provides the basic functionality of an AbstractDataFrame

a bit tangential here: we don't actually need DataFrames.jl for any "real" functionality, mainly the printing and some interfaces I got tired of hunting down. We could eventually drop the DataFrames.jl dependency entirely, since you can always call DataFrame(mytree; copycols=false) to get a DataFrame without blowing up RAM.

I haven't investigated what the performance pitfalls are once you wrap it inside a DataFrame.

Moelf avatar Aug 27 '21 11:08 Moelf

Thanks @Moelf for the various hints. The copycols=false is very useful. I'll have a look at Query.jl. It looks very interesting, although I do not know how to do the reduction (aka filling some histograms) without having to materialize a full Array or DataFrame with all the values that I would like to histogram.

peremato avatar Aug 27 '21 12:08 peremato

For filling histograms: you will likely have more than a few histograms to fill anyway, so jamming all the code into one columnar command becomes hard to maintain later. A loop with https://github.com/Moelf/FHist.jl will work just fine:

h = Hist1D(Int; bins=-3:0.5:3)

for evt in mytree
    if evt.nMuon == 2
        push!(h, count(>(0), evt.Jet_nMuons))  # count() needs a predicate; here, jets with at least one muon
    end
end

If you want to do something "aside" in the middle of a query, you probably need to ask the Query.jl people, since most of the DataFrame community (Julia or Python) doesn't have this need, I guess.

Also, I believe that the LazyBranch from mytree.nMuon works for histogram filling without blowing up memory, but that's probably not too useful anyway.

Moelf avatar Aug 27 '21 12:08 Moelf

If you have a particular RDataFrame use-case in mind that is nice AND scalable, such that you can do a full-blown analysis with just RDataFrame query-y commands, I'm happy to match the semantics with some utility macros. Otherwise I feel like RDataFrame is just for quick checks (which are probably easier to do in Julia[1] with the whole DataFrame ecosystem available), and eventually one starts writing the analysis loop anyway.

[1]: albeit slower by a bit, but does it matter in interactive exploration stage of an analysis?

Moelf avatar Aug 27 '21 13:08 Moelf

First, I have the personal challenge of filling a histogram with all the events in the ROOT file (~65M) without writing the event loop explicitly. The following code blows up if I use the full file:

using UnROOT, Query, FHist, StatsBase, DataFrames

function invariantMass(pt, eta, phi, mass)
   x, y, z = pt .* cos.(phi), pt .* sin.(phi), pt .* sinh.(eta)
   e = sqrt.( x.^2 + y.^2 + z.^2 + mass.^2)
   sqrt(sum(e)^2 - sum(x)^2 - sum(y)^2 - sum(z)^2) 
end

events = LazyTree(ROOTFile("/eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"), "Events")

@time df = events |>
    #@take(1000000) |>
    @filter(_.nMuon == 2) |>
    @filter(_.Muon_charge[1] != _.Muon_charge[2]) |>
    @map({inv_mass = invariantMass(_.Muon_pt, _.Muon_eta, _.Muon_phi, _.Muon_mass)}) |>
    DataFrame
h = Hist1D(log10.(df.inv_mass), -1.:0.005:2.5)

Second, we are developing RDataFrame exactly for the purpose of not having the explicit event loop and ensuring that the different selections, reductions (filling histograms), etc. are all executed within a single pass, thus optimizing I/O, with the possibility, behind the scenes, of further optimization, parallel execution, data caching, etc. I'll provide you with an example of some complexity to see how this could be written in Julia.

peremato avatar Aug 27 '21 14:08 peremato

yes, your example makes a copy at the end, because the Query pipeline produces a collection of NamedTuples.

not having the explicit event loop

I doubt a real-world analysis expressed as an RDataFrame operation chain will be less complex than a loop in Julia. Can you, for example, show how to find the best Z-candidate lepton pair among both muons and electrons (bonus: if each of them has to pass its own family-specific quality cut) and fill a histogram that shows a Z-peak? I find it difficult to imagine how to do that without loops.

The point is that a for-loop in Julia is trivial: no binding variables etc., and users can parallelize it however they like, explicitly.
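As a rough sketch of what "parallelize it explicitly" can look like, assuming a plain integer vector `nmuon` stands in for a branch (this is generic Julia, nothing UnROOT-specific):

```julia
using Base.Threads: @spawn, nthreads

# Chunk the index range, reduce each chunk in its own task, merge the results.
function count_dimuon(nmuon::AbstractVector{<:Integer})
    chunksize = max(1, cld(length(nmuon), nthreads()))
    chunks = Iterators.partition(eachindex(nmuon), chunksize)
    tasks = [@spawn count(i -> nmuon[i] == 2, idxs) for idxs in chunks]
    return sum(fetch.(tasks))
end
```

For example, `count_dimuon([2, 1, 2, 3, 2])` returns 3 regardless of the thread count.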

Moelf avatar Aug 27 '21 14:08 Moelf

btw, there's a CMS NanoAOD file in the ./test/samples of this repo; it's easier to have a common test sample so we can copy-paste and run. (accessing Open Data turns out to be more complicated than I thought)

Moelf avatar Aug 27 '21 14:08 Moelf

You can download it from here: http://opendata.web.cern.ch/record/12341 I am using it from SWAN with the Bleeding Edge software stack.

peremato avatar Aug 27 '21 15:08 peremato

I would do your example this way:

julia> using UnROOT, LVCyl, FHist

julia> const events = LazyTree(ROOTFile("Run2012BC_DoubleMuParked_Muons.root"), "Events");

julia> const h = Hist1D(Int; bins = -1.:0.005:2.5);

julia> @time for evt in events
           evt.nMuon != 2 && continue
           sum(evt.Muon_charge) != 0 && continue
           lv1, lv2 = LorentzVectorCyl.(evt.Muon_pt, evt.Muon_eta, evt.Muon_phi, evt.Muon_mass)
           inv_mass = fast_mass(lv1, lv2)
           #we know we're only filling from one thread
           unsafe_push!(h, log10(inv_mass))
       end
 24.747783 seconds (24.62 M allocations: 31.011 GiB, 2.13% gc time) #note: this is cumulative allocation

LVCyl is from https://github.com/JuliaHEP/LVCyl.jl, which also provides fast_mass.

julia> length(events)
61540413

julia> ans / 24.74
2.487486378334681e6

2.5 MHz isn't too bad I hope.

Moelf avatar Aug 27 '21 15:08 Moelf

A couple of examples with some complexity using RDataFrame and accessing Open Data. https://github.com/cms-opendata-analyses/HiggsTauTauNanoAODOutreachAnalysis https://root.cern/doc/master/df103__NanoAODHiggsAnalysis_8py.html It would be nice to compare them with pure Julia equivalent.

peremato avatar Aug 30 '21 13:08 peremato

Those are way too long, and I can't quite figure out where some of the functions referenced in strings come from, for example pt_cuts() in the second link (a .h file somewhere?).

Is it not possible to have a single-objective example? Each of them has too many histograms woven in; an exact comparison will be impossible because of all the non-event-loop-related cost.

Moelf avatar Aug 30 '21 13:08 Moelf

Indeed, the pure Julia equivalent would be completely different. In the example above, there are many high-level cuts (btw. the string-evaluation-style coding is really awful for clarity) which do multiple iterations over and over again. In Julia, you'd likely do it straightforwardly in a simple loop.

Just to show a dumb example: people doing data reduction in Python tend to work with Pandas DataFrames à la:

>>> df = df[df.a < 10]
>>> df = df[(np.abs(df.b) < 1) & (df.c > 0)]
>>> df = df[df.a + df.b < 0.4]
>>> ...

in this example, each step creates a temporary boolean mask and a temporary dataset, and there are multiple loops over the same elements over and over. Depending on the data size, this can cause enormous memory overhead, and heavy calculations lead to wasted CPU resources too.

In Julia you'd most likely iterate once and build up your final dataset element-wise, step-by-step. Clear, concise and with a low memory footprint and efficient CPU usage.

...and the best thing is: you don't have to learn fancy slicing tricks to circumvent the above-mentioned issues; you just code your algorithm straightforwardly.
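To make that concrete, here is a single-pass sketch of the pandas chain above, using the same hypothetical columns a, b and c (plain NamedTuples stand in for the dataset):

```julia
# Single-pass equivalent of the chained pandas filters: one loop, one
# output, no intermediate masks or copies.
rows = [
    (a = -1.0, b = 0.5, c = 1.0),   # passes every cut
    (a = 12.0, b = 0.1, c = 2.0),   # fails a < 10
    (a = 5.0,  b = 0.5, c = 1.0),   # fails a + b < 0.4
]

selected = similar(rows, 0)   # empty output with the same element type
for r in rows
    r.a < 10        || continue
    abs(r.b) < 1    || continue
    r.c > 0         || continue
    r.a + r.b < 0.4 || continue
    push!(selected, r)
end
```

Each `cut || continue` line plays the role of one chained `df = df[...]` step, but the data is only traversed once.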

In my opinion, it would be an easy exercise to work through https://root.cern/doc/master/df103__NanoAODHiggsAnalysis_8py.html and implement it in Julia, but it requires time, which probably should be invested by someone who needs it ;)

tamasgal avatar Aug 30 '21 13:08 tamasgal

Those are way too long and I can't quite figure out where does some functions in string comes from for example pt_cuts() in the second link (.h file somewhere?).

The header file is https://github.com/root-project/root/blob/master/tutorials/dataframe/df103_NanoAODHiggsAnalysis_python.h

Is it not possible to have a single-objective example? Each of them have too many histograms weaved in, making exact comparison will be impossible because too many non-event-loop related cost.

I understand it is quite a lot of work, but comparing the performance, code clarity, debuggability, concurrency, etc. of an analysis like this, from the ROOT input file to the final set of histograms, is worth it in order to have convincing arguments for the HEP community about the advantages of Julia versus C++/Python.

peremato avatar Aug 30 '21 14:08 peremato

Indeed, the pure Julia equivalent would be completely different. In the example above, there are many high-level cuts (btw. the string-evaluation-style coding is really awful for clarity) which do multiple iterations over and over again. In Julia, you'd likely do it straight forward in a simple loop.

In RDataFrame all the actions (definitions, filters, etc.) are lazy. There is indeed a single event loop over the full analysis. I do agree that the string-evaluation-style coding is really awful for clarity. This is the main reason why Julia could be very attractive, if reasonable performance is preserved.

Just to show a dumb example; people doing data reduction in Python tend to work with Pandas DataFrames à la:

>>> df = df[df.a < 10]
>>> df = df[np.abs(df.b < 1) & (np.c > 0)]
>>> df = df[df.a + df.b < 0.4]
>>> ...

in this example, 10 temporary datasets are created and there are multiple loops over the same elements over and over. Depending on the data size, this can cause enormous memory overhead and heavy calculations lead to wasted CPU resources too.

No, if df is an RDataFrame, then there are not many loops nor many instantiations of temporary datasets. A single loop and no large temporaries.

In Julia you'd most likely iterate once and build up your final dataset element-wise, step-by-step. Clear, concise and with a low memory footprint and efficient CPU usage. ...and the best thing is: you don't have to learn fancy slicing tricks to circumvent the above mentioned issues, you just code your algorithm straight forward.

Yes, this is clear. But what you would like is to be able to partition the loop over many cores, workers, nodes, etc. in a fairly easy or transparent manner.

In my opinion, it would be an easy exercise to work through https://root.cern/doc/master/df103__NanoAODHiggsAnalysis_8py.html and implement it in Julia, but it requires time, which probably should be invested by someone who needs it ;)

I do agree, it requires work and the time should be invested somehow.

peremato avatar Aug 30 '21 14:08 peremato

My fear is that, due to the auxiliary code around making plots and histograms and naming things, it won't be clear to readers, because it's too long to hold in (our) short-term memory.

To make a constructive example, I think comparing the double loop: https://root.cern/doc/master/df103__NanoAODHiggsAnalysis__python_8h_source.html#l00026

would be clear. And then compare two filters, and compare making two histograms. Basically, comparing 20 filters and making 20 histograms doesn't add to the comparison and, IMHO, only makes the examples impossible to comprehend unless users already know Julia.

Moelf avatar Aug 30 '21 14:08 Moelf

No, if df is a RDataFrame, then there is no many loops nor many instantiations of temporary datasets. Single loop and no large temporaries.

Ah OK, sorry for my ignorance. This makes the comparison even more interesting :)

tamasgal avatar Aug 30 '21 14:08 tamasgal

To make a constructive example, I think comparing the double loop: https://root.cern/doc/master/df103__NanoAODHiggsAnalysis__python_8h_source.html#l00026 would be clear. And then compare two filters, and comparing making two histograms.

Yes, I do agree it is good to start with.

peremato avatar Aug 30 '21 14:08 peremato

ok, let me try to write a blog-like article that compares a few "kernel" functions like that and builds up to a mini full analysis with two cuts and two histograms being filled

Moelf avatar Aug 30 '21 14:08 Moelf

I'm writing the said document already and want to put some key observations here for posterity, without referring to the full document. This is based on the example at https://root.cern/doc/master/df103__NanoAODHiggsAnalysis_8py_source.html, the function defined on line 145. If you look closely, lines 151 and 159 correspond to two functions in the https://root.cern/doc/master/df103__NanoAODHiggsAnalysis__python_8h_source.html header file. In that header file, pay attention to lines 41 and 71.

We see that the invariant mass of the muon pairs is computed multiple times. This is potentially expensive, and we can imagine the same situation happening for a more expensive variable computation. Now, there's .Define(), but it is awkward to use, especially if your expensive variable is optional and changes from event to event (e.g. it's not clear which two leptons make up the Z).

You may also think we can merge the two "kernel" functions into one; the problem is that then you need to merge everything in between too, because these two .Filter()s are not called right after one another. Basically, you end up writing a bigger and bigger (would-be) for-loop body when you merge kernels for performance reasons.
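To make the recomputation point concrete, here is a sketch where a hypothetical expensive quantity is computed once per event and reused by every later cut; `expensive_mass` is a placeholder, not the actual tutorial code:

```julia
# Placeholder for some costly per-event calculation (not real physics).
expensive_mass(pts) = sqrt(2 * pts[1] * pts[2])

# In a plain loop, the value is computed exactly once per event and
# simply flows to every later cut, with no .Define() bookkeeping.
function count_cuts(events)
    n_loose = n_tight = 0
    for evt in events
        m = expensive_mass(evt.Muon_pt)   # computed once
        m > 30 && (n_loose += 1)          # first cut reuses m
        m > 60 && (n_tight += 1)          # second cut reuses the same m
    end
    return (n_loose, n_tight)
end
```

With two toy events, `count_cuts([(Muon_pt = [30.0, 20.0],), (Muon_pt = [50.0, 40.0],)])` returns `(2, 1)`: both events pass the loose cut, only the second passes the tight one.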

Moelf avatar Sep 01 '21 21:09 Moelf

I am not sure I am following, although I do see that there is some recalculation that could perhaps be saved between the filtering steps. The .Define() virtually adds a new column to the dataframe and works event by event. It is equivalent to declaring and setting a temporary variable in the event loop, which can be used in a subsequent filter expression. Instead of 'defining' the indices of the leptons that form the Zs, you could also return the masses between them in addition. In C++ this is not as easy as in Julia and you would need to return a std::pair<>. In my opinion this is not too important at this stage.

peremato avatar Sep 02 '21 07:09 peremato

It is equivalent to declare and set a temporary variable in the event loop, which can be used in a subsequent filter expression

indeed, but consider this pseudocode:

def filter_kernel1():
    condition1 = calculation...
    if condition1:
        variable1 = ...
    else:
        return false
    return true

def filter_kernel2():
    calculation_with_early_exit...
    return true

def filter_kernel3():
    # re-using variable1 here means making condition1 and variable1 two `.Define()`s,
    # or merging kernels 1, 2, 3 into one big blob

and in filter_kernel3, depending on whether variable1 exists, you want to re-use it. And there are other Filters between these two. Now the difficulty is that you need to "break up" every kernel at the places where you have new variables and conditionals (manual SSA?), or you can write them in one kernel function, but only if you also write everything logically between kernel1 and kernel3 into the same function as well.

My overall impression is that there seems to be not much gain in terms of clarity (string-based, split between C++ and Python) or functionality/performance (users still need to know the order of the filters, avoid repeating calculations, split intermediate variables into .Define()s, and break up or merge kernel functions accordingly), when compared to a for-loop that is as readable as pseudocode.

Moelf avatar Sep 02 '21 11:09 Moelf

btw, since:

mutable struct LazyBranch{T,J,B} <: AbstractVector{T}
    f::ROOTFile
...

converting a lazy tree to a DataFrame(; copycols = false) should just work; we could maybe add an integration test one day.

Moelf avatar Nov 12 '22 04:11 Moelf