UnROOT.jl
UnROOT.jl copied to clipboard
Reading structs into higher-dimensional Arrays
ROOT offers different ways to store arrays or even structs in TBranches
. I added a small ROOT-File (test_array.root.zip) which contains a TTree
called 'arrays' which contains a scalar, a 6d-vector and a 2x3-matrix and another TTree
called 'structs' which contains again the same scalar as 'arrays' and a struct which resembles again a 2x3-matrix but stored as a 6d-vec and is defined in the following way
struct m23 {double el[6];};
So the file was created in the following way (may be useful when trying to retrieve data in Julia):
struct m23 {double el[6];};
m23 mat_23 = {1.0,2.0,3.0,4.0,5.0,6.0};
int n = 1;
double vec[6] = {1.0,2.0,3.0,4.0,5.0,6.0};
double mat[2][3] = {{1.0,2.0,3.0},{4.0,5.0,6.0}};
TFile *file = new TFile("test_array.root","Recreate");
TTree *tree_arr = new TTree("arrays","tree filled with arrays")
TTree *tree_str = new TTree("structs","tree filled with struct");
tree_arr -> Branch("nInt",&n);
tree_arr -> Branch("6dVec",&vec,"[6]/D");
tree_arr -> Branch("2x3Mat",&mat,"[2][3]/D");
tree_arr -> Fill();
tree_str -> Branch("nInt",&n);
tree_str -> Branch("2x3mat",&mat,"el[6]/D");
tree_str -> Fill();
file -> Write("",TObject::kOverwrite);
In the ROOT command-line the TTrees
look like this in the end
root [1] arrays -> Scan();
***********************************************************
* Row * Instance * nInt * 6dVec.6dV * 2x3Mat.2x *
***********************************************************
* 0 * 0 * 1 * 1 * 1 *
* 0 * 1 * 1 * 2 * 2 *
* 0 * 2 * 1 * 3 * 3 *
* 0 * 3 * 1 * 4 * 4 *
* 0 * 4 * 1 * 5 * 5 *
* 0 * 5 * 1 * 6 * 6 *
***********************************************************
root [2] structs -> Scan();
***********************************************
* Row * Instance * nInt * 2x3mat.2x *
***********************************************
* 0 * 0 * 1 * 1 *
* 0 * 1 * 1 * 2 *
* 0 * 2 * 1 * 3 *
* 0 * 3 * 1 * 4 *
* 0 * 4 * 1 * 5 *
* 0 * 5 * 1 * 6 *
***********************************************
When I try to read this in Julia using the UnROOT package I get the following
julia> file=ROOTFile("/hosts/nashome/riederer_bernd/test_array.root")
ROOTFile("/hosts/nashome/riederer_bernd/test_array.root") with 2 entries and 17 streamers.
julia> keys(file)
2-element Array{String,1}:
"arrays"
"structs"
julia> array(file,"arrays")
ERROR: Branches with multiple leaves are not supported yet.
Stacktrace:
[1] #array#120(::Bool, ::typeof(array), ::ROOTFile, ::String) at ~/.juliapro/JuliaPro_v1.2.0-1/packages/UnROOT/KN5MU/src/root.jl:153
[2] array(::ROOTFile, ::String) at ~/.juliapro/JuliaPro_v1.2.0-1/packages/UnROOT/KN5MU/src/root.jl:139
[3] top-level scope at none:0
julia> array(file,"structs")
ERROR: Branches with multiple leaves are not supported yet.
Stacktrace:
[1] #array#120(::Bool, ::typeof(array), ::ROOTFile, ::String) at ~/.juliapro/JuliaPro_v1.2.0-1/packages/UnROOT/KN5MU/src/root.jl:153
[2] array(::ROOTFile, ::String) at ~/.juliapro/JuliaPro_v1.2.0-1/packages/UnROOT/KN5MU/src/root.jl:139
[3] top-level scope at none:0
julia> DataFrame(file,"arrays")
1×3 DataFrame
│ Row │ nInt │ 6dVec │ 2x3Mat │
│ │ Int32 │ Float64 │ Float64 │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ 1.0 │ 1.0 │
julia> DataFrame(file,"structs")
1×2 DataFrame
│ Row │ nInt │ 2x3mat │
│ │ Int32 │ Float64 │
├─────┼───────┼─────────┤
│ 1 │ 1 │ 1.0 │
It seems that, while array
is refusing to handle multidimensional objects in a TBranch
, the function DataFrame
is able to get the scalar and the first entry of each multidimensional object.
I tested this with another file and DataFrame
returns correctly all Rows of scalar-entries and next to it in each case the first element of the higher-dimensional object. So it looks like this:
julia> DataFrame(file,"structs")
1×5 DataFrame
│ Row │ nInt │ 2x3mat │
│ │ Int32 │ Float64 │
├─────┼───────┼───────────┤
│ 1 │ 1 │ mat.el[0] │
│ 2 │ 2 │ mat.el[0] │
│ 3 │ 3 │ mat.el[0] │
│ 4 │ 4 │ mat.el[0] │
I would like to request to extend the support for multidimensional objects at least for the array
function.
Indeed, the problem here is that currently there is no support for so-called jagged arrays from multiple leaves in array()
.
The structure information of the leaves is encoded in the fLeaves
of the TTree
and there you can see the multi-dimensionality is basically given by the fTitle
, which is [6]
for 6dVec
(meaning it's a TLeafD
with 6 values per entry) and [2][3]
for 2x3Mat
, the 2x3 matrix, obviously. Here is the output where you can inspect the meta information of the three leaves:
julia> f["arrays"].fLeaves
UnROOT.TObjArray("", 0, Any[UnROOT.TLeafI
fName: String "nInt"
fTitle: String "nInt"
fLen: Int32 1
fLenType: Int32 4
fOffset: Int32 0
fIsRange: Bool false
fIsUnsigned: Bool false
fLeafCount: UInt32 0x00000000
fMinimum: Int32 0
fMaximum: Int32 0
, UnROOT.TLeafD
fName: String ""
fTitle: String "[6]"
fLen: Int32 6
fLenType: Int32 8
fOffset: Int32 0
fIsRange: Bool false
fIsUnsigned: Bool false
fLeafCount: UInt32 0x00000000
fMinimum: Float64 0.0
fMaximum: Float64 0.0
, UnROOT.TLeafD
fName: String ""
fTitle: String "[2][3]"
fLen: Int32 6
fLenType: Int32 8
fOffset: Int32 0
fIsRange: Bool false
fIsUnsigned: Bool false
fLeafCount: UInt32 0x00000000
fMinimum: Float64 0.0
fMaximum: Float64 0.0
])
I think that parsing this string is the only way to assemble the correct structure, so the first thing is to dynamically create the correct type (Julia struct
) for this.
Here is a very basic (and ugly) hack to get the first entry:
julia> @UnROOT.io struct A6DVec
a::Float64
b::Float64
c::Float64
d::Float64
e::Float64
f::Float64
end
julia> UnROOT.unpack(IOBuffer(array(f, "arrays/6dVec"; raw=true)[1]), A6DVec)
A6DVec(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
The @UnROOT.io
macro can be used to create a struct
and a corresponding unpack
method which will use big-endian parsing for all the fields.
I also looked at the file with uproot
and uproot4
(the Python package I usually use as a reference and to debug) but apparently it has problems with it. It would be nice if we also create an issue in https://github.com/scikit-hep/uproot4 or just ping @jpivarski (sorry Jim 🙈)
If you want, I can do it for you (create an issue there) and discuss the details, since I guess I first need to understand all the underlying stuff before I can think of an approach to tackle this in the Julia implementation.
Here is a quick example how it fails in uproot4
:
In [1]: import uproot4
In [2]: f = uproot4.open("/Users/tamasgal/Downloads/test_array.root")
In [3]: f["arrays"]["6dVec"].array()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-0ed0e403bdf9> in <module>
----> 1 f["arrays"]["6dVec"].array()
~/.virtualenvs/km3net/lib/python3.7/site-packages/uproot4/behaviors/TBranch.py in array(self, interpretation, entry_start, entry_stop, decompression_executor, interpretation_executor, array_cache, library)
1620 interpretation_executor,
1621 library,
-> 1622 arrays,
1623 )
1624
~/.virtualenvs/km3net/lib/python3.7/site-packages/uproot4/behaviors/TBranch.py in _ranges_or_baskets_to_arrays(hasbranches, ranges_or_baskets, branchid_interpretation, entry_start, entry_stop, decompression_executor, interpretation_executor, library, arrays)
516
517 elif isinstance(obj, tuple) and len(obj) == 3:
--> 518 uproot4.source.futures.delayed_raise(*obj)
519
520 else:
~/.virtualenvs/km3net/lib/python3.7/site-packages/uproot4/source/futures.py in delayed_raise(exception_class, exception_value, traceback)
35 exec("raise exception_class, exception_value, traceback")
36 else:
---> 37 raise exception_value.with_traceback(traceback)
38
39
~/.virtualenvs/km3net/lib/python3.7/site-packages/uproot4/behaviors/TBranch.py in basket_to_array(basket)
482 len(basket_arrays[basket.basket_num]),
483 interpretation,
--> 484 branch.file.file_path,
485 )
486 )
ValueError: basket 0 in tree/branch /arrays;1:6dVec has the wrong number of entries (expected 1, obtained 6) when interpreted as AsDtype('>f8')
in file /Users/tamasgal/Downloads/test_array.root
In [4]: f["arrays"]["6dVec"].show()
name | typename | interpretation
---------------------+----------------------+-----------------------------------
6dVec | double | AsDtype('>f8')
Btw. you can see that both uproot
and uproot4
are able to nicely read the structs:
In [5]: f["arrays"].show()
name | typename | interpretation
---------------------+----------------------+-----------------------------------
nInt | int32_t | AsDtype('>i4')
6dVec | double | AsDtype('>f8')
2x3Mat | double | AsDtype('>f8')
In [6]: f["structs"].show()
name | typename | interpretation
---------------------+----------------------+-----------------------------------
nInt | int32_t | AsDtype('>i4')
2x3mat | double[6] | AsDtype("('>f8', (6,))")
In [7]: f["structs/2x3mat"].array()
Out[7]: <Array [[1, 2, 3, 4, 5, 6]] type='1 * 6 * float64'>
In [9]: import uproot
In [10]: f = uproot.open("/Users/tamasgal/Downloads/test_array.root")
In [11]: f["structs/nInt"].array()
Out[11]: array([1], dtype=int32)
In [12]: f["structs/2x3mat"].array()
Out[12]: array([[1., 2., 3., 4., 5., 6.]])
That's weird—I'm surprised this file is failing in not just one Uproot, but both. It is a type that I've addressed and have tests for. Normally, I'd say that Uproot would be a good model to follow for this case, but apparently not.
What it ought to be doing is reading the fixed-size dimensions from the TLeaf fTitle and instructing NumPy to expect dtype("f8", (2, 3))
(a dtype with shape information baked into it, for all dimensions but the first of the array (1 entry in this case)). Then it should expect an entry's itemsize to be 6*8 bytes, rather than 8 bytes.
I'm glad it got there leaf-list as a NumPy structured array (which was then converted into Awkward). That's a similar thing: the fields are baked into a dtype. It's a different way of reading than data split among branches.
For better debugging, you may want to pass library="np"
to the array
method, so that these are not also converted into NumPy. For jagged arrays, library="np"
is more conversation, not less, but it's the opposite for these rectilinear arrays.
I just caught up to your last example—that "structs/2x3mat"
is not a leaf-list/structured array (fields baked into dtype), it's a fixed-size multidimensional array (shape baked into dtype). Is it just the case with two fixed-size indexes that isn't working? I don't think I have tests for that, and it could be failing. In that case, Uproot is a good model to follow—there's code to read that handles that, but it's apparently broken in some cases.
Oh dear, you are fast 😅
Alright, thanks! I will dig deeper... I just created an issue in uproot4
, to follow the standard procedures.
I just caught up to your last example—that
"structs/2x3mat"
is not a leaf-list/structured array (fields baked into dtype), it's a fixed-size multidimensional array (shape baked into dtype). Is it just the case with two fixed-size indexes that isn't working? I don't think I have tests for that, and it could be failing. In that case, Uproot is a good model to follow—there's code to read that handles that, but it's apparently broken in some cases.
Yes, so the same data is saved in structs
and in arrays
, former is a real struct, latter is just a simple array. We can continue in https://github.com/scikit-hep/uproot4/issues/74
I am not at a PC currently but I guess this is fixed too by #22. So if someone has time to check this could be closed too.
PS: @Moelf thanks a lot for the effort in the recent days/weeks. I can finally switch back to UnROOT now and will let you know if I run into new issues.
I just looked at a bit, sadly no. Your branch is extra jagged by the fact that each element is higher-dimensional and they are certainly some kind of nested vector in C++? I've never seen matrix branch in my (LHC experiment) life sorry, but I will be happy to learn from Jim later on!
Okay, seems reasonable for the arrays
tree, since this really contains a basic two-dimensional C-Array. This probably complicates things a bit.
However the struct
array is basically a wrapped one-dimensional C-array and this should not be different than what was used in #22. So this is interesting, that it still not works. I try to have a look at that tomorrow.
PS: In experiment there is probably no reason to use higher-dimensional arrays. But for me as a theoretician I am storing a lot of matrix-valued observables from Lattice-simulations and reading these results back into Julia would be really nice 😊
My current workaround uses the Julia read
- and pipeline
-functions to get the values from the ROOT-CLI, which is of course pretty slow and messy.