Need tool to get dask dataframes from config files
Hi,
I am looking for something like:
ddfg = DDataFrameGetter(config_path='file.yaml')
ddf = ddfg.get_dataframe()
where file.yaml has:
tree: tree_name
primary_keys:
  - index
files:
  - file_001.root
  - file_002.root
  - file_003.root
samples:
  - /tmp/tests/dmu/rfile/main
  - /tmp/tests/dmu/rfile/frnd
The primary keys, for real HEP data, would be EVENTNUMBER, RUNNUMBER, etc.
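A minimal sketch of what I have in mind is below; it assumes that something like uproot.dask plus dask_awkward.to_dataframe can do the reading, and it leaves out friend trees and the primary keys for brevity:
import yaml
import uproot
import dask_awkward as dak

class DDataFrameGetter:
    def __init__(self, config_path):
        with open(config_path) as ifile:
            self._cfg = yaml.safe_load(ifile)

    def get_dataframe(self):
        tree = self._cfg['tree']
        # only the first sample in this sketch; the friend samples would
        # have to be aligned through the primary keys
        sample = self._cfg['samples'][0]
        paths = [f'{sample}/{fname}:{tree}' for fname in self._cfg['files']]
        return dak.to_dataframe(uproot.dask(paths))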
Cheers.
After some research, it seems that we (the users) can completely remove ROOT DataFrames from our workflow by using dask. This means that the getter I am proposing should be implemented with even higher priority.
FYI, ROOT dataframes are not python friendly and are plagued with bugs. I would rather use something that has been more thoroughly tested, is python friendly, and even gives students more useful skills.
In any case, implementing this myself was a matter of a couple of hours; you can see:
and you can take whatever you need from there.
Cheers.
Hi @acampove , as ROOT DataFrames are ROOT (uproot is independent of it), this won't work, but uproot can go directly to dask, see here. This is probably what you're looking for?
Hello @jonas-eschle ,
Thanks for your comment. I was not specific enough: my purpose was to do precisely that, to go directly from ROOT files to Dask DataFrames. That is why my tool is called DDataFrameGetter, i.e. a Dask dataframe getter.
There seems to be a function to do that in uproot, however not with friend trees. Basically, I was looking for the non-ROOT equivalent of what is in the Distributed From Spec here. However, I do not want to be left with a ROOT dataframe; I want something more python friendly.
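Something along these lines is what I mean: a sketch that joins the main and friend samples on the primary key, with paths and names taken from my config above (I have not checked the shuffle cost of set_index):
import uproot
import dask_awkward as dak

def sample_df(sample):
    # hypothetical paths and names, taken from the YAML config in my first post
    files = [f'{sample}/file_{i:03d}.root:tree_name' for i in (1, 2, 3)]
    return dak.to_dataframe(uproot.dask(files))

main = sample_df('/tmp/tests/dmu/rfile/main').set_index('index')
frnd = sample_df('/tmp/tests/dmu/rfile/frnd').set_index('index')
ddf = main.join(frnd, rsuffix='_frnd')  # friend-tree style merge on the key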
Cheers.
Ok, I am back to this. I am not sure I understand the documentation here.
From:
>>> uproot.dask(root_file)
dask.awkward<from-uproot, npartitions=1>
>>> uproot.dask(root_file, library='np')
{'Type': dask.array<Type-from-uproot, shape=(2304,), dtype=object, chunksize=(2304,), chunktype=numpy.ndarray>, ...}
it seems that I have to pick either:
- An awkward array, which I do not really have any intention of using, because it would bring a learning curve that I am not willing to go through. I already know how to use pandas dataframes, and learning that already took time.
- A dictionary of dask arrays? I am not sure why I would want that; should I use my own code to merge those arrays somehow?
This design of yours puzzles me. As a user, what I want is something simple:
ROOT TTree -> Dask DataFrame
My branches are not jagged arrays; they are just floats, ints or bools. It seems that the option that would achieve this is library='pd', which is precisely the option you have not implemented.
@acampove - thanks for following up on this issue. Perhaps you could use https://awkward-array.org/doc/main/reference/generated/ak.to_dataframe.html
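For an eager (non-dask) flat record array, that would look something like this minimal illustration:
import awkward as ak

# a flat record array becomes a pandas DataFrame with one column per field
records = ak.Array([{'x': 1, 'y': 1.1}, {'x': 2, 'y': 2.2}])
df = ak.to_dataframe(records)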
That's true, and I am not sure why it is not implemented.
> it seems that I have to pick either:
> - An awkward array, which I do not really have any intention of using, because it would bring a learning curve that I am not willing to go through. I already know how to use pandas dataframes, and learning that already took time.
Awkward arrays are a generalization of numpy arrays; they behave very similarly in most cases. It's not an either-or: rather, awkward can handle cases that pandas can't (ragged arrays), which you don't need, so pure np should be sufficient.
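To illustrate the numpy-like behavior in the flat case (a minimal, hypothetical snippet):
import awkward as ak
import numpy as np

a = ak.Array([1.0, 2.0, 3.0, 4.0])  # a flat awkward array
print(np.sqrt(a))                   # numpy ufuncs dispatch to awkward
print(a[a > 2.5])                   # boolean masking, as with numpy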
> - A dictionary of dask arrays? I am not sure why I would want that; should I use my own code to merge those arrays somehow?
However, a lot of the time it is just volunteers like us, summer students, etc., who implement features here, and since you seem to have a good understanding of DataFrames and Dask, maybe you want to take a shot at it? In general, I don't imagine this to be too complicated, as dict of arrays -> pandas is trivial and dask does support pandas (note that it's not actual pandas but dask pandas that should be used, AFAIU). Would you want to give it a try? Maybe figure out why it's not there or what would be needed?
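A rough sketch of that dict-of-arrays route (path and tree name are placeholders, and I have not verified that the per-branch partitions line up):
import uproot
import dask.dataframe as dd

# dict of per-branch dask arrays
arrays = uproot.dask('file_001.root:tree_name', library='np')
# one named dask Series per branch, concatenated column-wise; this assumes
# the partitions of all branches are aligned (they come from the same tree)
series = [dd.from_dask_array(arr, columns=name) for name, arr in arrays.items()]
ddf = dd.concat(series, axis=1)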
> This design of yours puzzles me. As a user, what I want is something simple:
> ROOT TTree -> Dask DataFrame
I would just clarify that you as a user want this (well, me personally too ;) ); however, many don't. Awkward arrays are most often used (in my experience) to process jagged arrays, so the focus was rather on getting awkward in.
@acampove library="pd" has "kinda" been implemented, but it doesn't work with dask >= 2025 because the dask dataframe constructor changed. You can use dask-awkward to convert the dask-awkward array into a dask dataframe; an example with a flat tree is below. dak.to_dataframe will raise an error if you have dask >= 2025; it should work fine with 2024.12.*.
In [1]: from skhep_testdata import data_path
In [2]: import uproot
In [3]: import dask_awkward as dak
In [4]: filename = data_path("uproot-small-flat-tree.root")
In [5]: uproot.dask(filename).compute()
Out[5]: <Array [{Int32: 0, Int64: 0, ...}, ..., {...}] type='100 * {Int32: int32, I...'>
In [6]: dakarray = uproot.dask(filename)
In [7]: dakarray
Out[7]: dask.awkward<from-uproot, npartitions=1>
In [8]: df = dak.to_dataframe(dakarray)
In [9]: df
Out[9]:
Dask DataFrame Structure:
Int32 Int64 UInt32 UInt64 Float32 Float64 Str ArrayInt32 ArrayInt64 ArrayUInt32 ArrayUInt64 ArrayFloat32 ArrayFloat64 N SliceInt32 SliceInt64 SliceUInt32 SliceUInt64 SliceFloat32 SliceFloat64
npartitions=1
int32 int64 uint32 uint64 float32 float64 string int32 int64 uint32 uint64 float32 float64 int32 int32 int64 uint32 uint64 float32 float64
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: to_pyarrow_string, 3 graph layers
In [10]: df.compute()
Out[10]:
Int32 Int64 UInt32 UInt64 Float32 Float64 Str ArrayInt32 ArrayInt64 ArrayUInt32 ArrayUInt64 ArrayFloat32 ArrayFloat64 N SliceInt32 SliceInt64 SliceUInt32 SliceUInt64 SliceFloat32 SliceFloat64
entry subentry
1 0 1 1 1 1 1.0 1.0 evt-001 1 1 1 1 1.0 1.0 1 1 1 1 1 1.0 1.0
2 0 2 2 2 2 2.0 2.0 evt-002 2 2 2 2 2.0 2.0 2 2 2 2 2 2.0 2.0
1 2 2 2 2 2.0 2.0 evt-002 2 2 2 2 2.0 2.0 2 2 2 2 2 2.0 2.0
3 0 3 3 3 3 3.0 3.0 evt-003 3 3 3 3 3.0 3.0 3 3 3 3 3 3.0 3.0
1 3 3 3 3 3.0 3.0 evt-003 3 3 3 3 3.0 3.0 3 3 3 3 3 3.0 3.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... .. ... ... ... ... ... ...
99 4 99 99 99 99 99.0 99.0 evt-099 99 99 99 99 99.0 99.0 9 99 99 99 99 99.0 99.0
5 99 99 99 99 99.0 99.0 evt-099 99 99 99 99 99.0 99.0 9 99 99 99 99 99.0 99.0
6 99 99 99 99 99.0 99.0 evt-099 99 99 99 99 99.0 99.0 9 99 99 99 99 99.0 99.0
7 99 99 99 99 99.0 99.0 evt-099 99 99 99 99 99.0 99.0 9 99 99 99 99 99.0 99.0
8 99 99 99 99 99.0 99.0 evt-099 99 99 99 99 99.0 99.0 9 99 99 99 99 99.0 99.0
[450 rows x 20 columns]
In [11]: dakarray = uproot.dask(filename, steps_per_file=4)
In [12]: dakarray
Out[12]: dask.awkward<from-uproot, npartitions=4>
In [13]: df = dak.to_dataframe(dakarray)
In [14]: df
Out[14]:
Dask DataFrame Structure:
Int32 Int64 UInt32 UInt64 Float32 Float64 Str ArrayInt32 ArrayInt64 ArrayUInt32 ArrayUInt64 ArrayFloat32 ArrayFloat64 N SliceInt32 SliceInt64 SliceUInt32 SliceUInt64 SliceFloat32 SliceFloat64
npartitions=4
int32 int64 uint32 uint64 float32 float64 string int32 int64 uint32 uint64 float32 float64 int32 int32 int64 uint32 uint64 float32 float64
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: to_pyarrow_string, 3 graph layers
In [15]: df.compute()
Out[15]:
Int32 Int64 UInt32 UInt64 Float32 Float64 Str ArrayInt32 ArrayInt64 ArrayUInt32 ArrayUInt64 ArrayFloat32 ArrayFloat64 N SliceInt32 SliceInt64 SliceUInt32 SliceUInt64 SliceFloat32 SliceFloat64
entry subentry
1 0 1 1 1 1 1.0 1.0 evt-001 1 1 1 1 1.0 1.0 1 1 1 1 1 1.0 1.0
2 0 2 2 2 2 2.0 2.0 evt-002 2 2 2 2 2.0 2.0 2 2 2 2 2 2.0 2.0
1 2 2 2 2 2.0 2.0 evt-002 2 2 2 2 2.0 2.0 2 2 2 2 2 2.0 2.0
3 0 3 3 3 3 3.0 3.0 evt-003 3 3 3 3 3.0 3.0 3 3 3 3 3 3.0 3.0
1 3 3 3 3 3.0 3.0 evt-003 3 3 3 3 3.0 3.0 3 3 3 3 3 3.0 3.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... .. ... ... ... ... ... ...
24 4 99 99 99 99 99.0 99.0 evt-099 99 99 99 99 99.0 99.0 9 99 99 99 99 99.0 99.0
5 99 99 99 99 99.0 99.0 evt-099 99 99 99 99 99.0 99.0 9 99 99 99 99 99.0 99.0
6 99 99 99 99 99.0 99.0 evt-099 99 99 99 99 99.0 99.0 9 99 99 99 99 99.0 99.0
7 99 99 99 99 99.0 99.0 evt-099 99 99 99 99 99.0 99.0 9 99 99 99 99 99.0 99.0
8 99 99 99 99 99.0 99.0 evt-099 99 99 99 99 99.0 99.0 9 99 99 99 99 99.0 99.0
[450 rows x 20 columns]
@martindurant, could you confirm whether uproot.dask -> dak.to_dataframe is a valid method to get a dask dataframe from a ROOT file? This shouldn't do any unnecessary data reading, right? It seems not, from reading this code block: https://github.com/dask-contrib/dask-awkward/blob/75d19992c468c10e090f1fc959db422aa54571e0/src/dask_awkward/lib/io/io.py#L446-L496
Right, the design is that column optimisation happens at the time of the to_dataframe call, so any columns still in the object at that point will always get read, whether or not they get used downstream. Dataframes themselves do not provide this optimisation (or, in dask-expr they might, but it doesn't integrate with what we have).
Note that to_dataframe is broken in recent dask, so you probably need the December version.
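Concretely, that means pruning fields on the dask-awkward array before converting; for example, this sketch reuses the flat test tree from the example above (the two fields are chosen arbitrarily):
import uproot
import dask_awkward as dak
from skhep_testdata import data_path

dakarray = uproot.dask(data_path('uproot-small-flat-tree.root'))
# keep only the needed fields *before* to_dataframe, so the column
# optimisation prunes the read down to these two branches
df = dak.to_dataframe(dakarray[['Int64', 'Float64']])
df.compute()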
In my opinion, dataframes do not describe HEP data well, and anything you want to do with a dataframe you can probably do with awkward arrays. Personally, if I had to deal with ROOT files and wanted to use dask, I would just use uproot.dask and dask_awkward directly and never try to convert into a dask dataframe.
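For instance, a sketch of that direct route on the flat test tree from above (branch names as in the earlier output; the cut is arbitrary):
import uproot
from skhep_testdata import data_path

events = uproot.dask(data_path('uproot-small-flat-tree.root'))
# a lazy selection evaluated directly on the dask-awkward array,
# with no dataframe conversion anywhere
selected = events['Float64'][events['Int64'] > 50]
print(selected.compute())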