
Need tool to get dask dataframes from config files

Open acampove opened this issue 7 months ago • 11 comments

Hi,

I am looking for something like:

ddfg = DDataFrameGetter(config_path='file.yaml')
ddf  = ddfg.get_dataframe()

where file.yaml has:

tree   : tree_name
primary_keys:
  - index
files :
  - file_001.root
  - file_002.root
  - file_003.root
samples:
  - /tmp/tests/dmu/rfile/main
  - /tmp/tests/dmu/rfile/frnd

The primary keys, for real HEP data, would be EVENTNUMBER, RUNNUMBER, etc.
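
For instance, a rough sketch of what I have in mind, assuming the reading can be done with uproot.dask and dask_awkward.to_dataframe (the class body below is only illustrative):

import yaml
import uproot
import dask_awkward as dak

class DDataFrameGetter:
    '''Builds a Dask DataFrame from the ROOT files listed in a YAML config'''

    def __init__(self, config_path):
        with open(config_path) as ifile:
            self._cfg = yaml.safe_load(ifile)

    def get_dataframe(self):
        tree = self._cfg['tree']
        # uproot.dask accepts a list of 'path:tree' specifiers
        array = uproot.dask([f'{path}:{tree}' for path in self._cfg['files']])
        return dak.to_dataframe(array)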

Cheers.

acampove avatar May 18 '25 10:05 acampove

After some research, it seems that we (the users) can completely remove ROOT DataFrames from our workflow by using dask. This means that the getter I am describing should be implemented with even higher priority.

FYI, ROOT dataframes are not Python friendly and are plagued with bugs. I would rather use something that has been more thoroughly tested, is Python friendly, and even gives students more useful skills.

acampove avatar May 18 '25 12:05 acampove

In any case, implementing this myself was a matter of a couple of hours; you can see:

documentation
class
test

and you can take whatever you need from there.

Cheers.

acampove avatar May 18 '25 12:05 acampove

Hi @acampove, as ROOT DataFrames are part of ROOT (uproot is independent of it), this won't work, but uproot can go directly to dask, see here. This is probably what you're looking for?

jonas-eschle avatar May 20 '25 08:05 jonas-eschle

Hi @acampove, as ROOT DataFrames are part of ROOT (uproot is independent of it), this won't work, but uproot can go directly to dask, see here. This is probably what you're looking for?

Hello @jonas-eschle ,

Thanks for your comment. I was not specific enough: my purpose was to do precisely that, to go directly from ROOT files to Dask DataFrames. That is why my tool is called DDataFrameGetter; it is a Dask DataFrame getter.

There seems to be a function to do that in uproot, but not with friend trees. Basically, I was looking for the non-ROOT equivalent of what is in the Distributed From Spec here, except that I do not want to be left with a ROOT dataframe; I want something more Python friendly.
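
For illustration, a rough sketch of how the friend-tree case might be handled without ROOT, assuming uproot.dask and dask_awkward.to_dataframe (the file names and key branches are hypothetical):

import uproot
import dask_awkward as dak

# Read the main tree and the 'friend' tree as separate Dask DataFrames
main   = dak.to_dataframe(uproot.dask('main.root:tree_name'))
friend = dak.to_dataframe(uproot.dask('frnd.root:tree_name'))

# Align them on the primary keys instead of relying on ROOT's friend mechanism
ddf = main.merge(friend, on=['EVENTNUMBER', 'RUNNUMBER'])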

Cheers.

acampove avatar May 20 '25 08:05 acampove

OK, I am back to this. I am not sure I understand the documentation here.

From:

>>> uproot.dask(root_file)
dask.awkward<from-uproot, npartitions=1>

>>> uproot.dask(root_file, library='np')
{'Type': dask.array<Type-from-uproot, shape=(2304,), dtype=object, chunksize=(2304,), chunktype=numpy.ndarray>, ...}

it seems that I have to either pick:

  • An awkward array, which I do not really intend to use, because it would bring a learning curve that I am not willing to go through. I already know how to use pandas dataframes, and learning that already took time.
  • A dictionary of dask arrays? I am not sure why I would want that; should I use my own code to merge those arrays somehow?

This design of yours puzzles me. As a user, what I want is something simple:

ROOT TTree -> Dask DataFrame

My branches are not jagged arrays; they are just floats, ints, or bools. It seems that the option that would achieve this is library='pd', which is precisely the option you haven't implemented.

acampove avatar May 31 '25 12:05 acampove

@acampove - thanks for following up on this issue. Perhaps you could use https://awkward-array.org/doc/main/reference/generated/ak.to_dataframe.html
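
For a flat (non-jagged) tree, that conversion is a one-liner; a minimal sketch, assuming an eagerly read tree and a hypothetical file name (note this yields an in-memory pandas DataFrame, not a Dask one):

import uproot
import awkward as ak

# Read the tree eagerly into an awkward array, then convert to pandas
array = uproot.open('file_001.root:tree_name').arrays()
df = ak.to_dataframe(array)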

ianna avatar Jun 01 '25 03:06 ianna

That's true, and I am not sure why it is not implemented.

it seems that I have to either pick:

  • An awkward array, which I do not really intend to use, because it would bring a learning curve that I am not willing to go through. I already know how to use pandas dataframes, and learning that already took time.

Awkward arrays are a generalization of numpy arrays; they behave very similarly in most cases. It's not an either-or; it's rather that awkward can handle cases that pandas can't (ragged arrays). Which you don't need, so pure np should be sufficient.
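
To illustrate, a small sketch (not from this thread): for flat data an awkward array can be used much like a numpy array:

import awkward as ak

a = ak.Array([1.0, 2.0, 3.0])  # flat awkward array
print(a * 2)                   # numpy-style ufuncs work element-wise
print(ak.mean(a))              # reducers mirror their numpy counterparts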

  • A dictionary of dask arrays? I am not sure why I would want that; should I use my own code to merge those arrays somehow?

However, a lot of the time it is just volunteers like us, summer students, etc. who implement features here, and since you seem to have a good understanding of DataFrames and Dask, maybe you want to take a shot at it? In general, I don't imagine this to be too complicated, as a dict of arrays -> pandas is trivial, and dask does support pandas (note that it's not actual pandas, but dask's pandas-like dataframe, that should be used, AFAIU). Would you want to give it a try? Maybe figure out why it's not there or what would be needed?
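
For reference, a minimal sketch of that dict-of-arrays route, assuming flat branches that share the same chunking (a hypothetical starting point, not the missing implementation):

import uproot
import dask.dataframe as dd

arrays = uproot.dask('file_001.root:tree_name', library='np')  # dict of dask arrays
# Turn each dask array into a named dask Series, then join them column-wise;
# this assumes all branches have identical partitioning
series = [dd.from_dask_array(arr, columns=name) for name, arr in arrays.items()]
ddf = dd.concat(series, axis=1)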

This design of yours puzzles me. As a user, what I want is something simple:

ROOT TTree -> Dask DataFrame

I would just clarify that you as a user want this (well, me personally too ;) ); however, many don't. Awkward arrays are most often used (in my experience) to process jagged arrays, so the focus was rather on getting awkward in.

jonas-eschle avatar Jun 02 '25 08:06 jonas-eschle

@acampove library="pd" has "kinda" been implemented, but it doesn't work with dask >= 2025 because the dask dataframe constructor changed. You can use dask-awkward to convert the dask-awkward array into a dask dataframe; an example with a flat tree is below. dak.to_dataframe will raise an error if you have dask >= 2025; it should work fine with 2024.12.*.

In [1]: from skhep_testdata import data_path

In [2]: import uproot

In [3]: import dask_awkward as dak

In [4]: filename = data_path("uproot-small-flat-tree.root")

In [5]: uproot.dask(filename).compute()
Out[5]: <Array [{Int32: 0, Int64: 0, ...}, ..., {...}] type='100 * {Int32: int32, I...'>

In [6]: dakarray = uproot.dask(filename)

In [7]: dakarray
Out[7]: dask.awkward<from-uproot, npartitions=1>

In [8]: df = dak.to_dataframe(dakarray)

In [9]: df
Out[9]:
Dask DataFrame Structure:
               Int32  Int64  UInt32  UInt64  Float32  Float64     Str ArrayInt32 ArrayInt64 ArrayUInt32 ArrayUInt64 ArrayFloat32 ArrayFloat64      N SliceInt32 SliceInt64 SliceUInt32 SliceUInt64 SliceFloat32 SliceFloat64
npartitions=1
               int32  int64  uint32  uint64  float32  float64  string      int32      int64      uint32      uint64      float32      float64  int32      int32      int64      uint32      uint64      float32      float64
                 ...    ...     ...     ...      ...      ...     ...        ...        ...         ...         ...          ...          ...    ...        ...        ...         ...         ...          ...          ...
Dask Name: to_pyarrow_string, 3 graph layers

In [10]: df.compute()
Out[10]:
                Int32  Int64  UInt32  UInt64  Float32  Float64      Str  ArrayInt32  ArrayInt64  ArrayUInt32  ArrayUInt64  ArrayFloat32  ArrayFloat64  N  SliceInt32  SliceInt64  SliceUInt32  SliceUInt64  SliceFloat32  SliceFloat64
entry subentry
1     0             1      1       1       1      1.0      1.0  evt-001           1           1            1            1           1.0           1.0  1           1           1            1            1           1.0           1.0
2     0             2      2       2       2      2.0      2.0  evt-002           2           2            2            2           2.0           2.0  2           2           2            2            2           2.0           2.0
      1             2      2       2       2      2.0      2.0  evt-002           2           2            2            2           2.0           2.0  2           2           2            2            2           2.0           2.0
3     0             3      3       3       3      3.0      3.0  evt-003           3           3            3            3           3.0           3.0  3           3           3            3            3           3.0           3.0
      1             3      3       3       3      3.0      3.0  evt-003           3           3            3            3           3.0           3.0  3           3           3            3            3           3.0           3.0
...               ...    ...     ...     ...      ...      ...      ...         ...         ...          ...          ...           ...           ... ..         ...         ...          ...          ...           ...           ...
99    4            99     99      99      99     99.0     99.0  evt-099          99          99           99           99          99.0          99.0  9          99          99           99           99          99.0          99.0
      5            99     99      99      99     99.0     99.0  evt-099          99          99           99           99          99.0          99.0  9          99          99           99           99          99.0          99.0
      6            99     99      99      99     99.0     99.0  evt-099          99          99           99           99          99.0          99.0  9          99          99           99           99          99.0          99.0
      7            99     99      99      99     99.0     99.0  evt-099          99          99           99           99          99.0          99.0  9          99          99           99           99          99.0          99.0
      8            99     99      99      99     99.0     99.0  evt-099          99          99           99           99          99.0          99.0  9          99          99           99           99          99.0          99.0

[450 rows x 20 columns]

In [11]: dakarray = uproot.dask(filename, steps_per_file=4)

In [12]: dakarray
Out[12]: dask.awkward<from-uproot, npartitions=4>

In [13]: df = dak.to_dataframe(dakarray)

In [14]: df
Out[14]:
Dask DataFrame Structure:
               Int32  Int64  UInt32  UInt64  Float32  Float64     Str ArrayInt32 ArrayInt64 ArrayUInt32 ArrayUInt64 ArrayFloat32 ArrayFloat64      N SliceInt32 SliceInt64 SliceUInt32 SliceUInt64 SliceFloat32 SliceFloat64
npartitions=4
               int32  int64  uint32  uint64  float32  float64  string      int32      int64      uint32      uint64      float32      float64  int32      int32      int64      uint32      uint64      float32      float64
                 ...    ...     ...     ...      ...      ...     ...        ...        ...         ...         ...          ...          ...    ...        ...        ...         ...         ...          ...          ...
                 ...    ...     ...     ...      ...      ...     ...        ...        ...         ...         ...          ...          ...    ...        ...        ...         ...         ...          ...          ...
                 ...    ...     ...     ...      ...      ...     ...        ...        ...         ...         ...          ...          ...    ...        ...        ...         ...         ...          ...          ...
                 ...    ...     ...     ...      ...      ...     ...        ...        ...         ...         ...          ...          ...    ...        ...        ...         ...         ...          ...          ...
Dask Name: to_pyarrow_string, 3 graph layers

In [15]: df.compute()
Out[15]:
                Int32  Int64  UInt32  UInt64  Float32  Float64      Str  ArrayInt32  ArrayInt64  ArrayUInt32  ArrayUInt64  ArrayFloat32  ArrayFloat64  N  SliceInt32  SliceInt64  SliceUInt32  SliceUInt64  SliceFloat32  SliceFloat64
entry subentry
1     0             1      1       1       1      1.0      1.0  evt-001           1           1            1            1           1.0           1.0  1           1           1            1            1           1.0           1.0
2     0             2      2       2       2      2.0      2.0  evt-002           2           2            2            2           2.0           2.0  2           2           2            2            2           2.0           2.0
      1             2      2       2       2      2.0      2.0  evt-002           2           2            2            2           2.0           2.0  2           2           2            2            2           2.0           2.0
3     0             3      3       3       3      3.0      3.0  evt-003           3           3            3            3           3.0           3.0  3           3           3            3            3           3.0           3.0
      1             3      3       3       3      3.0      3.0  evt-003           3           3            3            3           3.0           3.0  3           3           3            3            3           3.0           3.0
...               ...    ...     ...     ...      ...      ...      ...         ...         ...          ...          ...           ...           ... ..         ...         ...          ...          ...           ...           ...
24    4            99     99      99      99     99.0     99.0  evt-099          99          99           99           99          99.0          99.0  9          99          99           99           99          99.0          99.0
      5            99     99      99      99     99.0     99.0  evt-099          99          99           99           99          99.0          99.0  9          99          99           99           99          99.0          99.0
      6            99     99      99      99     99.0     99.0  evt-099          99          99           99           99          99.0          99.0  9          99          99           99           99          99.0          99.0
      7            99     99      99      99     99.0     99.0  evt-099          99          99           99           99          99.0          99.0  9          99          99           99           99          99.0          99.0
      8            99     99      99      99     99.0     99.0  evt-099          99          99           99           99          99.0          99.0  9          99          99           99           99          99.0          99.0

[450 rows x 20 columns]

ikrommyd avatar Jun 02 '25 09:06 ikrommyd

@martindurant, could you confirm whether uproot.dask -> dak.to_dataframe is a valid method to get a dask dataframe from a ROOT file? This shouldn't do any unnecessary data reading, right? It seems not, judging from this code block: https://github.com/dask-contrib/dask-awkward/blob/75d19992c468c10e090f1fc959db422aa54571e0/src/dask_awkward/lib/io/io.py#L446-L496

ikrommyd avatar Jun 02 '25 10:06 ikrommyd

Right, the design is that column optimisation happens at the time of the to_dataframe call, so any columns still in the object at that point will always get read, whether or not they are used downstream. Dataframes themselves do not provide this optimisation (or, in dask-expr they might, but it doesn't integrate with what we have).
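
In practice that suggests pruning fields on the dask-awkward side before converting; a hedged sketch, reusing the filename and branch names from the flat-tree example above:

import uproot
import dask_awkward as dak

dakarray = uproot.dask(filename)
# Select only the needed fields before to_dataframe, so only those branches get read
df = dak.to_dataframe(dakarray[['Int32', 'Float64']])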

Note that to_dataframe is broken in recent dask, so you probably need the December version.

martindurant avatar Jun 02 '25 13:06 martindurant

In my opinion, dataframes do not describe HEP data well, and anything you want to do with a dataframe you can probably do with awkward arrays. Personally, if I had to deal with ROOT files and wanted to use dask, I would just use uproot.dask and dask_awkward directly and never try to convert into a dask dataframe.
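
For completeness, a rough sketch of that dataframe-free workflow (the file, branch name, and cut are hypothetical):

import uproot
import dask_awkward as dak

events = uproot.dask('file_001.root:tree_name')
selected = events[events.Float64 > 50.0]     # lazy selection, no dataframe needed
print(dak.mean(selected.Float64).compute())  # only the required branches are read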

ikrommyd avatar Jun 02 '25 13:06 ikrommyd