
Metadata availability in dataframes

[Open] acampove opened this issue 7 months ago • 6 comments

Explain what you would like to see improved and how.

Hi,

I am trying to debug a problem (maybe in RDataFrame...), and I have built a dataframe from a set of files using FromSpec. The problem seems to emerge for certain subsets of entries. Once I am looking at those entries, I would like to know more about their source, like the file name/path. Ideally I would do something like:

names = rdf.GetFileNames()

and this would tell me which files this particular dataframe draws its data from. The use case: the dataframe might be made of hundreds of files, and we might have an issue in some sections corresponding to certain files. The problem could be identified like:

rdf_bad = rdf_all.Filter('a == -999')

Does anything like this exist? I think implementing something along these lines would ease debugging considerably. Put together, the workflow I have in mind would look like the sketch below.
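(A sketch only: FromSpec is the real entry point, GetFileNames is the hypothetical method being requested here, and spec.json stands in for the actual specification file.)

import ROOT

# build one dataframe from potentially hundreds of files
df = ROOT.RDF.Experimental.FromSpec('spec.json')  # placeholder spec

# narrow things down to the faulty entries
rdf_bad = df.Filter('a == -999')

# hypothetical: list the files that feed this node's entries
names = rdf_bad.GetFileNames()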

Cheers.

ROOT version

NA

Installation method

NA

Operating system

NA

Additional context

No response

acampove, May 27 '25 07:05

Hi,

For debugging purposes, you can increase ROOT's verbosity and check what file was opened, e.g. ROOTDEBUG=1 myRdfProgram: does this work for you?
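For reference, the same verbosity can also be switched on from inside a Python script; a minimal sketch, assuming a PyROOT setup (spec.json is a placeholder):

import ROOT

ROOT.gDebug = 1  # same effect as ROOTDEBUG=1: ROOT prints debug info, including file opens

df = ROOT.RDF.Experimental.FromSpec('spec.json')  # placeholder spec
df.Count().GetValue()  # trigger the event loop; opened files are logged as it runs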

dpiparo, May 27 '25 07:05

Hi,

For debugging purposes, you can increase ROOT's verbosity and check what file was opened, e.g. ROOTDEBUG=1 myRdfProgram: does this work for you?

Hi,

No, that does not work: it shows me all the files, not the ones my dataframe range is targeting, which are the ones with the problem. In order to find the problematic entries, I need to:

  1. Make a large dataframe from potentially hundreds of files.
  2. Use Filter or some similar method to narrow things down to the faulty entries.
  3. Find out which files those entries belong to.

Cheers.

acampove, May 27 '25 07:05

Ah I see. As a temporary kludge, you can always interrupt the processing and check the last file opened.

dpiparo, May 27 '25 07:05

OK, I found the source of the problem. For simplicity I will only show a small subset of the files I am working with, and as a YAML version (which is human-friendly; I believe you should consider moving there from JSON). My config is:

samples:
  main:
    trees:
      - DecayTree
    files:
      - /home/acampove/external_ssd/Data/main/v10/mc_magdown_11453001_bd_jpsix_ee_eq_jpsiinacc_Hlt2RD_BuToKpEE_MVA_032a6393c1.root
      - /home/acampove/external_ssd/Data/main/v10/mc_magdown_11453001_bd_jpsix_ee_eq_jpsiinacc_Hlt2RD_BuToKpEE_MVA_0718aef4bd.root
      - /home/acampove/external_ssd/Data/main/v10/mc_magdown_11453001_bd_jpsix_ee_eq_jpsiinacc_Hlt2RD_BuToKpEE_MVA_09bb3e4e6b.root
      - /home/acampove/external_ssd/Data/main/v10/mc_magdown_11453001_bd_jpsix_ee_eq_jpsiinacc_Hlt2RD_BuToKpEE_MVA_1eba31592f.root
friends:
  mva:
    trees:
      - DecayTree
    files:
      - /home/acampove/external_ssd/Data/mva/v5/mc_magdown_11453001_bd_jpsix_ee_eq_jpsiinacc_Hlt2RD_BuToKpEE_MVA_032a6393c1.root
      - /home/acampove/external_ssd/Data/mva/v5/mc_magdown_11453001_bd_jpsix_ee_eq_jpsiinacc_Hlt2RD_BuToKpEE_MVA_0718aef4bd.root
      - /home/acampove/external_ssd/Data/mva/v5/mc_magdown_11453001_bd_jpsix_ee_eq_jpsiinacc_Hlt2RD_BuToKpEE_MVA_09bb3e4e6b.root
      - /home/acampove/external_ssd/Data/mva/v5/mc_magdown_11453001_bd_jpsix_ee_eq_jpsiinacc_Hlt2RD_BuToKpEE_MVA_1eba31592f.root
  brem_track_2:
    trees:
      - DecayTree
    files:
      - /home/acampove/external_ssd/Data/brem_track_2/v3/mc_magdown_11453001_bd_jpsix_ee_eq_jpsiinacc_Hlt2RD_BuToKpEE_MVA_032a6393c1.root
      - /home/acampove/external_ssd/Data/brem_track_2/v3/mc_magdown_11453001_bd_jpsix_ee_eq_jpsiinacc_Hlt2RD_BuToKpEE_MVA_0718aef4bd.root

There you can see:

  1. I have a main tree and 2 friends.
  2. All the trees have the same name.
  3. All the files have the same names; the only difference is the directory where they reside.
  4. They are sorted in the same way.

The bug was that the brem_track_2 sample has missing files. What likely happened is that the entries from this friend tree were padded with the last available entry, which showed up in my plots as a spike. I have added a check in my code to prevent this from happening.

On your side, you might also want to enforce a simple, uniform structure like the one above and implement checks that raise exceptions when:

  1. The number of files is different.
  2. The names of the trees are different.
  3. The order of the files is not the same.
  4. Corresponding files have different numbers of entries.

But of course, that's up to you. Without those checks on your side, I will have to do them on mine (see the sketch below). I tried using a primary key in the past, but I saw crashes that unfortunately I do not have time to investigate further.
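For reference, a minimal sketch of such a client-side check, assuming the file lists have already been parsed out of a config like the YAML above (the validate_samples helper is hypothetical, not part of ROOT):

import os
import ROOT

def validate_samples(samples, tree_name='DecayTree'):
    # samples maps a sample name ('main', 'mva', ...) to its ordered list of file paths
    counts = {name: len(files) for name, files in samples.items()}
    if len(set(counts.values())) != 1:
        raise ValueError(f'Different number of files per sample: {counts}')

    for paths in zip(*samples.values()):
        # corresponding files must carry the same base name (same ordering)...
        if len({os.path.basename(p) for p in paths}) != 1:
            raise ValueError(f'File order mismatch: {paths}')
        # ...and the shared tree must exist and have the same number of entries
        entries = set()
        for path in paths:
            f = ROOT.TFile.Open(path)
            tree = f.Get(tree_name)
            if not tree:
                raise ValueError(f'Tree {tree_name} missing in {path}')
            entries.add(tree.GetEntries())
            f.Close()
        if len(entries) != 1:
            raise ValueError(f'Entry count mismatch among {paths}')

Note that the entry-count part opens every file, which is exactly the runtime cost discussed further down in this thread.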

Cheers.

acampove, May 27 '25 09:05

Maybe related: https://root-forum.cern.ch/t/rdataframe-how-to-use-sample-meta-information-in-definepersample/63116/2?u=ferhue

ferdymercury, May 27 '25 09:05

Dear @acampove ,

Thanks for reaching out! You can always get the file the current entry belongs to via DefinePerSample; in particular, if you look at RSampleInfo, you can use the AsString() method to get the full sample path, i.e. file_name/path_to_tree, for each sample. A sketch follows below.
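(A minimal sketch of that approach, under the assumptions used earlier in this thread: spec.json and the branch a are placeholders.)

import ROOT

df = ROOT.RDF.Experimental.FromSpec('spec.json')  # placeholder spec

# rdfsampleinfo_ is the RSampleInfo object that DefinePerSample exposes;
# AsString() yields the 'file_name/path_to_tree' identifier of the current sample
df = df.DefinePerSample('sample_id', 'rdfsampleinfo_.AsString()')

# collect the distinct origins of the faulty entries
bad_ids = df.Filter('a == -999').Take['std::string']('sample_id')
print(set(bad_ids.GetValue()))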

Now, to add more information to your latest comment:

On your side, you might also want to enforce a simple, uniform structure like the one above and implement checks that raise exceptions when:

  1. The number of files is different.
  2. The names of the trees are different.
  3. The order of the files is not the same.
  4. Corresponding files have different numbers of entries.

Points 1, 2 and 3 actually describe perfectly valid use cases, so raising exceptions for them would invalidate workflows from CMS, ATLAS and who knows how many other users from LHC experiments or beyond. Since we have to support everyone, this cannot be done unconditionally. On the other hand, one could think about supporting this behaviour as an opt-in, which would require an option either in the specification itself or in the FromSpec method. As usual, we are open to suggestions from you, as from any other user, that could benefit the community.

Point 4 would involve a huge runtime penalty: imagine having tens of friends, each with thousands of files. Ensuring that every N-tuple of corresponding files in the specification has the same number of entries would require a full, sequential traversal of the dataset. Coupled with remote reads, this could easily take tens of minutes even before the real analysis starts. Thus, here too the behaviour could only ever be opt-in, for users who strictly require it and accept the huge runtime penalty.

vepadulano, May 29 '25 16:05