
DataLoader feature requests

Open iguinn opened this issue 1 year ago • 11 comments

Pygama now has a DataLoader that has been extremely helpful in accessing data across multiple runs and different tiers. As this tool has gained users, people have come up with ideas for new features that would be useful for a variety of analysis tasks. This issue will be used to compile those ideas: as you have new ones, post them here and I will try to add them to this top-level post. If someone is interested in implementing one of these, I can also put your name next to the item. We can also discuss implementation specifics in the comments below.

  • [x] Optional argument to append or overwrite when adding data selection with setters (Grace; #493)
  • [x] Get data as iterators (currently implemented in two different ways; need to settle on how to best do this) (Ian/Luigi)
  • [ ] Create template waveforms over a selection
  • [ ] Randomly sample waveforms within a selection (see the sketch after this list)
  • [ ] ~~Access to metadata and analysis parameters~~
  • [ ] "Lazy" construction of the entry list for use with iterators and waveform browsers
  • [x] Check performance vs loading files on their own; implement optimizations if possible -> #521
  • [ ] Add clearer error messages
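
As a sketch of how the waveform-sampling item could sit on top of the current API (given a configured DataLoader instance loader; the build_entry_list/load usage is an assumption based on this thread, and the numbers are made up):

entries = loader.build_entry_list()              # full entry list for the selection
sample = entries.sample(n=100, random_state=42)  # 100 random events (pandas)
wfs = loader.load(sample.reset_index(drop=True)) # re-index before loading (see below)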

iguinn avatar May 15 '23 20:05 iguinn

I also think that, since many new features are coming, we should think about how to refactor the code to make it more readable. At the moment it's very hard for new people to understand the code and contribute (data_loader.py is huge!).

gipert avatar May 16 '23 08:05 gipert

We should also evaluate the performance and make sure that using the data loader does not add significant overhead compared to manually loading the data from a known list of files (with and without filesystem caching).

gipert avatar May 16 '23 08:05 gipert

Another thing I'll add to the list is clearer error messages, along with sanity checks while data is loading.

iguinn avatar May 16 '23 16:05 iguinn

Also: .load() currently fails if the provided entry list (Pandas DataFrame) is not properly indexed (e.g. if it has been previously filtered). This can be fixed on the user side by calling .reset_index(drop=True) on the entry list, but it should be done internally.
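
For reference, the user-side workaround looks something like this (a sketch; build_entry_list is assumed to be the entry-list builder, and the cut on trapEftp is only an example):

entries = loader.build_entry_list()             # assumed entry-list builder
entries = entries[entries["trapEftp"] > 10000]  # filtering breaks the RangeIndex
entries = entries.reset_index(drop=True)        # restore a clean 0..N-1 index
data = loader.load(entries)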

Alternatively, the code should be made independent of the DataFrame indices.

gipert avatar May 17 '23 12:05 gipert

For accessing analysis parameters such as the filter parameters or calibration results, do we want that to be handled by the DataLoader or by pylegendmeta?

erin717 avatar May 17 '23 17:05 erin717

I would argue that we should try to avoid having LEGEND-specific stuff in the DataLoader, such as pylegendmeta.

gipert avatar May 18 '23 09:05 gipert

I agree. I ask because, if that's the case, we should remove or rethink the fifth bullet on the list, as it goes beyond the scope of what the data loader should be doing.

erin717 avatar May 18 '23 14:05 erin717

Hi all! I spent a couple of days trying to use the data loader to load the SiPM data, and I want to report some feedback (I hope this can be useful for future improvements).

  • One general problem (also for the germanium data) is that if you try to load, or cut on, a variable that doesn't exist, you don't get any error message. This is a problem especially when applying cuts: you can't tell from the output whether it is empty because you cut everything away or because you asked for something that doesn't exist. (But I see there is already a similar point on the list.)
  • Similarly, there is no error if you try to load channels that are not in the files. For example, I was trying to load some SiPM channels that are not in the hit files (Patrick has to fix an issue with the metadata here), and instead of getting an error I got arrays (energy_in_pe and trigger_pos) that didn't make any sense (e.g. all NaN in the energy and garbage numbers in the trigger_pos of the same event).
  • Last, working with the data loader for the SiPM data turned out to be quite complicated, because pandas does not handle lists in cells very robustly (but I am not an expert here, so I am not sure how much of this is my fault or how much it could be made easier). Also, cuts on these variables cannot be set with set_cuts (or I did not find a way to); a possible workaround is sketched after this list.
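
As a stopgap for the last point, a row-wise predicate can express cuts on list-valued columns that set_cuts cannot (a sketch; the data here is made up for illustration):

import pandas as pd

df = pd.DataFrame({"energy_in_pe": [[1.2, 0.5], [3.4], [0.1, 0.2]]})

# keep events where any SiPM pulse exceeds 1 photoelectron
mask = df["energy_in_pe"].apply(lambda pes: any(e > 1.0 for e in pes))
print(df[mask])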

bossioel avatar Jul 17 '23 16:07 bossioel

At the CD review today, one of the reviewers (Ren Cooper) recommended we look into "FastQuery" -- basically fast data loading with selections for HDF5 files: https://dav.lbl.gov/archive/Events/SC05/HDF5FastQuery/index.html

jasondet avatar Oct 25 '23 17:10 jasondet

Another suggestion from Chris Jillings: https://www.dcache.org

jasondet avatar Dec 12 '23 17:12 jasondet

I've been messing around a bit with the DataLoader and view_as, and I think they will work really well together:

it = loader.load_iterator()      # iterator over the current selection
for tb, i, n in it:              # table chunk, chunk start entry, rows filled
    print(tb.view_as("pd")[:n])  # trim unfilled rows in the final chunk

produces an output of:

          trapEftp
0     14890.890625
1     16785.869141
2     13618.189453
3     11785.408203
4     16502.757812
...            ...
3195  18352.054688
3196  10460.607422
3197  11076.259766
3198  14756.632812
3199  17032.175781

[3200 rows x 1 columns]
          trapEftp
0     13427.619141
1     15362.182617
2     17098.033203
3     15356.582031
4     16032.803711
...            ...
3195  10453.760742
3196  14521.578125
3197  14786.087891
3198  13473.371094
3199  16450.130859
...

This would make an iterative pandas analysis really easy (or Awkward Array or NumPy or whatever we expand view_as to support in the future).
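
To make that concrete, here is the sort of chunked analysis this enables (a sketch; the binning is arbitrary):

import numpy as np

# accumulate an energy histogram chunk by chunk, never holding all events
bins = np.linspace(0, 20000, 201)
hist = np.zeros(len(bins) - 1)
for tb, i, n in loader.load_iterator():
    chunk = tb.view_as("pd")[:n]  # trim the partially filled final chunk
    hist += np.histogram(chunk["trapEftp"], bins=bins)[0]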

I would propose that we keep load_iterator as an internal thing (and perhaps rename it to _load_iterator to hide it), and implement a next method that wraps load_iterator and uses view_as so you can loop through the data however you want. This could look like:

for df in loader.next("pd", options):
    ...  # analyze the dataframe chunk by chunk
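
One possible shape for that wrapper, as a generator method on DataLoader (a sketch; _load_iterator and the slice-the-last-chunk behavior are the assumptions discussed below):

def next(self, library="pd", **options):
    # drive the internal iterator and yield library-native views,
    # trimming the partially filled last chunk
    for tb, i, n in self._load_iterator(**options):
        yield tb.view_as(library)[:n]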

One issue is that the last entry is not going to have the same size as the others. In the above example, I solve this by making a slice of the view. This might not be feasible with every view_as we could imagine. For example, if we did view_as for histograms, we wouldn't be able to slice it in this way. I can think of a few possible solutions to this:

  1. Do it as above and restrict view_as and next to data structures that are truly just views
  2. Give view_as arguments to control the range, so in the loop we do view_as("pd", start=0, stop=n) or something like that
  3. Have the iterator resize the chunked tables; as implemented now, this would require reallocating the memory in the LGDO objects unless we added the ability to have a different capacity from size like in C++ vectors (this already exists for VectorOfVectors). This is somewhat high effort though.

Another issue is that the entry list still has to be built before we can call load_iterator, which takes quite a while (and would be abysmally slow, and probably run into memory problems, if we wanted to load, say, all calibrations in a partition). My proposal here would be to implement a lazy entry list that computes maybe one file at a time; a rough sketch is below. If we gave it a list-like interface, it could still work very well with the LH5Iterator. This would also be a much more efficient way of dealing with file caching, I think.
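
A rough sketch of what that lazy entry list could look like (entirely hypothetical; the per-file builder name is made up):

class LazyEntryList:
    # list-like entry list, built one file at a time on demand
    def __init__(self, loader, files):
        self.loader = loader
        self.files = files
        self._cache = {}

    def __len__(self):
        return len(self.files)

    def __getitem__(self, i):
        if i not in self._cache:
            # made-up per-file builder; a real one would reuse the
            # DataLoader's cut machinery on a single file
            self._cache[i] = self.loader.build_entry_list_for_file(self.files[i])
        return self._cache[i]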

iguinn avatar Jan 03 '24 21:01 iguinn