DataLoader feature requests
Pygama now has a DataLoader that has been extremely helpful in accessing data across multiple runs and different tiers. As this tool has gained more users, people have come up with ideas for new features that would be useful for a variety of analysis tasks. This issue will be used to compile those ideas: as you have new ones, post them here and I will try to add them to this top-level post. If someone is interested in implementing one or more of these, I can also put your name next to it. We can discuss implementation specifics in the comments below.
- [x] Optional argument to append or overwrite when adding data selection with setters (Grace; #493)
- [x] Get data as iterators (currently implemented in two different ways; need to settle on how best to do this) (Ian/Luigi)
- [ ] Create template waveforms over a selection
- [ ] Randomly sample waveforms within a selection
- [ ] ~~Access to metadata and analysis parameters~~
- [ ] "Lazy" construction of the entry list for use with iterators and waveform browsers
- [x] Check performance vs loading files on their own; implement optimizations if possible -> #521
- [ ] Add clearer error messages
I also think that, since many new features are coming, we should think about how to refactor the code to make it more readable. At the moment it's very hard for new people to understand the code and contribute (`data_loader.py` is huge!).
We should also evaluate the performance and make sure that using the data loader does not add significant overhead compared to manually loading the data from a known list of files (with and without filesystem caching).
Another thing I'll add to the list is clearer error messages, along with sanity checks while data is loading (again with clear error messages).
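To make that concrete, a sanity check of this kind could validate the requested fields against what a file actually contains before loading anything. The helper below is a hypothetical sketch, not existing pygama code:

```python
# Hypothetical sketch: validate requested fields before loading, so the user
# gets a descriptive error instead of a silently empty or nonsensical result.
def check_fields_exist(requested: list, available: list, file: str) -> None:
    missing = set(requested) - set(available)
    if missing:
        raise KeyError(
            f"field(s) {sorted(missing)} not found in {file}; "
            f"available fields are {sorted(available)}"
        )
```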
Also: `.load()` currently fails if the provided entry list (a Pandas DataFrame) is not properly indexed (e.g. if it has been filtered beforehand). This can be fixed on the user side by calling `.reset_index(drop=True)` on the entry list, but it should be done internally. Alternatively, the code should be made independent of the DataFrame indices.
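For reference, a minimal sketch of the user-side workaround and the internal fix it suggests (the filter condition and the `loader.load(entry_list)` call are illustrative, not a prescription):

```python
# user-side workaround: re-index a previously filtered entry list before .load()
entry_list = entry_list[entry_list["trapEftp"] > 10000]  # filtering leaves gaps in the index
entry_list = entry_list.reset_index(drop=True)           # restore a clean 0..N-1 RangeIndex
data = loader.load(entry_list)

# internal alternative: .load() could normalize the index itself
# (entry_list = entry_list.reset_index(drop=True)), or iterate
# positionally (df.iloc / df.itertuples) so the index never matters
```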
For accessing analysis parameters such as the filter parameters or calibration results, do we want that to be handled by the DataLoader or by pylegendmeta?
I would argue that we should try to avoid having LEGEND-specific stuff, such as pylegendmeta, in the DataLoader.
I agree. I ask because, if that's the case, we should remove or rethink the fifth bullet on the list, since it goes beyond the scope of what the data loader should be doing.
Hi all! I spent a couple of days trying to use the data loader to load the SiPM data, and I want to report some feedback (I hope this can be useful for future improvements).
- One general problem (also for the germanium data) is that if you try to load, or cut on, a variable that doesn't exist, you don't get any error message. This is a problem especially when applying cuts: you can't tell from the output whether it is empty because the cuts removed everything or because you asked for something that doesn't exist. (I see there is already a similar point on the list.)
- Similarly, if you try to load channels that are not in the files. For example, I was trying to load some SiPM channels that are not in the hit files (Patrick has to fix a metadata issue here), and instead of getting an error I was getting values in the arrays (`energy_in_pe` and `trigger_pos`) that didn't make any sense (e.g. all NaN in the energy and nonsensical numbers in the `trigger_pos` of the same event).
- Last, working with the data loader on the SiPM data turned out to be quite complicated, because pandas does not handle lists in cells very well (here I am not an expert at all, so I am not sure how much is my fault and how much could be made easier). Also, cuts on these variables cannot be set with `set_cuts` (or I did not find a way to); see the sketch right after this list for a possible user-side workaround.
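As a possible user-side workaround (not a DataLoader feature; the column names are taken from the report above, the values are made up), pandas can flatten list-valued cells with `explode` before cutting:

```python
import pandas as pd

# toy frame with list-valued cells: one list of SiPM pulses per event
df = pd.DataFrame({
    "energy_in_pe": [[1.2, 0.4], [5.0], [0.1, 0.2, 0.3]],
    "trigger_pos":  [[10, 52],   [33],  [7, 19, 41]],
})

# explode both columns together so each pulse becomes its own row,
# keeping the original event index for bookkeeping
flat = df.explode(["energy_in_pe", "trigger_pos"])

# ordinary scalar cuts now work
cut = flat[flat["energy_in_pe"].astype(float) > 1.0]
print(cut)
```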
At the CD review today, one of the reviewers (Ren Cooper) recommended we look into "FastQuery" -- basically fast data loading with selections for HDF5 files: https://dav.lbl.gov/archive/Events/SC05/HDF5FastQuery/index.html
Another suggestion from Chris Jillings: https://www.dcache.org
I've been messing around a bit with the DataLoader and `view_as`, and I think they will work really well together:

```python
it = loader.load_iterator()
for tb, i, n in it:
    print(tb.view_as("pd")[:n])
```
produces an output of:

```
          trapEftp
0     14890.890625
1     16785.869141
2     13618.189453
3     11785.408203
4     16502.757812
...            ...
3195  18352.054688
3196  10460.607422
3197  11076.259766
3198  14756.632812
3199  17032.175781

[3200 rows x 1 columns]
          trapEftp
0     13427.619141
1     15362.182617
2     17098.033203
3     15356.582031
4     16032.803711
...            ...
3195  10453.760742
3196  14521.578125
3197  14786.087891
3198  13473.371094
3199  16450.130859
...
```
This would make an iterative pandas analysis really easy (or awkward, or numpy, or whatever we expand `view_as` to support in the future).
I would propose that we keep `load_iterator` as an internal thing (and perhaps rename it to `_load_iterator` to hide it), and implement `next` to wrap `load_iterator` and use `view_as` to loop through the data however you want. This could look like:

```python
for df in loader.next("pd", options):
    # analyze the dataframe
    ...
```
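A minimal sketch of what such a wrapper could look like, assuming `load_iterator` keeps yielding `(table, entry, n_rows)` tuples as in the example above (the method name `next` and the generator form are only the proposal, not existing API):

```python
def next(self, fmt="pd", **options):
    """Yield each chunk of loaded data as the requested view (sketch)."""
    for tb, entry, n_rows in self.load_iterator(**options):
        # slice to n_rows so the final, partially filled chunk
        # does not repeat stale rows from the previous iteration
        yield tb.view_as(fmt)[:n_rows]
```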
One issue is that the last chunk is not going to have the same size as the others. In the example above, I solve this by taking a slice of the view. This might not be feasible with every `view_as` we could imagine: if we did `view_as` for histograms, for example, we wouldn't be able to slice it this way. I can think of a few possible solutions:
- Do it as above and restrict `view_as` and `next` to data structures that are truly just views
- Give `view_as` arguments to control the range, so in the loop we do `view_as("pd", start=0, stop=n)` or something like that (sketched below)
- Have the iterator resize the chunked tables; as implemented now, this would require reallocating the memory in the LGDO objects, unless we added the ability to have a capacity different from the size, as in C++ vectors (this already exists for `VectorOfVectors`). This is somewhat high effort, though.
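For the second option, the range control could reduce to something like the following, assuming a table that maps column names to flat, numpy-backed arrays (`.nda`); `start`/`stop` are the proposed arguments, not existing ones:

```python
def view_as(self, fmt, start=None, stop=None):
    """Return a view of rows [start, stop) in the requested format (sketch)."""
    import pandas as pd

    sl = slice(start, stop)
    if fmt == "pd":
        # build each column from the sliced numpy buffers
        return pd.DataFrame({name: col.nda[sl] for name, col in self.items()})
    raise ValueError(f"format not supported in this sketch: {fmt}")
```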
Another issue is that the entry list still has to be fully built before we can call `load_iterator`, which takes quite a while (and would be abysmally slow, and probably run into memory problems, if we wanted to load, say, all calibrations in a partition). My proposal here would be to implement a lazy entry list that computes maybe one file at a time. If we gave it a list-like interface, it could still work very well with the `LH5Iterator`. This would also be a much more efficient way of dealing with file caching, I think.
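To make the idea concrete, here is a minimal sketch of such a lazy, list-like entry list; the per-file builder `build_entries_for_file` is hypothetical and stands in for whatever the loader does today to evaluate cuts on a single file:

```python
import bisect

class LazyEntryList:
    """List-like entry list that evaluates cuts one file at a time (sketch)."""

    def __init__(self, files, build_entries_for_file):
        self.files = files
        self.build = build_entries_for_file  # file -> list of selected entries
        self.cache = {}    # file index -> entries, filled on demand
        self.offsets = [0]  # cumulative entry counts of files built so far

    def _build_next_file(self):
        # evaluate cuts on the next unbuilt file and cache the result
        j = len(self.offsets) - 1
        self.cache[j] = self.build(self.files[j])
        self.offsets.append(self.offsets[-1] + len(self.cache[j]))

    def __getitem__(self, i):
        # build files lazily, in order, until global entry i is covered
        while self.offsets[-1] <= i and len(self.offsets) - 1 < len(self.files):
            self._build_next_file()
        if i >= self.offsets[-1]:
            raise IndexError(i)
        j = bisect.bisect_right(self.offsets, i) - 1
        return self.cache[j][i - self.offsets[j]]
```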