pygama DataLoader performance is bad?

At the analysis workshop, it was argued that the slowness of the DataLoader was only due to the IO speed. When I loaded the data with "low-level routines", I never felt it was that slow. To test this quantitatively, I wrote a quick script to compare low-level loading with DataLoader loading.

The script is:

import lgdo.lh5_store as store
from legendmeta import LegendMetadata
import re
import glob
import numpy as np
import time
from pygama.flow import DataLoader

def low_level_load(period="p03",run="r000",ndets=10):
    start = time.time()
    meta_path = "/data2/public/prodenv/prod-blind/ref/v01.06/inputs"
    f_hit = sorted(glob.glob("/data2/public/prodenv/prod-blind/ref/v01.06/generated/tier/hit/phy/"+period+"/"+run+"/*.lh5"))
    f_tcm = [e.replace("hit", "tcm") for e in f_hit]
    dt_files = time.time() - start
    print(f"time to located files: \t {dt_files:.3f} s")

    lmeta = LegendMetadata(path=meta_path)
    chmap = lmeta.channelmap(re.search(r"\d{8}T\d{6}Z", f_hit[0]).group(0))
    geds = list(chmap.map("system", unique=False).geds.map("daq.rawid").keys())[:ndets]
    dt_meta = time.time() - start - dt_files
    print(f"time to load meta: \t {dt_meta:.3f} s")

    # load TCM data to define an event
    nda = store.load_nda(f_tcm, ["array_id", "array_idx","cumulative_length"], "hardware_tcm_1/")
    clt = nda["cumulative_length"]
    split = clt[np.diff(clt,append=[clt[-1]])<0]
    ids = np.split(nda["array_id"],np.cumsum(split))
    idx = np.split(nda["array_idx"],np.cumsum(split))
    dt_tcm = time.time() -start - dt_meta
    print(f"time to load tcm: \t {dt_tcm:.3f} s")

    for ch in geds:
        idx_ch = [idx[i][ids[i] == ch] for i in range(len(idx))]
        nda = store.load_nda(f_hit,['cuspEmax_ctc_cal','AoE_Classifier'],f"ch{ch}/hit/",idx_list=idx_ch)
        mask = nda['cuspEmax_ctc_cal'] > 25
        nda['AoE_Classifier'][mask]

    dt_end = time.time() - start -dt_tcm
    print(f"time to load data: \t {dt_end:.3f} s")


def data_loader_load(period="p03",run="r000",ndets=10):
    start = time.time()
    prodenv = "/data2/public/prodenv"
    dl = DataLoader(f"{prodenv}/prod-blind/ref/v01.06[setups/l200/dataloader]")
    file_query = f"period == '{period}' and run == '{run}' and datatype == 'phy'"
    dl.reset()
    dl.data_dir = prodenv
    dl.set_files(file_query)
    dt_files = time.time() - start
    print(f"time to located files: \t {dt_files:.3f} s")

    first_key = dl.get_file_list().iloc[0].timestamp
    lmeta = LegendMetadata(f"{prodenv}/prod-blind/ref/v01.06/inputs")
    chmap = lmeta.channelmap(on=first_key)  # get the channel map
    geds = list(chmap.map("system", unique=False).geds.map("daq.rawid").keys())[:ndets]
    dl.set_datastreams(geds, "ch")
    dt_meta = time.time() - start - dt_files
    print(f"time to load meta: \t {dt_meta:.3f} s")

    dl.set_cuts({"hit": "trapEmax_ctc_cal > 25"})
    dl.set_output(columns=["AoE_Classifier"], fmt="lgdo.Table")
    geds_el = dl.build_entry_list(tcm_level="tcm")
    dt_tcm = time.time() -start - dt_meta
    print(f"time to load tcm: \t {dt_tcm:.3f} s")

    dl.load(geds_el)
    dt_end = time.time() - start - dt_tcm
    print(f"time to load data: \t {dt_end:.3f} s")

if __name__ == "__main__":
    print("Try low level:")
    low_level_load()
    print("Try Data loader:")
    data_loader_load()

First, I would ask you if you can find any unfair treatments in one or the other routine. Booth routines should:

Use data from period 03 run 000 for 10 detectors
Set a cut on the energy > 25 keV
Load the AoE values for the 10 geds where the cut is valid according to the TCM

The result on LNGS of the script is:

> python speed_test.py 
Try low level:
time to located files:   0.003 s
time to load meta:       0.931 s
time to load tcm:        1.605 s
time to load data:       10.845 s
Try Data loader:
time to located files:   0.918 s
time to load meta:       0.763 s
time to load tcm:        172.463 s
time to load data:       7.972 s

Oct 27 '23 17:10 patgo25

what is going on with the tcm loading!

okay thanks for testing, will check this asap.

Oct 30 '23 15:10 jasondet

It looks like this performance issue could be related to these: https://forum.hdfgroup.org/t/performance-reading-data-with-non-contiguous-selection/8979 and https://github.com/h5py/h5py/issues/1597

We're going to try the same workaround in read_object in order to speed up calls when an idx is provided.

Nov 01 '23 18:11 SamuelBorden

Where do we do indexed reading in the data loader other than in .load(), when an entry list is used?

Nov 02 '23 14:11 gipert

the idx read appears to have been a red herring. I found a factor of ~3 speed up for that tcm read step (really: build entry list) in data_loader and another factor of ~2 in lgdo.table.get_dataframe. However the overall read is still a factor of ~3 slower than Patrick's low-level read:

Singularity> python kraus_speed_test.py
Try low level:
time to located files:   0.003 s
time to load meta:       1.050 s
time to load tcm:        1.632 s
time to load data:       9.426 s
Try Data loader:
time to located files:   0.726 s
time to load meta:       0.891 s
time to load tcm:        26.584 s
time to load data:       7.684 s

There is a lot of complexity in data_loader having to do with dealing with multi-level cuts and other generalizations, but I hope it can be sped up more with some refactoring. It looks like there is still a lot going on in the inner loops.

Nov 03 '23 07:11 jasondet

Nice! Performance of the LGDO conversion methods is also a topic for https://github.com/legend-exp/legend-pydataobj/pull/30 by @MoritzNeuberger.

Did anyone try profiling the code to spot straightforward bottlenecks?

Nov 03 '23 09:11 gipert

pygama pygama copied to clipboard

DataLoader performance is bad?

pygama
pygama copied to clipboard