pygama
pygama copied to clipboard
DataLoader performance is bad?
At the analysis workshop, it was argued that the slowness of the DataLoader was only due to the IO speed. When I loaded the data with "low-level routines", I never felt it was that slow. To test this quantitatively, I wrote a quick script to compare low-level loading with DataLoader loading.
The script is:
import lgdo.lh5_store as store
from legendmeta import LegendMetadata
import re
import glob
import numpy as np
import time
from pygama.flow import DataLoader
def low_level_load(period="p03",run="r000",ndets=10):
start = time.time()
meta_path = "/data2/public/prodenv/prod-blind/ref/v01.06/inputs"
f_hit = sorted(glob.glob("/data2/public/prodenv/prod-blind/ref/v01.06/generated/tier/hit/phy/"+period+"/"+run+"/*.lh5"))
f_tcm = [e.replace("hit", "tcm") for e in f_hit]
dt_files = time.time() - start
print(f"time to located files: \t {dt_files:.3f} s")
lmeta = LegendMetadata(path=meta_path)
chmap = lmeta.channelmap(re.search(r"\d{8}T\d{6}Z", f_hit[0]).group(0))
geds = list(chmap.map("system", unique=False).geds.map("daq.rawid").keys())[:ndets]
dt_meta = time.time() - start - dt_files
print(f"time to load meta: \t {dt_meta:.3f} s")
# load TCM data to define an event
nda = store.load_nda(f_tcm, ["array_id", "array_idx","cumulative_length"], "hardware_tcm_1/")
clt = nda["cumulative_length"]
split = clt[np.diff(clt,append=[clt[-1]])<0]
ids = np.split(nda["array_id"],np.cumsum(split))
idx = np.split(nda["array_idx"],np.cumsum(split))
dt_tcm = time.time() -start - dt_meta
print(f"time to load tcm: \t {dt_tcm:.3f} s")
for ch in geds:
idx_ch = [idx[i][ids[i] == ch] for i in range(len(idx))]
nda = store.load_nda(f_hit,['cuspEmax_ctc_cal','AoE_Classifier'],f"ch{ch}/hit/",idx_list=idx_ch)
mask = nda['cuspEmax_ctc_cal'] > 25
nda['AoE_Classifier'][mask]
dt_end = time.time() - start -dt_tcm
print(f"time to load data: \t {dt_end:.3f} s")
def data_loader_load(period="p03",run="r000",ndets=10):
start = time.time()
prodenv = "/data2/public/prodenv"
dl = DataLoader(f"{prodenv}/prod-blind/ref/v01.06[setups/l200/dataloader]")
file_query = f"period == '{period}' and run == '{run}' and datatype == 'phy'"
dl.reset()
dl.data_dir = prodenv
dl.set_files(file_query)
dt_files = time.time() - start
print(f"time to located files: \t {dt_files:.3f} s")
first_key = dl.get_file_list().iloc[0].timestamp
lmeta = LegendMetadata(f"{prodenv}/prod-blind/ref/v01.06/inputs")
chmap = lmeta.channelmap(on=first_key) # get the channel map
geds = list(chmap.map("system", unique=False).geds.map("daq.rawid").keys())[:ndets]
dl.set_datastreams(geds, "ch")
dt_meta = time.time() - start - dt_files
print(f"time to load meta: \t {dt_meta:.3f} s")
dl.set_cuts({"hit": "trapEmax_ctc_cal > 25"})
dl.set_output(columns=["AoE_Classifier"], fmt="lgdo.Table")
geds_el = dl.build_entry_list(tcm_level="tcm")
dt_tcm = time.time() -start - dt_meta
print(f"time to load tcm: \t {dt_tcm:.3f} s")
dl.load(geds_el)
dt_end = time.time() - start - dt_tcm
print(f"time to load data: \t {dt_end:.3f} s")
if __name__ == "__main__":
print("Try low level:")
low_level_load()
print("Try Data loader:")
data_loader_load()
First, I would ask you if you can find any unfair treatments in one or the other routine. Booth routines should:
- Use data from period 03 run 000 for 10 detectors
- Set a cut on the energy > 25 keV
- Load the AoE values for the 10 geds where the cut is valid according to the TCM
The result on LNGS of the script is:
> python speed_test.py
Try low level:
time to located files: 0.003 s
time to load meta: 0.931 s
time to load tcm: 1.605 s
time to load data: 10.845 s
Try Data loader:
time to located files: 0.918 s
time to load meta: 0.763 s
time to load tcm: 172.463 s
time to load data: 7.972 s
what is going on with the tcm loading!
okay thanks for testing, will check this asap.
It looks like this performance issue could be related to these: https://forum.hdfgroup.org/t/performance-reading-data-with-non-contiguous-selection/8979 and https://github.com/h5py/h5py/issues/1597
We're going to try the same workaround in read_object in order to speed up calls when an idx is provided.
Where do we do indexed reading in the data loader other than in .load(), when an entry list is used?
the idx read appears to have been a red herring. I found a factor of ~3 speed up for that tcm read step (really: build entry list) in data_loader and another factor of ~2 in lgdo.table.get_dataframe. However the overall read is still a factor of ~3 slower than Patrick's low-level read:
Singularity> python kraus_speed_test.py
Try low level:
time to located files: 0.003 s
time to load meta: 1.050 s
time to load tcm: 1.632 s
time to load data: 9.426 s
Try Data loader:
time to located files: 0.726 s
time to load meta: 0.891 s
time to load tcm: 26.584 s
time to load data: 7.684 s
There is a lot of complexity in data_loader having to do with dealing with multi-level cuts and other generalizations, but I hope it can be sped up more with some refactoring. It looks like there is still a lot going on in the inner loops.
Nice! Performance of the LGDO conversion methods is also a topic for https://github.com/legend-exp/legend-pydataobj/pull/30 by @MoritzNeuberger.
Did anyone try profiling the code to spot straightforward bottlenecks?