HDFDataset startup for huge dataset is slow
It takes around 20 mins (I did not measure it precisely yet).
The HDF file is 38 GB and has 40M seqs with 4.613B frames.
Via `dump-dataset.py`:
```
Returnn dump-dataset starting up.
RETURNN starting up, version 1.20250419.000437+git.9b2d2298, date/time 2025-04-19-12-49-12 (UTC+0200), pid 188254, cwd /rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31, Python /work/az668407/py-envs/py3.12-torch2.4/bin/python3
Hostname: login23-3.hpc.itc.rwth-aachen.de
Installed native_signal_handler.so.
Dataset:
input: 10240 x 1
output: {'enc_seq_lens': (1, 1), 'log_probs': (1, 1), 'output_k_lob_probs': (5, 2), 'sizes': (1, 1), 'data': [10240, 1]}
HDFDataset, sequences: 40310479, frames: 4613309655
Epoch: 1
Reinitialize dataset seq order for epoch 1.
Dataset keys: ['data', 'enc_seq_lens', 'log_probs', 'output_k_lob_probs', 'sizes']
Dataset target keys: ['enc_seq_lens', 'log_probs', 'output_k_lob_probs', 'sizes']
Dataset labels: 'enc_seq_lens': ['dummy-label']... len 1, 'log_probs': ['dummy-label']... len 1, 'output_k_lob_probs': ['dummy-label']... len 1, 'sizes': ['dummy-label']... len 1, 'data': ['</s>', '<s>', '<unk>']... len 10240
Dump to stdout
...
```
Py-spy flamegraph:
Line 137 in `add_file` in hdf.py is this line:

```python
seq_lengths = fin[attr_seqLengths][...]  # shape (num_seqs, num_target_keys + 1)
```
So this takes most of the time (99% or so?). It is the main read from the HDF file inside the `add_file` function, so it is plausible that it dominates, but I would not have expected it to take this long. This read can also be timed in isolation, as sketched below.
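A minimal sketch to time just that read outside of RETURNN (`filename` here is a placeholder):

```python
import time
import h5py

filename = "my-dataset.hdf"  # placeholder: path to the HDF file under test

start = time.time()
with h5py.File(filename, "r") as fin:
    # Same full read as in add_file: load the whole seqLengths array at once.
    seq_lengths = fin["seqLengths"][...]
print("Read seqLengths with shape %s in %.1f secs" % (seq_lengths.shape, time.time() - start))
```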
Note that py-spy gave a few warnings like these:
```
py-spy> 1.05s behind in sampling, results may be inaccurate. Try reducing the sampling rate
py-spy> 1.33s behind in sampling, results may be inaccurate. Try reducing the sampling rate
py-spy> 1.05s behind in sampling, results may be inaccurate. Try reducing the sampling rate
py-spy> 1.38s behind in sampling, results may be inaccurate. Try reducing the sampling rate
py-spy> 1.96s behind in sampling, results may be inaccurate. Try reducing the sampling rate
```
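For reference, a flamegraph like the one above can be recorded by attaching py-spy to the running process (pid taken from the log above); the exact invocation here is my assumption, and lowering `--rate` should address those warnings:

```sh
# Attach to the running dump-dataset process and record a flamegraph.
# A lower sampling rate (default is 100/s) avoids the "behind in sampling" warnings.
py-spy record -o flamegraph.svg --pid 188254 --rate 25
```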
In any case, we should try to make this faster.
For comparison, LmDataset startup on a similarly sized dataset takes about 1 minute.
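One experiment that might be worth trying (an untested sketch, not a confirmed fix): reading `seqLengths` in large consecutive slices instead of one single `[...]` read, in case the single huge read is what behaves badly here:

```python
import numpy
import h5py

def read_seq_lengths_chunked(filename, step=1_000_000):
    """Untested sketch: read seqLengths in large consecutive slices
    instead of a single fin[...] read, in case the access pattern
    matters on the file system in question."""
    with h5py.File(filename, "r") as fin:
        ds = fin["seqLengths"]
        out = numpy.empty(ds.shape, dtype=ds.dtype)
        for start in range(0, ds.shape[0], step):
            out[start : start + step] = ds[start : start + step]
    return out
```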
(cc @dorian-K @patrick-wilken @NeoLegends) (slightly related is also #1669)
OK, one thing that seems to make a massive difference: whether the HDF file is on a network file system (here Lustre; I assume NFS behaves similarly) or on a local SSD.
Testing `h5py.File(filename, "r")["seqLengths"][...]`:
- SSD: About 1.8 secs.
- Network: About 20 mins!
And copying the file from the network file system to local disk first (`use_cache_manager=True`) takes only about 1 min. So this basically seems to be the solution.
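A minimal sketch of how this looks in a RETURNN dataset config (the file path is a placeholder):

```python
train = {
    "class": "HDFDataset",
    "files": ["/path/on/network/fs/data.hdf"],  # placeholder path
    # Copy the file to local disk via the cache manager before opening it,
    # so the big seqLengths read hits the local SSD instead of Lustre.
    "use_cache_manager": True,
}
```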