HDFDataset startup for huge dataset is slow
It takes around 20 mins (I did not measure it precisely yet).
The HDF file is 38 GB and has 40M seqs with 4.613B frames.
Via `dump-dataset.py`:
```
Returnn dump-dataset starting up.
RETURNN starting up, version 1.20250419.000437+git.9b2d2298, date/time 2025-04-19-12-49-12 (UTC+0200), pid 188254, cwd /rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31, Python /work/az668407/py-envs/py3.12-torch2.4/bin/python3
Hostname: login23-3.hpc.itc.rwth-aachen.de
Installed native_signal_handler.so.
Dataset:
input: 10240 x 1
output: {'enc_seq_lens': (1, 1), 'log_probs': (1, 1), 'output_k_lob_probs': (5, 2), 'sizes': (1, 1), 'data': [10240, 1]}
HDFDataset, sequences: 40310479, frames: 4613309655
Epoch: 1
Reinitialize dataset seq order for epoch 1.
Dataset keys: ['data', 'enc_seq_lens', 'log_probs', 'output_k_lob_probs', 'sizes']
Dataset target keys: ['enc_seq_lens', 'log_probs', 'output_k_lob_probs', 'sizes']
Dataset labels: 'enc_seq_lens': ['dummy-label']... len 1, 'log_probs': ['dummy-label']... len 1, 'output_k_lob_probs': ['dummy-label']... len 1, 'sizes': ['dummy-label']... len 1, 'data': ['</s>', '<s>', '<unk>']... len 10240
Dump to stdout
...
```
Py-spy flamegraph:
Line 137 in `add_file` in hdf.py is this line:

```python
seq_lengths = fin[attr_seqLengths][...]  # shape (num_seqs, num_target_keys + 1)
```
So this takes most of the time (99% or so?). It is the main read from the HDF file inside the `add_file` function, so it is plausible that it dominates, but I would not have expected it to take this long. This read can also be timed in isolation, as sketched below.
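A minimal sketch to time just that read outside of RETURNN (`filename` here is a placeholder):

```python
import time
import h5py

filename = "my-dataset.hdf"  # placeholder: path to the HDF file under test

start = time.time()
with h5py.File(filename, "r") as fin:
    # Same full read as in add_file: load the whole seqLengths array at once.
    seq_lengths = fin["seqLengths"][...]
print("Read seqLengths with shape %s in %.1f secs" % (seq_lengths.shape, time.time() - start))
```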
Note that py-spy gave a few warnings like these:
```
py-spy> 1.05s behind in sampling, results may be inaccurate. Try reducing the sampling rate
py-spy> 1.33s behind in sampling, results may be inaccurate. Try reducing the sampling rate
py-spy> 1.05s behind in sampling, results may be inaccurate. Try reducing the sampling rate
py-spy> 1.38s behind in sampling, results may be inaccurate. Try reducing the sampling rate
py-spy> 1.96s behind in sampling, results may be inaccurate. Try reducing the sampling rate
```
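For reference, a flamegraph like the one above can be recorded by attaching py-spy to the running process (pid taken from the log above); the exact invocation here is my assumption, and lowering `--rate` should address those warnings:

```sh
# Attach to the running dump-dataset process and record a flamegraph.
# A lower sampling rate (default is 100/s) avoids the "behind in sampling" warnings.
py-spy record -o flamegraph.svg --pid 188254 --rate 25
```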
In any case, we should try to make this faster.
For comparison, LmDataset startup on a similarly sized dataset takes about 1 minute.
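One experiment that might be worth trying (an untested sketch, not a confirmed fix): reading `seqLengths` in large consecutive slices instead of one single `[...]` read, in case the single huge read is what behaves badly here:

```python
import numpy
import h5py

def read_seq_lengths_chunked(filename, step=1_000_000):
    """Untested sketch: read seqLengths in large consecutive slices
    instead of a single fin[...] read, in case the access pattern
    matters on the file system in question."""
    with h5py.File(filename, "r") as fin:
        ds = fin["seqLengths"]
        out = numpy.empty(ds.shape, dtype=ds.dtype)
        for start in range(0, ds.shape[0], step):
            out[start : start + step] = ds[start : start + step]
    return out
```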
(cc @dorian-K @patrick-wilken @NeoLegends) (slightly related is also #1669)
OK, one thing that seems to make a massive difference: whether the HDF file is on a network file system (here Lustre; I assume NFS behaves similarly) or on a local SSD.
Testing `h5py.File(filename, "r")["seqLengths"][...]`:
- SSD: About 1.8 secs.
- Network: About 20 mins!
And copying the file from the network file system to local disk first (`use_cache_manager=True`) takes only about 1 min. So this basically seems to be the solution.
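A minimal sketch of how this looks in a RETURNN dataset config (the file path is a placeholder):

```python
train = {
    "class": "HDFDataset",
    "files": ["/path/on/network/fs/data.hdf"],  # placeholder path
    # Copy the file to local disk via the cache manager before opening it,
    # so the big seqLengths read hits the local SSD instead of Lustre.
    "use_cache_manager": True,
}
```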