kaldi-io-for-python
Only load small parts of a big file
Hi, my situation is that I want to load small parts of a big ark file. Of course, it is possible to load the entire ark file and then select certain rows, but that is neither memory- nor time-efficient. Is it possible to read only small parts of the ark file (like np.load('/tmp/123.npy', mmap_mode='r'))? Thanks for your help!
Maybe you can use a multiprocessing queue to build a streaming data pipeline if you process the big ark file in sequence.
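A rough sketch of that idea, in case it helps. It assumes the archive is read in order with kaldi_io.read_mat_ark; the helper names (produce, consume, stream_ark) and the queue size are made up for illustration and are not part of kaldi_io.

```python
import multiprocessing as mp

_DONE = None  # sentinel that marks the end of the stream


def produce(q, items):
    """Push every (key, matrix) pair onto the queue, then the sentinel."""
    for item in items:
        q.put(item)  # blocks while the queue is full, which bounds memory use
    q.put(_DONE)


def consume(q):
    """Yield items from the queue until the sentinel arrives."""
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item


def _produce_from_ark(q, ark_path):
    # Imported lazily so that only the reader process needs kaldi_io.
    import kaldi_io
    produce(q, kaldi_io.read_mat_ark(ark_path))


def stream_ark(ark_path, maxsize=8):
    """Hypothetical helper: read the ark in a background process and
    yield (key, matrix) pairs one at a time."""
    q = mp.Queue(maxsize=maxsize)
    worker = mp.Process(target=_produce_from_ark, args=(q, ark_path), daemon=True)
    worker.start()
    yield from consume(q)
    worker.join()
```

Note that this still scans the whole file from start to finish. It only keeps at most maxsize matrices in memory at once, so it suits sequential processing rather than random access to a few utterances.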
Hi, the 'kaldi' way is to dump/resave the 'ark' as 'ark,scp' with 'copy-feats'. Then, in Python, you can build a dict() from the 'scp' and read only the utterances that you need with 'read_mat([2nd_col_from_scp])'.
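Something along these lines should work as a minimal sketch. The file names and the helpers load_scp and read_utterances are placeholders of mine, not kaldi_io functions; the only kaldi_io call used is read_mat, which accepts the 'path.ark:offset' string from the second column of the scp.

```python
# Assumed one-off preparation step on the shell (file names are placeholders):
#   copy-feats ark:big.ark ark,scp:big_copy.ark,big.scp


def load_scp(scp_path):
    """Map each utterance id to its 'path.ark:offset' entry from the scp."""
    table = {}
    with open(scp_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            key, rxfile = line.split(None, 1)
            table[key] = rxfile
    return table


def read_utterances(scp_path, wanted, read_mat=None):
    """Read only the requested utterances; the rest of the ark is never loaded."""
    if read_mat is None:
        import kaldi_io  # imported lazily; only needed for real reads
        read_mat = kaldi_io.read_mat
    table = load_scp(scp_path)
    return {key: read_mat(table[key]) for key in wanted}
```

For example, read_utterances('big.scp', ['utt_0001', 'utt_0042']) would seek straight to those two matrices, where the utterance ids are again just placeholders.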
Does this solve your problem? All the best, Karel