kaldi-io-for-python

Only load small parts of a big file

Open HuangZiliAndy opened this issue 5 years ago • 2 comments

Hi, I want to load only small parts of a big ark file. Of course, it is possible to load the entire ark file and then select certain rows, but that is inefficient in both memory and time. Is it possible to read only small parts of the ark file (like np.load('/tmp/123.npy', mmap_mode='r'))? Thanks for your help!

HuangZiliAndy, Jun 07 '19 15:06

If you process the big ark file sequentially, maybe you can use a multiprocessing queue to build a streaming data pipeline.

xx205, Jul 01 '19 07:07

Hi, the 'kaldi' way is to dump/resave the 'ark' as 'ark,scp' with 'copy-feats'. Then, in Python, you can build a dict() from the 'scp' and use 'read_mat([2nd_col_from scp])' to read only the utterances that you need.
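A minimal sketch of that approach, under some assumptions: the file names (`big.ark`, `feats.ark`, `feats.scp`) and the helper names `load_scp` and `read_selected` are hypothetical, and each scp line is assumed to have the usual `utt_id path/to/feats.ark:offset` form. Only the second column is passed to `kaldi_io.read_mat`, so only the requested matrices are read from disk.

```python
# Assumed preparation step (run once in the shell; file names are hypothetical):
#   copy-feats ark:big.ark ark,scp:feats.ark,feats.scp


def load_scp(scp_path):
    """Return {utterance_id: rxspecifier}, e.g. {'utt1': 'feats.ark:1234'}."""
    table = {}
    with open(scp_path) as scp:
        for line in scp:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # First column is the utterance id, the rest is 'file:offset'.
            utt_id, rxspecifier = line.split(None, 1)
            table[utt_id] = rxspecifier
    return table


def read_selected(scp_path, wanted_ids):
    """Read only the requested utterances; returns {utterance_id: matrix}."""
    import kaldi_io  # imported here so load_scp works without kaldi_io installed

    table = load_scp(scp_path)
    return {utt_id: kaldi_io.read_mat(table[utt_id]) for utt_id in wanted_ids}
```

With this, something like `read_selected('feats.scp', ['utt1', 'utt7'])` would load two matrices without reading the rest of the ark. The utterance ids here are placeholders.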

Does this solve your problem? All the best, Karel

KarelVesely84, Feb 25 '20 18:02