Piotr Żelasko
Uhhh sorry, this doc is too vague. I meant that using `executor_type=ProcessPoolExecutor` is incompatible with `DataLoader(num_workers > 0)`, because DataLoader's worker sub-processes cannot spawn their own sub-process pools (I think...
Thread pools are supposed to work just fine (though there's an upper bound on how much speedup you can gain with them).
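To sketch the distinction: inside a DataLoader worker you can create a thread pool, but not another process pool. The names below are illustrative, not Lhotse's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def load_one(item):
    # Stand-in for an I/O-bound task, e.g. reading an audio chunk from disk.
    return item * 2

def worker_collate(batch):
    # Inside a DataLoader worker process, creating a *thread* pool is safe;
    # a nested ProcessPoolExecutor generally is not, because daemonic
    # worker processes are not allowed to spawn children.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(load_one, batch))

print(worker_collate([1, 2, 3]))  # -> [2, 4, 6]
```

`pool.map` preserves input order, so the result matches the batch order even if tasks finish out of order.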
It's possible to leverage lazy cuts in Lhotse to reduce the memory overhead. You can use the following:
- creating cuts from recordings and supervisions: [`lhotse.cut.create_cut_set_lazy`](https://github.com/lhotse-speech/lhotse/blob/f1b66b8a8db2ea93e87dcb9db3991f6dd473b89d/lhotse/cut.py#L5063) (make sure to read...
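The core idea behind the lazy variant is to stream manifest items one at a time instead of materializing the whole set in memory. A minimal stdlib sketch of that pattern (not Lhotse's actual implementation) could look like:

```python
import json

def lazy_read_jsonl(path):
    # Yield one manifest item at a time instead of loading the whole
    # file into memory -- memory use stays flat regardless of how many
    # cuts the manifest contains.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Usage: iterate without ever holding the full manifest in RAM.
# for cut in lazy_read_jsonl("cuts.jsonl"):
#     process(cut)
```

This is why the lazy API writes/reads JSONL: a line-delimited format can be consumed incrementally, which a single monolithic JSON document cannot.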
That'd be the most likely explanation. There's a `num_jobs` argument that can help speed this up: https://github.com/lhotse-speech/lhotse/blob/f1b66b8a8db2ea93e87dcb9db3991f6dd473b89d/lhotse/kaldi.py#L60
> BTW - I don't think increasing the number of jobs would help in that scenario, because the speed will be disk-I/O limited. If Kaldi's `utils/data/get_reco2dur.sh` or `steps/make_mfcc.sh` gets faster...
> what is taking much more time w.r.t. Kaldi is the feature extraction after the import; I guess it's re-opening the same file each time a feature is...
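If repeated `open()` calls on the same file are indeed the bottleneck, caching the open handles is one possible mitigation. A minimal sketch using the stdlib's `lru_cache` (illustrative only, not what Lhotse actually does):

```python
from functools import lru_cache

@lru_cache(maxsize=32)
def get_handle(path):
    # Cache open file handles keyed by path, so repeated feature reads
    # from the same file don't pay the open() cost every time.
    return open(path, "rb")

def read_span(path, offset, size):
    # Read `size` bytes starting at `offset`, reusing a cached handle.
    f = get_handle(path)
    f.seek(offset)
    return f.read(size)
```

One caveat with this approach: cached handles are not thread-safe (concurrent `seek`/`read` on a shared handle can interleave), so a per-thread cache or a lock would be needed in a multithreaded reader.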
Interesting project! I'm open to replacing it, but it might be tough for me to find the time right now. If you guys need it, please make a PR.
Is this issue resolved, or is there anything else we can do?
You can pull the latest master, install Fangjun's native kaldiio library, set a higher number of jobs for the import script, and let us know if it's better then.
Icefall's LibriSpeech recipe uses Lhotse in a way that is suitable for small and medium-sized datasets -- it reads the whole manifest into memory and does various operations on...