"BucketingSampler does not support working with lazy CutSet" when running icefall recipes

Open jplhughes opened this issue 2 years ago • 6 comments

This commit https://github.com/lhotse-speech/lhotse/commit/0dceff169c9af0c70c8eda1266640a85409617e9 seems to break running icefall/egs/librispeech/ASR/*/train.py.

I now get the ValueError raised ("BucketingSampler does not support working with lazy CutSet") when running: python3 ./pruned_transducer_stateless2/train.py --exp-dir=./pruned_transducer_stateless2/exp --world-size 1 --num-epochs 26 --full-libri 1 --max-duration 300.

I am using the librispeech datasets which are prepared in icefall and I have not modified anything.

@pzelasko what's the best way forward, since I believe you added this raise condition? Thanks!

jplhughes avatar May 19 '22 10:05 jplhughes

There are two solutions:

  • to leverage CPU memory savings, just switch to DynamicBucketingSampler (most args are the same)
  • manually trigger loading all cuts into memory by calling cuts.to_eager(), e.g. BucketingSampler(cuts.to_eager(), max_duration=…)
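
A minimal sketch of the two options side by side, assuming a lhotse version that still ships both samplers. The helper name `build_sampler` and its `dynamic` flag are hypothetical and exist only for illustration; the default of 300 mirrors the `--max-duration 300` from the report above.

```python
def build_sampler(cuts, max_duration=300.0, dynamic=True):
    """Build a bucketing sampler for `cuts`, which may be a lazy CutSet.

    Hypothetical helper for illustration; it is not part of lhotse or icefall.
    """
    if dynamic:
        # Option 1: DynamicBucketingSampler accepts the lazy CutSet as-is,
        # so the cuts do not all have to be held in CPU memory.
        from lhotse.dataset import DynamicBucketingSampler

        return DynamicBucketingSampler(cuts, max_duration=max_duration)

    # Option 2: load every cut into memory first, then use the
    # original BucketingSampler on the resulting eager CutSet.
    from lhotse.dataset import BucketingSampler

    return BucketingSampler(cuts.to_eager(), max_duration=max_duration)
```

The lhotse imports sit inside the function so that only the sampler actually requested has to be available. In a recipe, `cuts` would typically be whatever CutSet the data module already opens from its manifest.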

I’ll try to find some time to update Icefall, would be good to get some feedback from @csukuangfj and @danpovey which option they prefer; my recommendation is DynamicBucketingSampler.

pzelasko avatar May 19 '22 11:05 pzelasko

Both are fine for me. Shall we replace all ".json.gz" in icefall with ".jsonl.gz"?

csukuangfj avatar May 19 '22 11:05 csukuangfj

Thanks for taking this up. All the uses of musan would also need to be updated. For example: recordings_music.jsonl -> musan_recordings_music.jsonl.gz, and similarly for other datasets which originally did not have a <corpus-name> prefix.

desh2608 avatar May 19 '22 12:05 desh2608

Thanks everyone, could you let me know when you've added a fix to icefall? For now I will use DynamicBucketingSampler. Also, newbie question - what is a jsonl vs json?

jplhughes avatar May 20 '22 14:05 jplhughes

My understanding is that jsonl is json formatted so that there is one record per line. There might be a finer definition, but I have survived with this so far :) y.

jtrmal avatar May 20 '22 14:05 jtrmal

Also, newbie question - what is a jsonl vs json?

https://jsonlines.org
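
To make the difference concrete, here is a small sketch using only Python's standard `json` module. The records and the helper `iter_jsonl` are made up for illustration and are not real lhotse manifests.

```python
import json

# Hypothetical records, standing in for manifest entries.
records = [
    {"id": "cut-1", "duration": 3.2},
    {"id": "cut-2", "duration": 5.0},
]

# JSON: a single document that holds the whole list.
# It has to be parsed in full before any record is available.
as_json = json.dumps(records)

# JSON Lines: one self-contained JSON value per line.
as_jsonl = "\n".join(json.dumps(record) for record in records)


def iter_jsonl(text):
    """Yield one parsed record per non-empty line."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


for record in iter_jsonl(as_jsonl):
    print(record["id"], record["duration"])
```

Because every line parses on its own, a reader can hand out records one at a time without loading the whole file, which is the property lazy reading relies on.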

pzelasko avatar May 20 '22 14:05 pzelasko