lhotse
"BucketingSampler does not support working with lazy CutSet" when running icefall recipes
This commit https://github.com/lhotse-speech/lhotse/commit/0dceff169c9af0c70c8eda1266640a85409617e9 seems to break running icefall/egs/librispeech/ASR/*/train.py.
I now get the ValueError raised ("BucketingSampler does not support working with lazy CutSet") when running: python3 ./pruned_transducer_stateless2/train.py --exp-dir=./pruned_transducer_stateless2/exp --world-size 1 --num-epochs 26 --full-libri 1 --max-duration 300
I am using the librispeech datasets which are prepared in icefall and I have not modified anything.
@pzelasko what's the best way forward, since I believe you added this raise condition? Thanks!
There are two solutions:
- to leverage CPU memory savings, just switch to DynamicBucketingSampler (most args are the same)
- manually trigger loading all cuts into memory by calling cuts.to_eager(), e.g. BucketingSampler(cuts.to_eager(), max_duration=…)
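A minimal sketch of both options side by side, assuming a lhotse version from around the time of this thread (BucketingSampler may not be available in newer releases). The helper name build_sampler, its parameters, and the manifest path in the usage comment are hypothetical and only serve the illustration; the imports are kept inside the function so the snippet can be loaded without lhotse installed.

```python
def build_sampler(manifest_path, max_duration=300.0, use_dynamic=True):
    """Hypothetical helper: build a sampler from a lazily opened cut manifest."""
    # Local imports: lhotse is only required when the function is actually called.
    from lhotse import CutSet

    # Open the JSONL manifest lazily; cuts are read from disk on demand.
    cuts = CutSet.from_jsonl_lazy(manifest_path)

    if use_dynamic:
        # Option 1: DynamicBucketingSampler accepts a lazy CutSet directly,
        # which keeps CPU memory usage low.
        from lhotse.dataset import DynamicBucketingSampler

        return DynamicBucketingSampler(cuts, max_duration=max_duration, shuffle=True)

    # Option 2: load every cut into memory first, then use BucketingSampler.
    from lhotse.dataset import BucketingSampler

    return BucketingSampler(cuts.to_eager(), max_duration=max_duration, shuffle=True)


# Example call (the path is an assumption, not taken from the recipe):
# sampler = build_sampler("data/fbank/cuts_train.jsonl.gz", max_duration=300.0)
```

The trade-off is the one stated above: the first branch never materializes the full CutSet, while the second pays the memory cost up front so that the existing sampler keeps working unchanged.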
I’ll try to find some time to update Icefall, would be good to get some feedback from @csukuangfj and @danpovey which option they prefer; my recommendation is DynamicBucketingSampler.
I’ll try to find some time to update Icefall, would be good to get some feedback from @csukuangfj and @danpovey which option they prefer; my recommendation is DynamicBucketingSampler.
Both are fine for me. Shall we replace all ".json.gz" in icefall with ".jsonl.gz"?
Thanks for taking this up. All the uses of musan would also need to be updated. For example: recordings_music.jsonl -> musan_recordings_music.jsonl.gz, and similarly for other datasets which originally did not have a <corpus-name> prefix.
Thanks everyone, could you let me know when you've added a fix to icefall? For now I will use DynamicBucketingSampler. Also, newbie question - what is a jsonl vs json?
My understanding is that jsonl is json formatted so that there is one record per line. There might be a finer definition, but I have survived with this so far :) y.
On Fri, May 20, 2022 at 10:14 AM John Hughes wrote:
Thanks everyone, could you let me know when you've added a fix to icefall? For now I will use DynamicBucketingSampler. Also, newbie question - what is a jsonl vs json?
Also, newbie question - what is a jsonl vs json?
https://jsonlines.org
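To make the difference concrete, here is a small sketch using only the Python standard library. The records and file names are made up for illustration and are not real lhotse manifests. A .json.gz file holds one JSON document (here a list) that has to be parsed in full, while a .jsonl.gz file holds one JSON record per line and can be read one record at a time, which is what makes lazy reading possible.

```python
import gzip
import json
import os
import tempfile

# Made-up records standing in for manifest entries.
records = [{"id": f"cut-{i}", "duration": 1.5 * i} for i in range(3)]


def write_json(path, items):
    # One JSON document containing the whole list.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(items, f)


def write_jsonl(path, items):
    # One JSON record per line.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")


def read_json(path):
    # The entire document must be parsed before any record is available.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def iter_jsonl(path):
    # Records are decoded one line at a time, so memory use stays small.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


tmpdir = tempfile.mkdtemp()
json_path = os.path.join(tmpdir, "demo.json.gz")
jsonl_path = os.path.join(tmpdir, "demo.jsonl.gz")
write_json(json_path, records)
write_jsonl(jsonl_path, records)

# Only the first line of the JSONL file is decoded here.
first = next(iter_jsonl(jsonl_path))
print(first["id"])  # prints "cut-0"
```

Both files carry the same data; only the JSONL one can be consumed record by record without holding everything in memory.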