`StopIteration` while trying to resume training from a checkpoint
Hi there. I was trying to resume training from the checkpoint `checkpoint-28000.pt`, but it looks like the train sampler iterates until it raises `StopIteration`:

```
2022-08-24 23:20:54,372 INFO [train.py:940] (3/4) Training started
2022-08-24 23:20:54,372 INFO [train.py:950] (3/4) Device: cuda:3
2022-08-24 23:20:54,373 INFO [train.py:940] (1/4) Training started
2022-08-24 23:20:54,374 INFO [train.py:950] (1/4) Device: cuda:1
2022-08-24 23:20:54,375 INFO [train.py:940] (2/4) Training started
2022-08-24 23:20:54,375 INFO [train.py:950] (2/4) Device: cuda:2
2022-08-24 23:20:54,377 INFO [train.py:940] (0/4) Training started
2022-08-24 23:20:54,395 INFO [train.py:950] (0/4) Device: cuda:0
2022-08-24 23:20:55,915 INFO [lexicon.py:176] (1/4) Loading pre-compiled data/lang_char/Linv.pt
2022-08-24 23:20:55,926 INFO [lexicon.py:176] (3/4) Loading pre-compiled data/lang_char/Linv.pt
2022-08-24 23:20:55,933 INFO [lexicon.py:176] (2/4) Loading pre-compiled data/lang_char/Linv.pt
2022-08-24 23:20:55,935 INFO [lexicon.py:176] (0/4) Loading pre-compiled data/lang_char/Linv.pt
2022-08-24 23:20:56,297 INFO [train.py:972] (0/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 500, 'reset_interval': 2000, 'valid_interval': 20000, 'feature_dim': 80, 'subsampling_factor': 4, 'model_warm_step': 20000, 'env_info': {'k2-version': '1.17', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7b7a8a5f8a521b459a11732ef7cfbfe6e1693867', 'k2-git-date': 'Sun Aug 7 02:21:44 2022', 'lhotse-version': '1.6.0.dev+git.1d14fc2.clean', 'torch-version': '1.7.1', 'torch-cuda-available': True, 'torch-cuda-version': '10.1', 'python-version': '3.8', 'icefall-git-branch': 'gigaspeech_streaming', 'icefall-git-sha1': '5ea4d94-clean', 'icefall-git-date': 'Mon Aug 22 20:10:24 2022', 'icefall-path': '/my-t4gpu-spot-02/guanbo/k2/icefall', 'k2-path': '/my-t4gpu-01/guanbo/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.17.dev20220809+cuda10.1.torch1.7.1-py3.8-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/my02/guanbo/k2/lhotse/lhotse/__init__.py', 'hostname': 'my-t4gpu-spot-01', 'IP address': '10.138.0.36'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 28000, 'exp_dir': PosixPath('pruned_transducer_stateless5/exp'), 'lang_dir': PosixPath('data/lang_char'), 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 4000, 'keep_last_k': 60, 'average_period': 1000, 'use_fp16': True, 'num_encoder_layers': 18, 'dim_feedforward': 2048, 'nhead': 8, 'encoder_dim': 512, 'decoder_dim': 512, 'joiner_dim': 512, 'dynamic_chunk_training': True, 'causal_convolution': True, 'short_chunk_size': 25, 'num_left_chunks': 4, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 40, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 0, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'subset': 'XL', 'small_dev': False, 'blank_id': 0, 'vocab_size': 8340}
2022-08-24 23:20:56,297 INFO [train.py:974] (0/4) About to create model
2022-08-24 23:20:56,302 INFO [train.py:972] (3/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 500, 'reset_interval': 2000, 'valid_interval': 20000, 'feature_dim': 80, 'subsampling_factor': 4, 'model_warm_step': 20000, 'env_info': {'k2-version': '1.17', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7b7a8a5f8a521b459a11732ef7cfbfe6e1693867', 'k2-git-date': 'Sun Aug 7 02:21:44 2022', 'lhotse-version': '1.6.0.dev+git.1d14fc2.clean', 'torch-version': '1.7.1', 'torch-cuda-available': True, 'torch-cuda-version': '10.1', 'python-version': '3.8', 'icefall-git-branch': 'gigaspeech_streaming', 'icefall-git-sha1': '5ea4d94-clean', 'icefall-git-date': 'Mon Aug 22 20:10:24 2022', 'icefall-path': '/my-t4gpu-spot-02/guanbo/k2/icefall', 'k2-path': '/my-t4gpu-01/guanbo/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.17.dev20220809+cuda10.1.torch1.7.1-py3.8-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/my02/guanbo/k2/lhotse/lhotse/__init__.py', 'hostname': 'my-t4gpu-spot-01', 'IP address': '10.138.0.36'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 28000, 'exp_dir': PosixPath('pruned_transducer_stateless5/exp'), 'lang_dir': PosixPath('data/lang_char'), 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 4000, 'keep_last_k': 60, 'average_period': 1000, 'use_fp16': True, 'num_encoder_layers': 18, 'dim_feedforward': 2048, 'nhead': 8, 'encoder_dim': 512, 'decoder_dim': 512, 'joiner_dim': 512, 'dynamic_chunk_training': True, 'causal_convolution': True, 'short_chunk_size': 25, 'num_left_chunks': 4, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 40, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 0, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'subset': 'XL', 'small_dev': False, 'blank_id': 0, 'vocab_size': 8340}
2022-08-24 23:20:56,303 INFO [train.py:974] (3/4) About to create model
2022-08-24 23:20:56,303 INFO [train.py:972] (1/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 500, 'reset_interval': 2000, 'valid_interval': 20000, 'feature_dim': 80, 'subsampling_factor': 4, 'model_warm_step': 20000, 'env_info': {'k2-version': '1.17', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7b7a8a5f8a521b459a11732ef7cfbfe6e1693867', 'k2-git-date': 'Sun Aug 7 02:21:44 2022', 'lhotse-version': '1.6.0.dev+git.1d14fc2.clean', 'torch-version': '1.7.1', 'torch-cuda-available': True, 'torch-cuda-version': '10.1', 'python-version': '3.8', 'icefall-git-branch': 'gigaspeech_streaming', 'icefall-git-sha1': '5ea4d94-clean', 'icefall-git-date': 'Mon Aug 22 20:10:24 2022', 'icefall-path': '/my-t4gpu-spot-02/guanbo/k2/icefall', 'k2-path': '/my-t4gpu-01/guanbo/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.17.dev20220809+cuda10.1.torch1.7.1-py3.8-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/my02/guanbo/k2/lhotse/lhotse/__init__.py', 'hostname': 'my-t4gpu-spot-01', 'IP address': '10.138.0.36'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 28000, 'exp_dir': PosixPath('pruned_transducer_stateless5/exp'), 'lang_dir': PosixPath('data/lang_char'), 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 4000, 'keep_last_k': 60, 'average_period': 1000, 'use_fp16': True, 'num_encoder_layers': 18, 'dim_feedforward': 2048, 'nhead': 8, 'encoder_dim': 512, 'decoder_dim': 512, 'joiner_dim': 512, 'dynamic_chunk_training': True, 'causal_convolution': True, 'short_chunk_size': 25, 'num_left_chunks': 4, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 40, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 0, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'subset': 'XL', 'small_dev': False, 'blank_id': 0, 'vocab_size': 8340}
2022-08-24 23:20:56,303 INFO [train.py:974] (1/4) About to create model
2022-08-24 23:20:56,308 INFO [train.py:972] (2/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 500, 'reset_interval': 2000, 'valid_interval': 20000, 'feature_dim': 80, 'subsampling_factor': 4, 'model_warm_step': 20000, 'env_info': {'k2-version': '1.17', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7b7a8a5f8a521b459a11732ef7cfbfe6e1693867', 'k2-git-date': 'Sun Aug 7 02:21:44 2022', 'lhotse-version': '1.6.0.dev+git.1d14fc2.clean', 'torch-version': '1.7.1', 'torch-cuda-available': True, 'torch-cuda-version': '10.1', 'python-version': '3.8', 'icefall-git-branch': 'gigaspeech_streaming', 'icefall-git-sha1': '5ea4d94-clean', 'icefall-git-date': 'Mon Aug 22 20:10:24 2022', 'icefall-path': '/my-t4gpu-spot-02/guanbo/k2/icefall', 'k2-path': '/my-t4gpu-01/guanbo/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.17.dev20220809+cuda10.1.torch1.7.1-py3.8-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/my02/guanbo/k2/lhotse/lhotse/__init__.py', 'hostname': 'my-t4gpu-spot-01', 'IP address': '10.138.0.36'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 28000, 'exp_dir': PosixPath('pruned_transducer_stateless5/exp'), 'lang_dir': PosixPath('data/lang_char'), 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 4000, 'keep_last_k': 60, 'average_period': 1000, 'use_fp16': True, 'num_encoder_layers': 18, 'dim_feedforward': 2048, 'nhead': 8, 'encoder_dim': 512, 'decoder_dim': 512, 'joiner_dim': 512, 'dynamic_chunk_training': True, 'causal_convolution': True, 'short_chunk_size': 25, 'num_left_chunks': 4, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 40, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 0, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'subset': 'XL', 'small_dev': False, 'blank_id': 0, 'vocab_size': 8340}
2022-08-24 23:20:56,308 INFO [train.py:974] (2/4) About to create model
2022-08-24 23:20:57,165 INFO [train.py:978] (1/4) Number of model parameters: 132633420
2022-08-24 23:20:57,166 INFO [checkpoint.py:112] (1/4) Loading checkpoint from pruned_transducer_stateless5/exp/checkpoint-28000.pt
2022-08-24 23:20:57,166 INFO [train.py:978] (3/4) Number of model parameters: 132633420
2022-08-24 23:20:57,166 INFO [checkpoint.py:112] (3/4) Loading checkpoint from pruned_transducer_stateless5/exp/checkpoint-28000.pt
2022-08-24 23:20:57,168 INFO [train.py:978] (0/4) Number of model parameters: 132633420
2022-08-24 23:20:57,169 INFO [train.py:978] (2/4) Number of model parameters: 132633420
2022-08-24 23:20:57,169 INFO [checkpoint.py:112] (2/4) Loading checkpoint from pruned_transducer_stateless5/exp/checkpoint-28000.pt
2022-08-24 23:20:57,636 INFO [checkpoint.py:112] (0/4) Loading checkpoint from pruned_transducer_stateless5/exp/checkpoint-28000.pt
2022-08-24 23:21:15,231 INFO [checkpoint.py:131] (0/4) Loading averaged model
2022-08-24 23:21:15,420 INFO [train.py:993] (1/4) Using DDP
2022-08-24 23:21:15,423 INFO [train.py:993] (3/4) Using DDP
2022-08-24 23:21:15,424 INFO [train.py:993] (2/4) Using DDP
2022-08-24 23:21:15,573 INFO [train.py:993] (0/4) Using DDP
2022-08-24 23:21:15,701 INFO [train.py:1001] (1/4) Loading optimizer state dict
2022-08-24 23:21:15,701 INFO [train.py:1001] (2/4) Loading optimizer state dict
2022-08-24 23:21:15,701 INFO [train.py:1001] (3/4) Loading optimizer state dict
2022-08-24 23:21:15,701 INFO [train.py:1001] (0/4) Loading optimizer state dict
2022-08-24 23:21:16,319 INFO [train.py:1009] (0/4) Loading scheduler state dict
2022-08-24 23:21:16,320 INFO [asr_datamodule.py:404] (0/4) About to get train_combined cuts
2022-08-24 23:21:16,320 INFO [asr_datamodule.py:411] (0/4) About to get dev cuts
2022-08-24 23:21:16,334 INFO [asr_datamodule.py:346] (0/4) About to create dev dataset
2022-08-24 23:21:16,444 INFO [train.py:1009] (1/4) Loading scheduler state dict
2022-08-24 23:21:16,444 INFO [asr_datamodule.py:404] (1/4) About to get train_combined cuts
2022-08-24 23:21:16,444 INFO [asr_datamodule.py:411] (1/4) About to get dev cuts
2022-08-24 23:21:16,458 INFO [asr_datamodule.py:346] (1/4) About to create dev dataset
2022-08-24 23:21:16,467 INFO [train.py:1009] (2/4) Loading scheduler state dict
2022-08-24 23:21:16,467 INFO [asr_datamodule.py:404] (2/4) About to get train_combined cuts
2022-08-24 23:21:16,467 INFO [asr_datamodule.py:411] (2/4) About to get dev cuts
2022-08-24 23:21:16,470 INFO [train.py:1009] (3/4) Loading scheduler state dict
2022-08-24 23:21:16,471 INFO [asr_datamodule.py:404] (3/4) About to get train_combined cuts
2022-08-24 23:21:16,471 INFO [asr_datamodule.py:411] (3/4) About to get dev cuts
2022-08-24 23:21:16,472 INFO [asr_datamodule.py:346] (3/4) About to create dev dataset
2022-08-24 23:21:16,483 INFO [asr_datamodule.py:346] (2/4) About to create dev dataset
2022-08-24 23:21:16,907 INFO [asr_datamodule.py:367] (0/4) About to create dev dataloader
2022-08-24 23:21:16,907 INFO [asr_datamodule.py:217] (0/4) Enable MUSAN
2022-08-24 23:21:16,907 INFO [asr_datamodule.py:218] (0/4) About to get Musan cuts
2022-08-24 23:21:17,045 INFO [asr_datamodule.py:367] (1/4) About to create dev dataloader
2022-08-24 23:21:17,045 INFO [asr_datamodule.py:217] (1/4) Enable MUSAN
2022-08-24 23:21:17,045 INFO [asr_datamodule.py:218] (1/4) About to get Musan cuts
2022-08-24 23:21:17,056 INFO [asr_datamodule.py:367] (2/4) About to create dev dataloader
2022-08-24 23:21:17,056 INFO [asr_datamodule.py:217] (2/4) Enable MUSAN
2022-08-24 23:21:17,056 INFO [asr_datamodule.py:218] (2/4) About to get Musan cuts
2022-08-24 23:21:17,057 INFO [asr_datamodule.py:367] (3/4) About to create dev dataloader
2022-08-24 23:21:17,058 INFO [asr_datamodule.py:217] (3/4) Enable MUSAN
2022-08-24 23:21:17,058 INFO [asr_datamodule.py:218] (3/4) About to get Musan cuts
2022-08-24 23:21:18,776 INFO [asr_datamodule.py:246] (0/4) Enable SpecAugment
2022-08-24 23:21:18,776 INFO [asr_datamodule.py:247] (0/4) Time warp factor: 80
2022-08-24 23:21:18,776 INFO [asr_datamodule.py:259] (0/4) Num frame mask: 10
2022-08-24 23:21:18,776 INFO [asr_datamodule.py:272] (0/4) About to create train dataset
2022-08-24 23:21:18,776 INFO [asr_datamodule.py:300] (0/4) Using DynamicBucketingSampler.
2022-08-24 23:21:18,908 INFO [asr_datamodule.py:246] (2/4) Enable SpecAugment
2022-08-24 23:21:18,909 INFO [asr_datamodule.py:247] (2/4) Time warp factor: 80
2022-08-24 23:21:18,909 INFO [asr_datamodule.py:259] (2/4) Num frame mask: 10
2022-08-24 23:21:18,909 INFO [asr_datamodule.py:272] (2/4) About to create train dataset
2022-08-24 23:21:18,909 INFO [asr_datamodule.py:300] (2/4) Using DynamicBucketingSampler.
2022-08-24 23:21:18,909 INFO [asr_datamodule.py:246] (3/4) Enable SpecAugment
2022-08-24 23:21:18,909 INFO [asr_datamodule.py:247] (3/4) Time warp factor: 80
2022-08-24 23:21:18,909 INFO [asr_datamodule.py:259] (3/4) Num frame mask: 10
2022-08-24 23:21:18,909 INFO [asr_datamodule.py:272] (3/4) About to create train dataset
2022-08-24 23:21:18,909 INFO [asr_datamodule.py:300] (3/4) Using DynamicBucketingSampler.
2022-08-24 23:21:18,929 INFO [asr_datamodule.py:246] (1/4) Enable SpecAugment
2022-08-24 23:21:18,929 INFO [asr_datamodule.py:247] (1/4) Time warp factor: 80
2022-08-24 23:21:18,929 INFO [asr_datamodule.py:259] (1/4) Num frame mask: 10
2022-08-24 23:21:18,929 INFO [asr_datamodule.py:272] (1/4) About to create train dataset
2022-08-24 23:21:18,929 INFO [asr_datamodule.py:300] (1/4) Using DynamicBucketingSampler.
2022-08-24 23:21:21,787 INFO [asr_datamodule.py:315] (2/4) About to create train dataloader
2022-08-24 23:21:21,787 INFO [asr_datamodule.py:318] (2/4) Loading sampler state dict
2022-08-24 23:21:21,815 INFO [asr_datamodule.py:315] (3/4) About to create train dataloader
2022-08-24 23:21:21,815 INFO [asr_datamodule.py:318] (3/4) Loading sampler state dict
2022-08-24 23:21:21,859 INFO [asr_datamodule.py:315] (1/4) About to create train dataloader
2022-08-24 23:21:21,859 INFO [asr_datamodule.py:318] (1/4) Loading sampler state dict
2022-08-24 23:21:21,903 INFO [asr_datamodule.py:315] (0/4) About to create train dataloader
2022-08-24 23:21:21,904 INFO [asr_datamodule.py:318] (0/4) Loading sampler state dict
Traceback (most recent call last):
  File "./pruned_transducer_stateless5/train.py", line 1218, in <module>
    main()
  File "./pruned_transducer_stateless5/train.py", line 1209, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/my-t4gpu-01/guanbo/miniconda3/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/my-t4gpu-01/guanbo/miniconda3/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/my-t4gpu-01/guanbo/miniconda3/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/my-t4gpu-01/guanbo/miniconda3/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/my-t4gpu-spot-02/guanbo/k2/icefall/egs/my_tw/ASR/pruned_transducer_stateless5/train.py", line 1056, in run
    train_dl = myspeech.train_dataloaders(
  File "/my-t4gpu-spot-02/guanbo/k2/icefall/egs/my_tw/ASR/pruned_transducer_stateless5/asr_datamodule.py", line 319, in train_dataloaders
    train_sampler.load_state_dict(sampler_state_dict)
  File "/my02/guanbo/k2/lhotse/lhotse/dataset/sampling/dynamic_bucketing.py", line 174, in load_state_dict
    self._fast_forward()
  File "/my02/guanbo/k2/lhotse/lhotse/dataset/sampling/dynamic_bucketing.py", line 190, in _fast_forward
    next(self)
  File "/my02/guanbo/k2/lhotse/lhotse/dataset/sampling/base.py", line 261, in __next__
    batch = self._next_batch()
  File "/my02/guanbo/k2/lhotse/lhotse/dataset/sampling/dynamic_bucketing.py", line 232, in _next_batch
    batch = next(self.cuts_iter)
StopIteration
```
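For context on what the last few frames do: below is a minimal, self-contained sketch of the fast-forward pattern visible in the traceback (`load_state_dict` → `_fast_forward` → `next(self)`). It is not lhotse's actual implementation; `ToySampler` and its fields are made up. It only shows why replaying the recorded batch count can exhaust the underlying iterator when the resumed run does not reproduce the original batches exactly (here a changed batch size stands in for any such mismatch).

```python
# Toy reproduction of the fast-forward failure mode. Not lhotse code.
class ToySampler:
    """Serves fixed-size batches from `data` and counts batches served."""

    def __init__(self, data, batch_size):
        self.batch_size = batch_size
        self.num_batches_served = 0
        self._iter = iter(list(data))

    def __iter__(self):
        return self

    def __next__(self):
        batch = []
        for _ in range(self.batch_size):
            batch.append(next(self._iter))  # StopIteration when data runs out
        self.num_batches_served += 1
        return batch

    def state_dict(self):
        return {"num_batches_served": self.num_batches_served}

    def load_state_dict(self, state):
        # Fast-forward: replay the recorded number of batches. If the
        # resumed run consumes more data per replayed batch than the
        # original run did, next(self) eventually raises StopIteration --
        # the error in the traceback above.
        for _ in range(state["num_batches_served"]):
            next(self)


original = ToySampler(range(10), batch_size=2)
for _ in range(3):
    next(original)                 # serves 3 batches: [0,1], [2,3], [4,5]
state = original.state_dict()      # {'num_batches_served': 3}

resumed = ToySampler(range(10), batch_size=4)  # any mismatch vs. the original
try:
    resumed.load_state_dict(state)  # replays [0..3], [4..7], then runs dry
except StopIteration:
    print("StopIteration during fast-forward, as in the traceback above")
```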
I’m aware of the issue; it’s on my list to look at as soon as I get a bit more time. Please watch https://github.com/lhotse-speech/lhotse/issues/785
Hi @pzelasko, are there any updates on this issue?
Unfortunately no, I haven't been able to find time to work on this (but I still remember it). If somebody could help debug it, that would be great.
So I assume the issue is that we have recorded the batch number within the epoch, but when it tries to step that many batches it runs out of data before it reaches the right point?
Did you change anything before re-starting, like the --max-duration or number of jobs? I can imagine that if settings like that were changed, the same number of batches might exhaust the data loader.
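To make that concrete, a back-of-the-envelope check of the "changed settings" hypothesis (the saved-run value reuses `--max-duration 40` from the log above; the doubled value on resume is hypothetical). Assuming fast-forward replays the recorded batch count with the *current* settings, a larger per-batch duration means the replay consumes more audio than the original run did:

```python
# Hypothetical arithmetic; only --max-duration 40 and checkpoint-28000
# come from this issue, the resumed value of 80 is made up.
batches_recorded = 28000        # from checkpoint-28000.pt
max_duration_at_save = 40.0     # seconds of audio per batch (log above)
max_duration_at_resume = 80.0   # hypothetical changed setting

hours_at_save = batches_recorded * max_duration_at_save / 3600
hours_on_resume = batches_recorded * max_duration_at_resume / 3600
print(f"saved after ~{hours_at_save:.0f} h of audio; "
      f"replaying 28000 batches now needs ~{hours_on_resume:.0f} h")
# -> saved after ~311 h of audio; replaying 28000 batches now needs ~622 h.
# If the epoch holds less audio than that, the replay hits StopIteration.
```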
> So I assume the issue is that we have recorded the batch number within the epoch, but when it tries to step that many batches it runs out of data before it reaches the right point?
Yes. I wanted to resume training from a checkpoint within the epoch, but it seems that the sampler iterated too many steps.
> Did you change anything before re-starting, like the --max-duration or number of jobs? I can imagine that if settings like that were changed, the same number of batches might exhaust the data loader.
No, I didn't change any settings.
I’ll try to revisit the issue today or tomorrow.
In the meantime, if it’s giving you trouble, the easiest workaround for the bug is not to resume the sampler checkpoint; see the sketch below. For sufficiently large data it probably won’t make much difference (unless you restart very frequently).
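Concretely, in the `train.py`/`asr_datamodule.py` pair from the traceback, this amounts to passing `None` as the sampler state so that `train_sampler.load_state_dict(...)` is never reached. A sketch only: `train_cuts` and the keyword argument are assumptions modeled on the icefall recipes; just `myspeech.train_dataloaders` and the `load_state_dict` call are taken from the traceback.

```python
# Sketch of the workaround: resume model/optimizer/scheduler as before,
# but drop the sampler state so the sampler starts a fresh epoch.
sampler_state_dict = None  # instead of the "sampler" entry from the checkpoint
train_dl = myspeech.train_dataloaders(
    train_cuts,                              # hypothetical training CutSet
    sampler_state_dict=sampler_state_dict,   # None -> no fast-forward attempt
)
```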
I moved the discussion to the Lhotse issue: https://github.com/lhotse-speech/lhotse/issues/785#issuecomment-1262978092