
Issues with long cutset during zipformer training

Open · bsshruthi22 opened this issue 11 months ago · 8 comments

Hello all, we are training a zipformer model on about 3400 hours of Tamil data. This is in reference to https://github.com/k2-fsa/icefall/issues/1751. We have an NVIDIA A6000 GPU (48 GB) and are getting the error below:

```
2024-12-19 16:44:26,604 INFO [asr_datamodule.py:375] About to create dev dataloader
2024-12-19 16:44:26,604 INFO [train.py:1326] Sanity check -- see if any of the batches in epoch 1 would cause OOM.
/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/lhotse/dataset/sampling/dynamic.py:342: UserWarning: We have exceeded the max_duration constraint during sampling but have only 1 cut. This is likely because max_duration was set to a very low value ~10s, or you're using a CutSet with very long cuts (e.g. 100s of seconds long).
  warnings.warn(
2024-12-19 16:58:57,481 ERROR [train.py:1345] Your GPU ran out of memory with the current max_duration setting. We recommend decreasing max_duration and trying again. Failing criterion: single_longest_cut (=162.26) ...
2024-12-19 16:58:57,482 INFO [train.py:1304] Saving batch to zipformer/exp/batch-6c307511-b2b9-437a-28df-6ec4ce4a2bbd.pt
2024-12-19 16:59:01,314 INFO [train.py:1310] features shape: torch.Size([7, 16226, 80])
2024-12-19 16:59:01,315 INFO [train.py:1314] num tokens: 817
Traceback (most recent call last):
  File "./zipformer/train.py", line 1380, in <module>
    main()
  File "./zipformer/train.py", line 1373, in main
    run(rank=0, world_size=1, args=args)
  File "./zipformer/train.py", line 1225, in run
    scan_pessimistic_batches_for_oom(
  File "./zipformer/train.py", line 1341, in scan_pessimistic_batches_for_oom
    loss.backward()
  File "/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/armuser/k2_249/icefall/egs/tamil/ASR/zipformer/scaling.py", line 313, in backward
    x_grad = x_grad - ans * x_grad.sum(dim=ctx.dim, keepdim=True)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.86 GiB (GPU 0; 47.54 GiB total capacity; 37.04 GiB already allocated; 3.94 GiB free; 41.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

Training command:

```
./zipformer/train.py \
  --world-size 1 \
  --num-epochs 30 \
  --start-batch 336000 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --max-duration 100
```

We had initially set --max-duration to 150. The training completed 4 epochs and then hit the above issue. We loaded the saved batch.pt file and got the following:

```
'sequence_idx': tensor([0, 1, 2, 3, 4, 5, 6], dtype=torch.int32),
'start_frame': tensor([0, 0, 0, 0, 0, 0, 0], dtype=torch.int32),
'num_frames': tensor([16226, 2022, 1926, 1744, 1716, 1676, 1613], dtype=torch.int32),
'cut': [MonoCut(id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', start=0.0, duration=162.26, channel=0,
    supervisions=[SupervisionSegment(id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', recording_id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', start=0.0, duration=162.26, channel=0, text='அவ்வையார் விருது தமிழ்நாட்டில் சமூகநலப் பணிகளை அரப்பணிப்புடன் செயலாற்றியதாக 2020ஆம் ஆண்டிற்கான அவ்வையார் விருதுக்கு தேர்வு செய்யப்பட்ட திருவண்ணாமலையைச் சேர்ந்த சமூக சேவகி திருமதி', language=None, speaker='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', gender=None, custom={'origin': 'giga'}, alignment=None)],
    features=Features(type='kaldifeat-fbank', num_frames=16226, num_features=80, frame_shift=0.01, sampling_rate=8000, start=0, duration=162.26, storage_type='lilcom_chunky', storage_path='/home/armuser/10TBHDD/CUDA_11.6/icefall/egs/tamil/ASR/data/fbank/train_split/tamil_feats_train_00032581.lca', storage_key='964876,45872,45111,44652,45255,45498,44806,45091,45317,44865,45016,44804,44720,44784,44749,45046,44983,44943,45297,44866,45335,45125,45507,44978,44909,44841,44914,44718,44569,45297,44670,45390,44619,20203', recording_id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', channels=0),
    recording=Recording(id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', sources=[AudioSource(type='file', channels=[0], source='/media/ASR_database/shruthilipi_data/tamil/newsonair_renamed/Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97.wav')], sampling_rate=8000, num_samples=1298080, duration=162.26, channel_ids=[0], transforms=None),
```
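For reference, we inspected the saved batch roughly like this (a minimal sketch, assuming the dict layout with "inputs" and "supervisions" keys that icefall's train.py saves via torch.save):

```python
import torch

# Load the batch that train.py dumped when it hit OOM
# (path taken from the log above). Requires lhotse installed,
# since the dict contains MonoCut objects.
batch = torch.load(
    "zipformer/exp/batch-6c307511-b2b9-437a-28df-6ec4ce4a2bbd.pt",
    map_location="cpu",
)

print(batch["inputs"].shape)  # feature tensor, e.g. torch.Size([7, 16226, 80])
for cut in batch["supervisions"]["cut"]:
    # The first cut is the offending 162.26 s utterance.
    print(cut.id, cut.duration)
```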

Kindly suggest how to go about this issue.

bsshruthi22 · Dec 20 '24 16:12

We have a function in train.py that removes overly long and short utterances; it is enabled by default. Please don't disable it.

csukuangfj · Dec 20 '24 18:12

@csukuangfj In train.py there is `train_cuts = train_cuts.filter(remove_short_utt)`. I was not able to find any option for removing long utterances.

bsshruthi22 · Dec 21 '24 04:12

> @csukuangfj In train.py there is `train_cuts = train_cuts.filter(remove_short_utt)`. I was not able to find any option for removing long utterances.

Which file are you referring to?

Please recheck.

csukuangfj · Dec 21 '24 04:12

@csukuangfj I am using this file: https://github.com/k2-fsa/icefall/blob/master/egs/gigaspeech/ASR/zipformer/train.py

bsshruthi22 · Dec 21 '24 05:12

Please refer to the librispeech recipe.

csukuangfj · Dec 21 '24 09:12

https://github.com/k2-fsa/icefall/blob/ad966fb81d76c9b6780cac6844d9c4aa1782a46b/egs/librispeech/ASR/zipformer/train.py#L1377-L1385

@bsshruthi22

Please read the comment in train.py carefully.
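The linked lines apply a duration-based filter, roughly like this (a paraphrased sketch; see the permalink above for the exact code and its comments):

```python
from lhotse import Cut


def remove_short_and_long_utt(c: Cut):
    # Keep only utterances with duration between 1 second and 20 seconds.
    #
    # Caution: the 20.0 s threshold is chosen deliberately. Use
    # ../local/display_manifest_statistics.py to inspect the duration
    # distribution of your own data before changing it.
    if c.duration < 1.0 or c.duration > 20.0:
        return False
    return True


train_cuts = train_cuts.filter(remove_short_and_long_utt)
```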

csukuangfj · Dec 24 '24 02:12

@csukuangfj OK, thanks for your suggestion. The training has now resumed; hopefully it completes without any errors.

bsshruthi22 · Dec 24 '24 06:12

@csukuangfj Is there any way to retain audio that is longer than 20 s or shorter than 1 s by modifying the cuts, so that training doesn't error out?
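For example, would splitting long cuts into windows work, something like the following (a hypothetical, untested sketch using lhotse's CutSet.cut_into_windows; the manifest paths are made up)?

```python
from lhotse import CutSet

cuts = CutSet.from_file("data/fbank/tamil_cuts_train.jsonl.gz")  # hypothetical path

# Keep short/medium cuts as they are; split anything over 20 s into
# ~20 s windows so no single cut exceeds the recipe's threshold.
short_cuts = cuts.filter(lambda c: c.duration <= 20.0)
long_cuts = cuts.filter(lambda c: c.duration > 20.0)
windowed = long_cuts.cut_into_windows(duration=20.0)

# Caution: windowing splits a cut mid-supervision, so the transcript
# may no longer align with each window unless word alignments exist.
fixed = (short_cuts + windowed).to_eager()
fixed.to_file("data/fbank/tamil_cuts_train_fixed.jsonl.gz")  # hypothetical path
```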

bsshruthi22 · Dec 29 '24 10:12