
Losses are NaN and Infinite

SSwethaSel0609 opened this issue 10 months ago • 3 comments

I'm fine-tuning the zipformer model. When I fine-tuned with 100 hours of data there was no issue, but when I fine-tune with 3000 hours of data I get infinite or NaN losses. What could be the cause of this issue?

[1,mpirank:5,algo-1]:2025-01-19 09:00:32,064 INFO [finetune.py:1142] (5/8) Epoch 7, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3310.00 frames. ], tot_loss[over 792510.56 frames. ], batch size: 14, lr: 4.28e-03,
[1,mpirank:0,algo-1]:2025-01-19 09:00:32,065 INFO [finetune.py:1142] (0/8) Epoch 7, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 4819.00 frames. ], tot_loss[over 814690.14 frames. ], batch size: 58, lr: 4.28e-03,
[1,mpirank:6,algo-1]:2025-01-19 09:00:32,068 INFO [finetune.py:1142] (6/8) Epoch 7, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3370.00 frames. ], tot_loss[over 799670.96 frames. ], batch size: 13, lr: 4.28e-03,
[1,mpirank:3,algo-1]:2025-01-19 09:00:32,070 INFO [finetune.py:1142] (3/8) Epoch 7, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 4945.00 frames. ], tot_loss[over 807011.63 frames. ], batch size: 33, lr: 4.28e-03,
[1,mpirank:2,algo-1]:2025-01-19 09:00:32,071 INFO [finetune.py:1142] (2/8) Epoch 7, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 4949.00 frames. ], tot_loss[over 812248.61 frames. ], batch size: 66, lr: 4.28e-03,
[1,mpirank:1,algo-1]:2025-01-19 09:00:32,073 INFO [finetune.py:1142] (1/8) Epoch 7, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 4903.00 frames. ], tot_loss[over 823203.24 frames. ], batch size: 49, lr: 4.28e-03,
[1,mpirank:4,algo-1]:2025-01-19 09:00:32,075 INFO [finetune.py:1142] (4/8) Epoch 7, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 4743.00 frames. ], tot_loss[over 806376.22 frames. ], batch size: 27, lr: 4.28e-03,

SSwethaSel0609 · Jan 21 '25 05:01

Which directory did you get the code from? The later version in zipformer/ is more stable; there are earlier versions that eventually become unstable like that. If you rerun from the epoch that failed, i.e. epoch 7, with --inf-check=True, it should produce some output that indicates what the problem is.
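Concretely, the rerun could look something like the following. This is only a sketch: --inf-check is the flag mentioned above, while --start-epoch, --world-size and --exp-dir are the usual icefall recipe options, and the paths are placeholders for your setup.

./zipformer/finetune.py \
  --world-size 8 \
  --start-epoch 7 \
  --inf-check True \
  --exp-dir zipformer/exp_finetune \
  # ... plus the rest of your usual fine-tuning options

What --inf-check does, roughly, is attach hooks to the model's modules so that the first non-finite activation is reported by layer name, instead of the problem only showing up as a NaN loss. A minimal PyTorch sketch of the idea (illustrative only, not icefall's actual implementation):

import torch
import torch.nn as nn

def add_inf_check_hooks(model: nn.Module) -> None:
    # Raise as soon as any module produces a non-finite output, so the
    # failing layer is named rather than the loss silently going to nan.
    def make_hook(name: str):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(f"non-finite output in module {name!r}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))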

danpovey · Jan 21 '25 14:01

How can I find out which version I'm using? I started from epoch 1. Here is the log output around where the losses first become NaN:

[1,mpirank:5,algo-1]:2025-01-18 18:21:26,565 INFO [finetune.py:1142] (5/8) Epoch 2, batch 5850, loss[loss=0.2428, simple_loss=0.2706, pruned_loss=0.0785, ctc_loss=0.1449, over 2699.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.2835, pruned_loss=0.1019, ctc_loss=0.1808, over 643152.39 frames. ], batch size: 10, lr: 4.48e-03,
[1,mpirank:2,algo-1]:2025-01-18 18:21:26,572 INFO [finetune.py:1142] (2/8) Epoch 2, batch 5850, loss[loss=0.1814, simple_loss=0.2199, pruned_loss=0.05261, ctc_loss=0.09411, over 2742.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.2847, pruned_loss=0.1039, ctc_loss=0.1847, over 636660.49 frames. ], batch size: 10, lr: 4.48e-03,
[1,mpirank:4,algo-1]:2025-01-18 18:21:26,573 INFO [finetune.py:1142] (4/8) Epoch 2, batch 5850, loss[loss=0.3142, simple_loss=0.3101, pruned_loss=0.1158, ctc_loss=0.2165, over 3081.00 frames. ], tot_loss[loss=0.282, simple_loss=0.2847, pruned_loss=0.1029, ctc_loss=0.1838, over 642994.73 frames. ], batch size: 12, lr: 4.48e-03,
[1,mpirank:1,algo-1]:2025-01-18 18:21:26,574 INFO [finetune.py:1142] (1/8) Epoch 2, batch 5850, loss[loss=0.3755, simple_loss=0.3408, pruned_loss=0.1521, ctc_loss=0.2651, over 3001.00 frames. ], tot_loss[loss=0.286, simple_loss=0.2879, pruned_loss=0.105, ctc_loss=0.1852, over 635433.97 frames. ], batch size: 11, lr: 4.48e-03,
[1,mpirank:3,algo-1]:2025-01-18 18:23:08,504 INFO [finetune.py:1142] (3/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3194.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.2546, pruned_loss=0.09303, ctc_loss=0.1655, over 636192.78 frames. ], batch size: 11, lr: 4.48e-03,
[1,mpirank:6,algo-1]:2025-01-18 18:23:08,505 INFO [finetune.py:1142] (6/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3187.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.2572, pruned_loss=0.09711, ctc_loss=0.171, over 635349.10 frames. ], batch size: 13, lr: 4.48e-03,
[1,mpirank:7,algo-1]:2025-01-18 18:23:08,505 INFO [finetune.py:1142] (7/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3130.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.2522, pruned_loss=0.09163, ctc_loss=0.1638, over 637478.71 frames. ], batch size: 13, lr: 4.48e-03,
[1,mpirank:0,algo-1]:2025-01-18 18:23:08,506 INFO [finetune.py:1142] (0/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 2829.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.2517, pruned_loss=0.09079, ctc_loss=0.1618, over 636327.01 frames. ], batch size: 10, lr: 4.48e-03,
[1,mpirank:5,algo-1]:2025-01-18 18:23:08,508 INFO [finetune.py:1142] (5/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 2825.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.2527, pruned_loss=0.09098, ctc_loss=0.1624, over 638286.87 frames. ], batch size: 12, lr: 4.48e-03,
[1,mpirank:2,algo-1]:2025-01-18 18:23:08,511 INFO [finetune.py:1142] (2/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3732.00 frames. ], tot_loss[loss=0.251, simple_loss=0.2528, pruned_loss=0.09189, ctc_loss=0.1643, over 635974.59 frames. ], batch size: 13, lr: 4.48e-03,
[1,mpirank:1,algo-1]:2025-01-18 18:23:08,512 INFO [finetune.py:1142] (1/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3308.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.2542, pruned_loss=0.09245, ctc_loss=0.164, over 631021.03 frames. ], batch size: 14, lr: 4.48e-03,
[1,mpirank:4,algo-1]:2025-01-18 18:23:08,514 INFO [finetune.py:1142] (4/8) Epoch 2, batch 5900, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3168.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.2521, pruned_loss=0.09174, ctc_loss=0.1649, over 639371.58 frames. ], batch size: 12, lr: 4.48e-03,
[1,mpirank:3,algo-1]:2025-01-18 18:24:45,694 INFO [finetune.py:1142] (3/8) Epoch 2, batch 5950, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 3513.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.1978, pruned_loss=0.07229, ctc_loss=0.1286, over 637232.71 frames. ], batch size: 13, lr: 4.48e-03,

SSwethaSel0609 · Jan 21 '25 14:01

Well, if it's a git repo, "git log -1" might tell you; if you are using a pip package, then "pip show icefall" will. But what directory did you find the scripts in? That also matters.
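I.e., something like this (paths here are placeholders for your setup):

cd /path/to/icefall     # wherever the finetune.py you are running lives
git log -1              # prints the commit you are currently on
pip show icefall        # alternatively, if icefall was installed as a package

And note which recipe subdirectory the script came from, e.g. egs/librispeech/ASR/zipformer/ as opposed to one of the older recipe directories.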

danpovey · Jan 22 '25 04:01