
grad_norm=INF while training a custom model

Open Kerrycarry opened this issue 11 months ago • 0 comments

Hi, I am training a model with Audiocraft. In the training logs, I noticed that starting from epoch 6, many training summaries report grad_norm=INF or grad_norm=NAN. I also observed that the model's generation quality stopped improving after epoch 6.

My question is: could the grad_norm=INF or grad_norm=NAN be the reason for the failure in learning?

Additionally, I’m curious why the training summary reports grad_norm=INF/NAN while per-step logs earlier in the same epoch show normal grad_norm values. For example, in the log below, Train Summary | Epoch 41 shows grad_norm=INF, but the line Train | Epoch 41 | 1800/2000, logged 200 steps before the end of that epoch, shows grad_norm 2.766E-01. Is this expected behavior in Audiocraft?
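One possible explanation, sketched under assumptions: the fractional grad_scale values in the log (for example 431226.880 at step 800/2000 of epoch 41) suggest that logged values are averages rather than single-step readings. If the per-step lines average only the steps since the previous log line, while the summary averages the whole epoch, then one non-finite step is enough to make the summary INF even though most per-step lines look normal. The window size, the norm values, and the position of the bad step below are all hypothetical.

```python
import math


def mean(values):
    """Plain arithmetic mean, as an epoch-level summary might compute it."""
    return sum(values) / len(values)


# Hypothetical per-step gradient norms for one epoch of 2000 steps:
# all ordinary, except one step whose gradients overflowed to infinity.
step_grad_norms = [0.28] * 2000
step_grad_norms[900] = math.inf

# Windowed averages, reset every 200 steps
# (assumed to mirror the "Train | Epoch 41 | N/2000" lines).
window = 200
windowed = [
    mean(step_grad_norms[i:i + window])
    for i in range(0, len(step_grad_norms), window)
]

# Whole-epoch average (assumed to mirror the "Train Summary" line).
epoch_summary = mean(step_grad_norms)

finite_windows = sum(math.isfinite(w) for w in windowed)
print(f"finite windows: {finite_windows} of {len(windowed)}")
print(f"epoch summary is infinite: {math.isinf(epoch_summary)}")
```

Under these assumptions only the window containing the bad step is non-finite, which would match the single grad_norm INF line at 1000/2000 in epoch 41, while the epoch-level mean is INF. A single NaN step would likewise turn the epoch mean into NaN.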

Any help or insights would be greatly appreciated.

Below is the full training log for reference:

Dora directory: /root/autodl-tmp/audiocraft_root
Executor: Starting 1 worker processes for DDP.
Dora directory: /root/autodl-tmp/audiocraft_root
/root/miniconda3/envs/audiocraft/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
[05-07 13:05:21][dora.distrib][INFO] - world_size is 1, skipping init.
[05-07 13:05:21][flashy.solver][INFO] - Instantiating solver CustomSolver for XP e6ef71eb
[05-07 13:05:21][flashy.solver][INFO] - All XP logs are stored in /root/autodl-tmp/audiocraft_root/xps/e6ef71eb
[05-07 13:05:21][audiocraft.solvers.builders][INFO] - Loading audio data split train: /root/autodl-tmp/audiocraft/egs/train
[05-07 13:05:25][audiocraft.solvers.builders][INFO] - Loading audio data split valid: /root/autodl-tmp/audiocraft/egs/test
[05-07 13:05:26][audiocraft.solvers.builders][INFO] - Loading audio data split evaluate: /root/autodl-tmp/audiocraft/egs/test
[05-07 13:05:32][audiocraft.solvers.builders][INFO] - Loading audio data split generate: /root/autodl-tmp/audiocraft/egs/test
[05-07 13:05:33][audiocraft.optim.dadam][INFO] - Using decoupled weight decay
[05-07 13:05:33][flashy.solver][INFO] - Model hash: ec624482a1cfb5403b9af70056e28ebd4929ff48
[05-07 13:05:33][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[05-07 13:05:33][flashy.solver][INFO] - Model size: 2.56 M params
[05-07 13:05:33][flashy.solver][INFO] - Base memory usage, with model, grad and optim: 0.04 GB
[05-07 13:05:33][flashy.solver][INFO] - Restoring weights and history.
[05-07 13:05:33][flashy.solver][INFO] - Loading existing checkpoint: /root/autodl-tmp/audiocraft_root/xps/e6ef71eb/checkpoint.th
[05-07 13:05:33][audiocraft.utils.checkpoint][INFO] - Checkpoint loaded from /root/autodl-tmp/audiocraft_root/xps/e6ef71eb/checkpoint.th
[05-07 13:05:33][flashy.solver][INFO] - Model hash: 0cdae0955aed30cc06f13a90a7e6f1851f44dd30
[05-07 13:05:33][flashy.solver][INFO] - Replaying past metrics...
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 1 | lr=5.25E-02 | grad_norm=NAN | grad_scale=13213.696 | ce=2.455 | ppl=29.095 | duration=1320.333
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 1 | ce=0.232 | ppl=1.262 | duration=22.553
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 2 | lr=7.00E-02 | grad_norm=2.582E-01 | grad_scale=11370.496 | ce=0.112 | ppl=1.119 | duration=1320.138
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 2 | ce=0.111 | ppl=1.118 | duration=22.473
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 3 | lr=6.99E-02 | grad_norm=3.139E-01 | grad_scale=22740.992 | ce=0.109 | ppl=1.116 | duration=1319.697
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 3 | ce=0.109 | ppl=1.115 | duration=22.486
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 4 | lr=6.98E-02 | grad_norm=4.335E-01 | grad_scale=45481.984 | ce=0.106 | ppl=1.112 | duration=1320.841
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 4 | ce=0.104 | ppl=1.110 | duration=22.598
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 5 | lr=6.96E-02 | grad_norm=4.765E-01 | grad_scale=90963.968 | ce=0.102 | ppl=1.108 | duration=1321.813
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 5 | ce=0.101 | ppl=1.106 | duration=22.571
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 6 | lr=6.94E-02 | grad_norm=4.759E-01 | grad_scale=181927.936 | ce=0.100 | ppl=1.105 | duration=1321.872
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 6 | ce=0.098 | ppl=1.103 | duration=22.455
[05-07 13:05:33][flashy.solver][INFO] - Generate Summary | Epoch 6 | rtf=0.003 | duration=9.851
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 7 | lr=6.91E-02 | grad_norm=NAN | grad_scale=274071.552 | ce=0.097 | ppl=1.102 | duration=1320.015
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 7 | ce=0.096 | ppl=1.101 | duration=22.691
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 8 | lr=6.88E-02 | grad_norm=INF | grad_scale=263192.576 | ce=0.096 | ppl=1.101 | duration=1321.922
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 8 | ce=0.094 | ppl=1.099 | duration=22.690
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 9 | lr=6.84E-02 | grad_norm=INF | grad_scale=299499.520 | ce=0.095 | ppl=1.099 | duration=1322.876
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 9 | ce=0.093 | ppl=1.097 | duration=22.628
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 10 | lr=6.80E-02 | grad_norm=INF | grad_scale=267780.096 | ce=0.094 | ppl=1.098 | duration=1323.885
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 10 | ce=0.092 | ppl=1.096 | duration=22.645
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 11 | lr=6.76E-02 | grad_norm=INF | grad_scale=295174.144 | ce=0.093 | ppl=1.098 | duration=1325.181
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 11 | ce=0.091 | ppl=1.096 | duration=22.697
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 12 | lr=6.70E-02 | grad_norm=INF | grad_scale=274202.624 | ce=0.092 | ppl=1.097 | duration=1322.134
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 12 | ce=0.091 | ppl=1.095 | duration=22.641
[05-07 13:05:33][flashy.solver][INFO] - Generate Summary | Epoch 12 | rtf=0.003 | duration=6.592
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 13 | lr=6.65E-02 | grad_norm=INF | grad_scale=204406.784 | ce=0.092 | ppl=1.097 | duration=1321.559
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 13 | ce=0.091 | ppl=1.095 | duration=22.601
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 14 | lr=6.59E-02 | grad_norm=3.948E-01 | grad_scale=188809.216 | ce=0.092 | ppl=1.096 | duration=1321.441
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 14 | ce=0.090 | ppl=1.095 | duration=22.668
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 15 | lr=6.53E-02 | grad_norm=NAN | grad_scale=339476.480 | ce=0.092 | ppl=1.096 | duration=1321.579
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 15 | ce=0.090 | ppl=1.094 | duration=22.665
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 16 | lr=6.46E-02 | grad_norm=INF | grad_scale=276824.064 | ce=0.092 | ppl=1.096 | duration=1322.106
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 16 | ce=0.090 | ppl=1.094 | duration=22.598
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 17 | lr=6.38E-02 | grad_norm=NAN | grad_scale=272367.616 | ce=0.092 | ppl=1.096 | duration=1325.075
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 17 | ce=0.090 | ppl=1.094 | duration=22.590
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 18 | lr=6.31E-02 | grad_norm=3.683E-01 | grad_scale=275382.272 | ce=0.091 | ppl=1.096 | duration=1325.573
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 18 | ce=0.090 | ppl=1.094 | duration=22.630
[05-07 13:05:33][flashy.solver][INFO] - Generate Summary | Epoch 18 | rtf=0.003 | duration=6.592
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 19 | lr=6.23E-02 | grad_norm=INF | grad_scale=276037.632 | ce=0.092 | ppl=1.096 | duration=1324.645
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 19 | ce=0.089 | ppl=1.094 | duration=22.698
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 20 | lr=6.14E-02 | grad_norm=3.537E-01 | grad_scale=140902.400 | ce=0.091 | ppl=1.096 | duration=1325.616
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 20 | ce=0.089 | ppl=1.093 | duration=22.603
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 21 | lr=6.05E-02 | grad_norm=INF | grad_scale=137691.136 | ce=0.091 | ppl=1.095 | duration=1326.366
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 21 | ce=0.089 | ppl=1.093 | duration=22.571
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 22 | lr=5.96E-02 | grad_norm=INF | grad_scale=172752.896 | ce=0.091 | ppl=1.095 | duration=1326.965
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 22 | ce=0.089 | ppl=1.093 | duration=22.578
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 23 | lr=5.87E-02 | grad_norm=3.373E-01 | grad_scale=213843.968 | ce=0.090 | ppl=1.095 | duration=1326.983
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 23 | ce=0.089 | ppl=1.093 | duration=22.585
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 24 | lr=5.77E-02 | grad_norm=NAN | grad_scale=155779.072 | ce=0.090 | ppl=1.095 | duration=1326.985
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 24 | ce=0.089 | ppl=1.093 | duration=22.559
[05-07 13:05:33][flashy.solver][INFO] - Generate Summary | Epoch 24 | rtf=0.003 | duration=6.571
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 25 | lr=5.67E-02 | grad_norm=3.381E-01 | grad_scale=237436.928 | ce=0.090 | ppl=1.095 | duration=1327.496
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 25 | ce=0.089 | ppl=1.093 | duration=22.617
[05-07 13:05:33][flashy.solver][INFO] - Evaluate Summary | Epoch 25 | duration=0.001
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 26 | lr=5.56E-02 | grad_norm=NAN | grad_scale=289800.192 | ce=0.090 | ppl=1.094 | duration=1329.473
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 26 | ce=0.089 | ppl=1.093 | duration=22.614
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 27 | lr=5.45E-02 | grad_norm=3.133E-01 | grad_scale=164495.360 | ce=0.090 | ppl=1.094 | duration=1331.303
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 27 | ce=0.089 | ppl=1.093 | duration=22.612
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 28 | lr=5.34E-02 | grad_norm=INF | grad_scale=133955.584 | ce=0.090 | ppl=1.094 | duration=1323.406
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 28 | ce=0.088 | ppl=1.092 | duration=22.554
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 29 | lr=5.23E-02 | grad_norm=2.988E-01 | grad_scale=259260.416 | ce=0.090 | ppl=1.094 | duration=1321.736
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 29 | ce=0.088 | ppl=1.092 | duration=22.562
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 30 | lr=5.12E-02 | grad_norm=2.916E-01 | grad_scale=518520.832 | ce=0.090 | ppl=1.094 | duration=1321.880
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 30 | ce=0.088 | ppl=1.092 | duration=22.643
[05-07 13:05:33][flashy.solver][INFO] - Generate Summary | Epoch 30 | rtf=0.003 | duration=6.643
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 31 | lr=5.00E-02 | grad_norm=INF | grad_scale=264503.296 | ce=0.090 | ppl=1.094 | duration=1322.058
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 31 | ce=0.088 | ppl=1.092 | duration=22.587
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 32 | lr=4.88E-02 | grad_norm=INF | grad_scale=306315.264 | ce=0.090 | ppl=1.094 | duration=1322.332
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 32 | ce=0.088 | ppl=1.092 | duration=22.700
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 33 | lr=4.76E-02 | grad_norm=INF | grad_scale=289931.264 | ce=0.090 | ppl=1.094 | duration=1322.400
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 33 | ce=0.088 | ppl=1.092 | duration=22.636
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 34 | lr=4.63E-02 | grad_norm=NAN | grad_scale=380502.016 | ce=0.089 | ppl=1.093 | duration=1322.401
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 34 | ce=0.088 | ppl=1.092 | duration=22.659
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 35 | lr=4.51E-02 | grad_norm=INF | grad_scale=141230.080 | ce=0.089 | ppl=1.094 | duration=1322.428
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 35 | ce=0.088 | ppl=1.092 | duration=22.491
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 36 | lr=4.38E-02 | grad_norm=2.805E-01 | grad_scale=251985.920 | ce=0.089 | ppl=1.093 | duration=1321.824
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 36 | ce=0.088 | ppl=1.092 | duration=22.573
[05-07 13:05:33][flashy.solver][INFO] - Generate Summary | Epoch 36 | rtf=0.003 | duration=6.544
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 37 | lr=4.25E-02 | grad_norm=INF | grad_scale=300154.880 | ce=0.089 | ppl=1.093 | duration=1321.714
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 37 | ce=0.088 | ppl=1.092 | duration=22.529
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 38 | lr=4.12E-02 | grad_norm=INF | grad_scale=352845.824 | ce=0.089 | ppl=1.093 | duration=1322.031
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 38 | ce=0.088 | ppl=1.092 | duration=22.713
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 39 | lr=3.99E-02 | grad_norm=2.724E-01 | grad_scale=375259.136 | ce=0.089 | ppl=1.093 | duration=1321.859
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 39 | ce=0.087 | ppl=1.091 | duration=22.738
[05-07 13:05:33][flashy.solver][INFO] - Train Summary | Epoch 40 | lr=3.86E-02 | grad_norm=INF | grad_scale=350224.384 | ce=0.089 | ppl=1.093 | duration=1322.106
[05-07 13:05:33][flashy.solver][INFO] - Valid Summary | Epoch 40 | ce=0.087 | ppl=1.091 | duration=22.672
/root/autodl-tmp/audiocraft/audiocraft/solvers/beatmapgen.py:378: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  note_code_maps = torch.tensor(note_code_maps, device=self.device)
/root/miniconda3/envs/audiocraft/lib/python3.9/site-packages/torch/cuda/__init__.py:866: UserWarning: Synchronization debug mode is a prototype feature and does not yet detect all synchronizing operations (Triggered internally at ../torch/csrc/cuda/Module.cpp:816.)
  torch._C._cuda_set_sync_debug_mode(debug_mode)
[05-07 13:07:56][flashy.solver][INFO] - Train | Epoch 41 | 200/2000 | 1.52 it/sec | lr 3.79E-02 | grad_norm 2.771E-01 | grad_scale 262144.000 | ce 0.089 | ppl 1.093
[05-07 13:10:08][flashy.solver][INFO] - Train | Epoch 41 | 400/2000 | 1.53 it/sec | lr 3.78E-02 | grad_norm 2.728E-01 | grad_scale 262144.000 | ce 0.089 | ppl 1.094
[05-07 13:12:19][flashy.solver][INFO] - Train | Epoch 41 | 600/2000 | 1.53 it/sec | lr 3.76E-02 | grad_norm 2.866E-01 | grad_scale 262144.000 | ce 0.089 | ppl 1.093
[05-07 13:14:30][flashy.solver][INFO] - Train | Epoch 41 | 800/2000 | 1.52 it/sec | lr 3.75E-02 | grad_norm 2.606E-01 | grad_scale 431226.880 | ce 0.089 | ppl 1.093
[05-07 13:16:41][flashy.solver][INFO] - Train | Epoch 41 | 1000/2000 | 1.52 it/sec | lr 3.74E-02 | grad_norm INF | grad_scale 460062.720 | ce 0.089 | ppl 1.093
[05-07 13:18:52][flashy.solver][INFO] - Train | Epoch 41 | 1200/2000 | 1.52 it/sec | lr 3.72E-02 | grad_norm 2.809E-01 | grad_scale 262144.000 | ce 0.089 | ppl 1.094
[05-07 13:21:03][flashy.solver][INFO] - Train | Epoch 41 | 1400/2000 | 1.52 it/sec | lr 3.71E-02 | grad_norm 2.733E-01 | grad_scale 262144.000 | ce 0.089 | ppl 1.093
[05-07 13:23:14][flashy.solver][INFO] - Train | Epoch 41 | 1600/2000 | 1.52 it/sec | lr 3.70E-02 | grad_norm 2.806E-01 | grad_scale 262144.000 | ce 0.090 | ppl 1.094
[05-07 13:25:26][flashy.solver][INFO] - Train | Epoch 41 | 1800/2000 | 1.52 it/sec | lr 3.68E-02 | grad_norm 2.766E-01 | grad_scale 262144.000 | ce 0.089 | ppl 1.093
[05-07 13:27:36][flashy.solver][INFO] - Train Summary | Epoch 41 | lr=3.73E-02 | grad_norm=INF | grad_scale=298844.160 | ce=0.089 | ppl=1.093 | duration=1322.840
[05-07 13:27:47][flashy.solver][INFO] - Valid | Epoch 41 | 5/50 | 3.91 it/sec | ce 0.098 | ppl 1.103
[05-07 13:27:49][flashy.solver][INFO] - Valid | Epoch 41 | 10/50 | 3.92 it/sec | ce 0.083 | ppl 1.087
[05-07 13:27:50][flashy.solver][INFO] - Valid | Epoch 41 | 15/50 | 3.92 it/sec | ce 0.089 | ppl 1.093
[05-07 13:27:51][flashy.solver][INFO] - Valid | Epoch 41 | 20/50 | 3.92 it/sec | ce 0.092 | ppl 1.097
[05-07 13:27:52][flashy.solver][INFO] - Valid | Epoch 41 | 25/50 | 3.93 it/sec | ce 0.083 | ppl 1.087
[05-07 13:27:54][flashy.solver][INFO] - Valid | Epoch 41 | 30/50 | 3.93 it/sec | ce 0.080 | ppl 1.084
[05-07 13:27:55][flashy.solver][INFO] - Valid | Epoch 41 | 35/50 | 3.93 it/sec | ce 0.082 | ppl 1.085
[05-07 13:27:56][flashy.solver][INFO] - Valid | Epoch 41 | 40/50 | 3.93 it/sec | ce 0.098 | ppl 1.103
[05-07 13:27:58][flashy.solver][INFO] - Valid | Epoch 41 | 45/50 | 3.93 it/sec | ce 0.078 | ppl 1.082
[05-07 13:27:59][flashy.solver][INFO] - Valid Summary | Epoch 41 | ce=0.087 | ppl=1.091 | duration=22.477
[05-07 13:27:59][flashy.solver][INFO] - New best state with ce=0.087 (was 0.087)
[05-07 13:27:59][flashy.solver][INFO] - Model hash: 88c4bca9ae08e2c36a973aff03f25cc875bbb4e9
[05-07 13:27:59][audiocraft.utils.checkpoint][INFO] - Checkpoint saved to /root/autodl-tmp/audiocraft_root/xps/e6ef71eb/checkpoint.th
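
A note on the grad_scale column: it roughly doubles per 2000-step epoch at first (11370.496, 22740.992, 45481.984, ...) and later hovers around 262144, which looks like dynamic loss scaling in mixed-precision training. The toy simulation below sketches that mechanism, assuming settings like the documented defaults of PyTorch's GradScaler (growth factor 2, backoff factor 0.5, growth interval 2000). Whether this run uses those settings is an assumption, and `overflow_above` is a hypothetical threshold invented for illustration.

```python
def simulate_loss_scale(steps, init_scale=65536.0, growth_factor=2.0,
                        backoff_factor=0.5, growth_interval=2000,
                        overflow_above=262144.0):
    """Toy model of dynamic loss scaling (not Audiocraft's actual code).

    overflow_above is a hypothetical threshold: any step run with a larger
    scale is treated as producing non-finite (inf/nan) gradients.
    Returns the scale used at each step and the indices of skipped steps.
    """
    scale = init_scale
    good_steps = 0
    scales, skipped = [], []
    for step in range(steps):
        scales.append(scale)
        if scale > overflow_above:
            # Overflow: the optimizer step is skipped and the scale backs off.
            skipped.append(step)
            scale *= backoff_factor
            good_steps = 0
        else:
            good_steps += 1
            if good_steps == growth_interval:
                # Enough consecutive finite steps: try a larger scale.
                scale *= growth_factor
                good_steps = 0
    return scales, skipped


scales, skipped = simulate_loss_scale(steps=12000)
print(f"skipped steps: {skipped}")
print(f"largest scale tried: {max(scales)}")
```

In this toy model the scale keeps probing one doubling above the largest value that works, so a non-finite gradient appears about once every 2000 steps and that step is skipped rather than applied. If the run behaves similarly, occasional INF steps would be a by-product of the scaler searching for the largest usable scale. In the real log the overflow does not follow the scale increase immediately, so the fixed threshold is a simplification.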

Kerrycarry, May 07 '25 14:05