
How to resume training if interrupted

Open imran7778 opened this issue 6 years ago • 4 comments

I need help with resuming training after a system shutdown or other interruption. My training stopped due to a system shutdown, and afterwards I executed the following command: bash train_segan.sh. It starts normally and loads the checkpoints successfully, but training begins from zero rather than from the previously saved checkpoint. Please guide me on how to resume training. Thanks

imran7778 avatar Apr 05 '18 07:04 imran7778

Hi @imran7778 ,

the latest checkpoint in the dir should load successfully without any further work. Is it possible that the checkpoint is corrupt? Try modifying the 'checkpoint' text file within the directory to change the pointer to the latest-but-one file, thus telling TF to load a prior ckpt version. I'm not sure I understand, however, what you mean by "normally start and load checkpoints successfully but start training from zero": how do you know it starts from zero? (I understand you've seen the verbose [*] Load SUCCESS message.)
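As a quick check, here is a minimal sketch (not the repo's code; it assumes TensorFlow 1.x and that your save_path is segan_v1) of how to inspect which checkpoint TF will pick up from that 'checkpoint' text file:

```python
# Minimal sketch: read the 'checkpoint' pointer file the way TF does.
# Assumptions: TensorFlow 1.x and a save dir named 'segan_v1'.
import tensorflow as tf

ckpt = tf.train.get_checkpoint_state('segan_v1')
if ckpt is None:
    print('No checkpoint state found in segan_v1')
else:
    print('Pointer (model_checkpoint_path):', ckpt.model_checkpoint_path)
    print('Older checkpoints kept on disk:', list(ckpt.all_model_checkpoint_paths))
```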

Regards

santi-pdp avatar Apr 06 '18 07:04 santi-pdp

Dear @santi-pdp

Thanks for your reply. Here is a screenshot that may clarify my point.

The first time I start training, it gives me the following output:

bash train_segan.sh
2018-04-10 10:04:24.617249: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-04-10 10:04:24.617354: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
Parsed arguments: {'z_depth': 256, 'l1_remove_epoch': 150, 'batch_size': 3, 'model': 'gan', 'init_l1_weight': 10.0, 'g_learning_rate': 0.0002, 'seed': 111, 'z_dim': 256, 'save_freq': 10, 'noise_decay': 0.7, 'denoise_epoch': 5, 'synthesis_path': 'dwavegan_samples', 'd_label_smooth': 0.25, 'weights': None, 'denoise_lbound': 0.01, 'epoch': 150, 'd_learning_rate': 0.0002, 'save_path': 'segan_v1', 'beta_1': 0.5, 'init_noise_std': 0.0, 'test_wav': None, 'e2e_dataset': 'data/segan.tfrecords', 'save_clean_path': 'test_clean_results', 'canvas_size': 16384, 'g_nl': 'prelu', 'g_type': 'ae'}
Using device: /cpu:0
Creating GAN model
*** Building Generator ***
Downconv (3, 16384, 1) -> (3, 8192, 16)
Adding skip connection downconv 0
-- Enc: prelu activation --
Downconv (3, 8192, 16) -> (3, 4096, 32)
. . . .
Amount of alpha vectors: 21
Amount of skip connections: 10
Last wave shape: (3, 16384, 1)


num of G returned: 23
*** Discriminator summary ***
D block 0 input shape: (3, 16384, 2) *** downconved shape: (3, 8192, 16) *** Applying VBN *** Applying Lrelu ***
. . .
D block 10 input shape: (3, 16, 512) *** downconved shape: (3, 8, 1024) *** Applying VBN *** Applying Lrelu ***
discriminator deconved shape: (3, 8, 1024)
discriminator output shape: (3, 1)


Not clipping D weights
Initializing optimizers...
Initializing variables...
Sampling some wavs to store sample references...
sample noisy shape: (3, 16384)
sample wav shape: (3, 16384)
sample z shape: (3, 8, 1024)
total examples in TFRecords data/segan.tfrecords: 360
Batches per epoch: 120.0
[*] Reading checkpoints...
[!] Load failed
0/18000.0 (epoch 0), d_rl_loss = 1.42159, d_fk_loss = 0.02565, g_adv_loss = 5.51244, g_l1_loss = 6.08547, time/batch = 12.21935, mtime/batch = 12.21935
1/18000.0 (epoch 0), d_rl_loss = 1.40727, d_fk_loss = 10.28780, g_adv_loss = 2.06019, g_l1_loss = 5.75486, time/batch = 11.97167, mtime/batch = 12.09551
2/18000.0 (epoch 0), d_rl_loss = 5.54344, d_fk_loss = 9.00089, g_adv_loss = 5.41440, g_l1_loss = 6.22119, time/batch = 10.84464, mtime/batch = 11.67856
3/18000.0 (epoch 0), d_rl_loss = 2.56064, d_fk_loss = 0.67524, g_adv_loss = 110.04749, g_l1_loss = 5.88563, time/batch = 11.98766, mtime/batch = 11.75583
4/18000.0 (epoch 0), d_rl_loss = 43.09314, d_fk_loss = 32.41562, g_adv_loss = 18.53921, g_l1_loss = 6.13015, time/batch = 11.27476, mtime/batch = 11.65962
. . .
9/18000.0 (epoch 0), d_rl_loss = 16.02569, d_fk_loss = 12.40006, g_adv_loss = 8.71034, g_l1_loss = 5.61963, time/batch = 12.91840, mtime/batch = 11.64647
w0 max: 0.06945234537124634 min: -0.06775650382041931
w1 max: 0.051821060478687286 min: -0.04958131164312363
w2 max: 0.0637265294790268 min: -0.061875924468040466
10/18000.0 (epoch 0), d_rl_loss = 10.47512, d_fk_loss = 5.93869, g_adv_loss = 10.88952, g_l1_loss = 5.71833, time/batch = 11.29298, mtime/batch = 11.61434
11/18000.0 (epoch 0), d_rl_loss = 4.90630, d_fk_loss = 1.85100, g_adv_loss = 7.53411, g_l1_loss = 5.72742, time/batch = 11.91929, mtime/batch = 11.63975
12/18000.0 (epoch 0), d_rl_loss = 2.07515, d_fk_loss = 1.90992, g_adv_loss = 7.60952, g_l1_loss = 6.65654, time/batch = 13.04373, mtime/batch = 11.74775
13/18000.0 (epoch 0), d_rl_loss = 3.69959, d_fk_loss = 6.78575, g_adv_loss = 2.97328, g_l1_loss = 5.80335, time/batch = 11.46316, mtime/batch = 11.72742
14/18000.0 (epoch 0), d_rl_loss = 0.48384, d_fk_loss = 1.33486, g_adv_loss = 2.08532, g_l1_loss = 5.95979, time/batch = 12.65085, mtime/batch = 11.78898
. . . .

64/18000.0 (epoch 0), d_rl_loss = 0.14060, d_fk_loss = 0.06874, g_adv_loss = 0.48891, g_l1_loss = 6.13891, time/batch = 10.49836, mtime/batch = 11.53807
65/18000.0 (epoch 0), d_rl_loss = 0.12317, d_fk_loss = 0.05944, g_adv_loss = 1.03536, g_l1_loss = 4.85944, time/batch = 10.57506, mtime/batch = 11.52348
66/18000.0 (epoch 0), d_rl_loss = 0.20725, d_fk_loss = 0.19382, g_adv_loss = 1.22923, g_l1_loss = 4.45759, time/batch = 10.51389, mtime/batch = 11.50841
67/18000.0 (epoch 0), d_rl_loss = 0.06127, d_fk_loss = 0.01148, g_adv_loss = 0.97544, g_l1_loss = 4.50832, time/batch = 10.57977, mtime/batch = 11.49475
68/18000.0 (epoch 0), d_rl_loss = 0.09463, d_fk_loss = 0.06658, g_adv_loss = 0.54611, g_l1_loss = 4.63855, time/batch = 11.85356, mtime/batch = 11.49995
69/18000.0 (epoch 0), d_rl_loss = 0.49186, d_fk_loss = 0.22236, g_adv_loss = 0.57460, g_l1_loss = 3.07398, time/batch = 11.27534, mtime/batch = 11.49674
w0 max: 0.995848536491394 min: 0.0767320990562439
w1 max: 0.9888091087341309 min: 0.008043618872761726
w2 max: 0.9928516149520874 min: 0.041960734874010086
70/18000.0 (epoch 0), d_rl_loss = 0.03219, d_fk_loss = 0.09166, g_adv_loss = 0.54423, g_l1_loss = 6.05527, time/batch = 12.04599, mtime/batch = 11.50448
^C
2018-04-10 10:21:41.574188: W tensorflow/core/kernels/queue_base.cc:294] _2_device_0/input_producer: Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
    return fn(*args)
  File "/home/imran/miniconda2/envs/ten/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1021, in _run_fn
    status, run_metadata)
KeyboardInterrupt

After iteration number 70/18000 I interrupted the training myself. My save path looks like this:

[screenshot of the save_path directory contents]

and the checkpoint text file looks like this:

model_checkpoint_path: "SEGAN-70" all_model_checkpoint_paths: "SEGAN-30" all_model_checkpoint_paths: "SEGAN-40" all_model_checkpoint_paths: "SEGAN-50" all_model_checkpoint_paths: "SEGAN-60" all_model_checkpoint_paths: "SEGAN-70"

Now I have restarted the training and expected it to resume from iteration 70/18000, but it starts from iteration 0/18000, as you can see in the following output:

bash train_segan.sh
2018-04-10 11:26:09.351613: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-04-10 11:26:09.351762: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
Parsed arguments: {'batch_size': 3, 'epoch': 150, 'd_learning_rate': 0.0002, 'save_clean_path': 'test_clean_results', 'model': 'gan', 'g_type': 'ae', 'denoise_epoch': 5, 'z_dim': 256, 'beta_1': 0.5, 'd_label_smooth': 0.25, 'g_learning_rate': 0.0002, 'canvas_size': 16384, 'weights': None, 'seed': 111, 'z_depth': 256, 'save_path': 'segan_v1', 'l1_remove_epoch': 150, 'e2e_dataset': 'data/segan.tfrecords', 'test_wav': None, 'init_l1_weight': 10.0, 'denoise_lbound': 0.01, 'synthesis_path': 'dwavegan_samples', 'g_nl': 'prelu', 'save_freq': 10, 'noise_decay': 0.7, 'init_noise_std': 0.0}
Using device: /cpu:0
Creating GAN model
*** Building Generator ***
Downconv (3, 16384, 1) -> (3, 8192, 16)
. . . .
Not clipping D weights
Initializing optimizers...
Initializing variables...
Sampling some wavs to store sample references...
sample noisy shape: (3, 16384)
sample wav shape: (3, 16384)
sample z shape: (3, 8, 1024)
total examples in TFRecords data/segan.tfrecords: 360
Batches per epoch: 120.0
[*] Reading checkpoints...
[*] Read SEGAN-70
[*] Load SUCCESS
0/18000.0 (epoch 0), d_rl_loss = 0.02655, d_fk_loss = 0.27790, g_adv_loss = 1.29016, g_l1_loss = 4.26079, time/batch = 12.79920, mtime/batch = 12.79920
1/18000.0 (epoch 0), d_rl_loss = 0.08352, d_fk_loss = 0.03772, g_adv_loss = 0.44378, g_l1_loss = 5.76775, time/batch = 12.14758, mtime/batch = 12.47339
2/18000.0 (epoch 0), d_rl_loss = 0.15646, d_fk_loss = 0.02151, g_adv_loss = 1.40255, g_l1_loss = 3.38811, time/batch = 11.16680, mtime/batch = 12.03786
3/18000.0 (epoch 0), d_rl_loss = 0.04816, d_fk_loss = 0.29102, g_adv_loss = 0.99367, g_l1_loss = 6.05134, time/batch = 11.06146, mtime/batch = 11.79376
4/18000.0 (epoch 0), d_rl_loss = 0.13729, d_fk_loss = 0.17743, g_adv_loss = 1.32933, g_l1_loss = 4.26389, time/batch = 11.02163, mtime/batch = 11.63933
5/18000.0 (epoch 0), d_rl_loss = 0.19347, d_fk_loss = 0.04417, g_adv_loss = 0.68631, g_l1_loss = 4.05842, time/batch = 11.03287, mtime/batch = 11.53826
6/18000.0 (epoch 0), d_rl_loss = 0.10904, d_fk_loss = 0.00201, g_adv_loss = 1.50521, g_l1_loss = 5.40282, time/batch = 11.76548, mtime/batch = 11.57072
. . . .
28/18000.0 (epoch 0), d_rl_loss = 0.09597, d_fk_loss = 0.06830, g_adv_loss = 0.28432, g_l1_loss = 3.43519, time/batch = 18.37957, mtime/batch = 12.22184
29/18000.0 (epoch 0), d_rl_loss = 0.37213, d_fk_loss = 0.05943, g_adv_loss = 0.75423, g_l1_loss = 3.72193, time/batch = 11.14937, mtime/batch = 12.18609
w0 max: 0.5334341526031494 min: -0.27949172258377075
w1 max: 0.867225170135498 min: -0.07362376898527145
w2 max: 0.916520357131958 min: 0.20332744717597961
30/18000.0 (epoch 0), d_rl_loss = 0.17321, d_fk_loss = 0.30142, g_adv_loss = 0.30572, g_l1_loss = 3.94730, time/batch = 11.06777, mtime/batch = 12.15002
31/18000.0 (epoch 0), d_rl_loss = 0.01438, d_fk_loss = 0.06208, g_adv_loss = 0.78712, g_l1_loss = 2.60531, time/batch = 12.31825, mtime/batch = 12.15527
32/18000.0 (epoch 0), d_rl_loss = 0.12803, d_fk_loss = 0.06517, g_adv_loss = 0.78155, g_l1_loss = 4.04289, time/batch = 11.45150, mtime/batch = 12.13395
^C
2018-04-10 11:34:56.431348: W tensorflow/core/kernels/queue_base.cc:294] _2_device_0/input_producer: Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
    status, run_metadata)
KeyboardInterrupt

After restarting training, my checkpoint text file has also changed, as you can see below:

model_checkpoint_path: "SEGAN-30" all_model_checkpoint_paths: "SEGAN-10" all_model_checkpoint_paths: "SEGAN-20" all_model_checkpoint_paths: "SEGAN-30"

This is not my real training run; the actual training uses a big dataset and was stopped at iteration 76000/90000 by a system shutdown after 3 days of continuous training. I know that when I restart training it will begin again from iteration 0/90000. Please help me figure out how to resume it.
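From the behaviour above, it looks like the saver restores the weights but the training loop's counter simply restarts at 0. For what it's worth, here is a minimal sketch of the resume pattern I would expect, assuming TensorFlow 1.x, a save_path of segan_v1, and checkpoints named SEGAN-<step>; this is illustrative only, not the actual SEGAN training loop:

```python
# Sketch of resuming: restore the variables AND offset the loop counter.
# Assumptions: TensorFlow 1.x, save dir 'segan_v1', ckpts named SEGAN-<step>.
import tensorflow as tf

total_steps = 90000                       # e.g. epochs * batches per epoch
w = tf.Variable(0.0, name='dummy_param')  # stand-in for the real model variables

saver = tf.train.Saver(max_to_keep=5)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    ckpt = tf.train.get_checkpoint_state('segan_v1')
    start_step = 0
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
        start_step = int(ckpt.model_checkpoint_path.split('-')[-1])

    for step in range(start_step, total_steps):
        # ... run one training batch here ...
        if step % 10 == 0:
            saver.save(sess, 'segan_v1/SEGAN', global_step=step)
```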

Thanks

imran7778 avatar Apr 10 '18 07:04 imran7778

I am also facing the same issue. After an interruption, when I tried to retrain the model it showed "LOAD SUCCESSFUL" but started at epoch 0 / iteration 0. Please suggest any possible solution. See #46: the trained model is not loaded in the code even though it shows "LOAD SUCCESSFUL".

raikarsagar avatar May 03 '18 10:05 raikarsagar

After the final training, what are d_rl_loss, d_fk_loss, g_adv_loss, and g_l1_loss, respectively? I found that the discriminator's training loss is very small, basically about 0.0005.

fengqiyun avatar Mar 05 '19 01:03 fengqiyun