Training Loss explodes when training on NTU-Dataset 120

shubhMaheshwari opened this issue 3 years ago · 8 comments

Hey @DegardinBruno

Great work. Thanks for sharing your code!

While training on NTU-120, the generator loss is exploding. The only change we made was increasing the batch size from 32 to 380.

[Epoch 297/1200] [Batch 162/165] [D loss: 0.198222] [G loss: 4918.882324]
[Epoch 297/1200] [Batch 163/165] [D loss: 0.205873] [G loss: 4918.882324]
[Epoch 297/1200] [Batch 164/165] [D loss: 0.223392] [G loss: 4918.882324]

Do you know why this could be happening?

shubhMaheshwari avatar Jan 14 '22 14:01 shubhMaheshwari

Hello @shubhMaheshwari! Thanks!

GANs are very sensitive to hyperparameters, even the batch size! Since we are using a Wasserstein loss with gradient penalty, large batch sizes can affect training!

Can you try a smaller batch size like 32, 64 or 128, and then come back to us with your results? P.S. For better gradients, remember to always use a power of 2 as the batch size (2, 4, 8, 16, 32, 64, ...), which has already proven itself with GAN architectures.
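For context, here is a minimal sketch of the gradient-penalty term that a WGAN-GP setup like this relies on (a PyTorch illustration, not the repo's exact code; the critic signature and tensor shapes are assumptions):

import torch

def gradient_penalty(critic, real, fake, device="cpu"):
    # Random per-sample interpolation weights, broadcastable over all remaining dims.
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=device)
    interpolates = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_interpolates = critic(interpolates)
    # Gradient of the critic score w.r.t. the interpolated inputs.
    grads = torch.autograd.grad(
        outputs=d_interpolates,
        inputs=interpolates,
        grad_outputs=torch.ones_like(d_interpolates),
        create_graph=True,
        retain_graph=True,
    )[0]
    grads = grads.view(grads.size(0), -1)
    # Penalise deviation of the gradient norm from 1 (soft 1-Lipschitz constraint).
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Critic loss:    -E[D(real)] + E[D(fake)] + lambda_gp * gradient_penalty(...)
# Generator loss: -E[D(fake)]

The penalty keeps the critic approximately 1-Lipschitz, which is why very large batches (and other hyperparameter changes) can destabilise the balance between critic and generator.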

DegardinBruno avatar Jan 14 '22 16:01 DegardinBruno

Hey @DegardinBruno, we tried again with a batch size of 32, but the generator loss is still exploding.

[Epoch 648/1200] [Batch 1959/1969] [D loss: -0.036133] [G loss: -240803.156250]
[Epoch 648/1200] [Batch 1960/1969] [D loss: 0.047026] [G loss: -439268.718750]
[Epoch 648/1200] [Batch 1961/1969] [D loss: 0.007517] [G loss: -439268.718750]
[Epoch 648/1200] [Batch 1962/1969] [D loss: -0.058064] [G loss: -439268.718750]
[Epoch 648/1200] [Batch 1963/1969] [D loss: 0.203749] [G loss: -439268.718750]
[Epoch 648/1200] [Batch 1964/1969] [D loss: -0.175149] [G loss: -439268.718750]
[Epoch 648/1200] [Batch 1965/1969] [D loss: 0.327590] [G loss: -526808.625000]
[Epoch 648/1200] [Batch 1966/1969] [D loss: -0.348924] [G loss: -526808.625000]
[Epoch 648/1200] [Batch 1967/1969] [D loss: -0.372265] [G loss: -526808.625000]
[Epoch 648/1200] [Batch 1968/1969] [D loss: -0.245794] [G loss: -526808.625000]

We only made the following changes to the code:

-parser.add_argument("--n_classes", type=int, default=60, help="number of classes for datas
+parser.add_argument("--n_classes", type=int, default=120, help="number of classes for data
-parser.add_argument("--checkpoint_interval", type=int, default=10000, help="interval betwe
+parser.add_argument("--checkpoint_interval", type=int, default=500, help="interval between
-parser.add_argument("--data_path", type=str, default="/media/degar/Data/PhD/Kinetic-GAN/Br
-parser.add_argument("--label_path", type=str, default="/media/degar/Data/PhD/Kinetic-GAN/B
+parser.add_argument("--data_path", type=str, default="/ssd_scratch/cvit/sai.shashank/data/
+parser.add_argument("--label_path", type=str, default="/ssd_scratch/cvit/sai.shashank/data

shubhMaheshwari avatar Jan 17 '22 09:01 shubhMaheshwari

@shubhMaheshwari I will repeat the experiments with the information you provided, I will come back to you with an answer. Just some questions:

  • Which benchmark are you using, cross-setup or cross-subject?
  • Which mapping-network depth did you define in the generator? If you are using the default (4), increase it to at least 8; the default is for NTU-60, and NTU-120 has many more distinct subjects in the training data (check the paper for details). See the example right after this list.
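For reference, a minimal sketch of how the mapping-network depth can be raised via the --mlp_dim flag (the flag name is taken from the full command further down this thread; the remaining arguments follow the README example):

python kinetic-gan.py --mlp_dim 8 --n_classes 120 --dataset ntu --data_path path_train_data.npy --label_path path_train_labels.pkl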

Btw, Kinetic-GAN's loss on NTU-120 for the cross-setup benchmark should behave similarly to the plot below, though values may vary due to random initialization.

[loss plot]

DegardinBruno avatar Jan 17 '22 11:01 DegardinBruno

  1. We are using cross-subject.
  2. We are using a mapping-network depth of 4.

Can you provide a single command to train on NTU-120, similar to the one provided in the README?

python kinetic-gan.py  --data_path path_train_data.npy  --label_path path_train_labels.pkl  --dataset ntu_or_h36m  # check kinetic-gan.py file

Thanks Shubh

shubhMaheshwari avatar Jan 17 '22 12:01 shubhMaheshwari

Just one small thing: can you show me your loss evolution? Just run the command below; it will save a PDF plot in the respective exp folder:

python visualization/plot_loss.py --batches 1970 --runs kinetic-gan --exp -1

Can you provide a single command to train on NTU-120, similar to the one provided in the README?

Here is the entire command that I am running:

python kinetic-gan.py --b1 0.5 --b2 0.999 --batch_size 32 --channels 3 --checkpoint_interval 10000 --data_path /home/degardin/DATASETS/st-gcn/NTU-120/xsub/train_data.npy  --dataset ntu --label_path /home/degardin/DATASETS/st-gcn/NTU-120/xsub/train_label.pkl --lambda_gp 10 --latent_dim 512 --lr 0.0002 --mlp_dim 8 --n_classes 120 --n_cpu 8 --n_critic 5 --n_epochs 1200 --sample_interval 5000 --t_size 64 --v_size 25

DegardinBruno avatar Jan 17 '22 14:01 DegardinBruno

@shubhMaheshwari This is my loss at the moment. It is normal for it to be high at the beginning; the model rapidly learns to generate the human structure before learning to synthesise human motion:

[loss plot]

DegardinBruno avatar Jan 18 '22 13:01 DegardinBruno

[loss plot]

@DegardinBruno This is the loss curve we are getting. We didn't make any changes to the code.

shubhMaheshwari avatar Jan 26 '22 14:01 shubhMaheshwari

@shubhMaheshwari, did you download the data from our server? I repeated the experiments a second time and nothing seems out of the ordinary. Also, which torch version are you using?

Try NTU-60 xsub (the code base's default settings) to see if the same thing happens, or even fewer classes like 5 or 10 (the feeder is ready for that as well); see the sketch below for one way to build such a subset.
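If it helps, here is a rough offline sketch of how one could carve out a few-class subset before training. This assumes the ST-GCN-style train_data.npy / train_label.pkl layout implied by the paths above; the pickle being a (sample_names, labels) pair is an assumption, not the repo's feeder API:

import pickle
import numpy as np

N_KEEP = 5  # keep only the first 5 action classes for a quick sanity run

data = np.load("train_data.npy", mmap_mode="r")    # skeleton tensor, shape (N, C, T, V, M)
with open("train_label.pkl", "rb") as f:
    sample_names, labels = pickle.load(f)          # assumed (names, labels) pair
labels = np.asarray(labels)

mask = labels < N_KEEP                             # boolean selector over samples
np.save("train_data_subset.npy", np.asarray(data[mask]))
with open("train_label_subset.pkl", "wb") as f:
    names_subset = [n for n, keep in zip(sample_names, mask) if keep]
    pickle.dump((names_subset, labels[mask].tolist()), f)

Then point --data_path / --label_path at the subset files and set --n_classes accordingly.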

Feel free to reach out to me.

DegardinBruno avatar Jan 28 '22 03:01 DegardinBruno