E2FGVI
Loss explosion when training on a custom dataset
Hi, this is awesome work! May I ask for some help? I ran into a problem when training the model on the REDS video dataset. After about 40K iterations of training, the loss suddenly explodes and the predicted images become unrecognizable.
PS: the loss value shown in the picture is the sum over the last 100 iterations.
To run on this dataset, I made the following modifications:
- Dataset: each video has 100 frames with frame_size=1280x720. I crop them to 256x256 and add random blur. I use 7 local frames and 5 reference frames (sampled at equal intervals from the whole video, excluding the local-frame region). My objective is deblurring, so I do not use a mask to cover the original image.
- To train, I modified the parameters of SoftSplit and the Transformer: I set output_size = (64, 64) in this line and small_window_size = (11, 11) to match the [12, 22, 22, 512] feature that comes out of SoftSplit.
- I set no_dis: 1 in the config file so that the adversarial loss and gan_loss are not used. I thought they might make training unstable, so I dropped them.
- I only have one 24 GB 4090 GPU, so I can only set batchsize=1. I did not change the scheduler, which means the learning rate stays at 1e-4 for the whole run.
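The shape arithmetic behind these settings can be checked by hand. The following is only a sketch: it assumes SoftSplit behaves like an unfold with kernel 7, stride 3 and padding 3, and that the encoder downsamples the crop by 4 (both are assumptions, not stated in this thread). Under those assumptions a 256x256 crop gives a 22x22 token grid, which an (11, 11) window divides evenly.

```python
# Sketch of the token-grid arithmetic; kernel/stride/padding of (7, 3, 3)
# and the 4x encoder downsampling are ASSUMPTIONS, not taken from this thread.

def unfold_grid(output_size, kernel_size=(7, 7), stride=(3, 3), padding=(3, 3)):
    """Return (rows, cols) of sliding-window positions (dilation 1),
    using the same formula as an unfold / im2col operation."""
    return tuple(
        (size + 2 * pad - (k - 1) - 1) // s + 1
        for size, k, s, pad in zip(output_size, kernel_size, stride, padding)
    )


def windows_per_axis(grid, window_size):
    """Number of non-overlapping windows along each axis.
    Raises ValueError if the grid is not evenly divisible by the window."""
    for g, w in zip(grid, window_size):
        if g % w != 0:
            raise ValueError(f"token grid {grid} is not divisible by window {window_size}")
    return tuple(g // w for g, w in zip(grid, window_size))


crop = (256, 256)
feat = tuple(c // 4 for c in crop)       # assumed 4x downsampling -> (64, 64)
grid = unfold_grid(feat)                 # token grid per frame
num_frames = 7 + 5                       # local + reference frames
print((num_frames, *grid, 512))          # (12, 22, 22, 512)
print(windows_per_axis(grid, (11, 11)))  # (2, 2)
```

If these assumptions hold, the modified sizes are at least self-consistent, so a shape mismatch looks like an unlikely cause of the divergence.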
The prediction at the loss-explosion iteration looks like this:
PS: in the first row, the first 7 pictures are the local frames and the last 5 are the non-local frames; the second row is the corresponding GT; the third row is the model's prediction.
Did I modify the parameters in TimeFocalTransformer incorrectly? Has anyone had a similar issue, and how did you solve it? Thanks.
Dear @LokiXun, have you solved this problem? The loss function increases at about 40k iterations.
Does the loss increase come from the DCN layer during training?
Yes, the simple solution is resuming from a non-crashed checkpoint.
stayhungry1 @.***> wrote on Tuesday, April 9, 2024 at 15:06:
Does the loss increase come from the DCN layer during training?
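The suggestion above to resume from a non-crashed checkpoint can be sketched as two small helpers: one finds the first logged iteration where the loss jumps, the other picks the latest checkpoint saved strictly before it. This is a minimal sketch, not the repository's resume logic. The checkpoint names (gen_035000.pth and so on), the loss values, and the thresholds are all made up for illustration.

```python
import math
import re
from statistics import median


def first_divergence(log, factor=5.0, history=5):
    """log: ordered list of (iteration, loss) pairs.
    Return the first iteration whose loss is non-finite or exceeds
    `factor` times the median of the previous `history` entries, else None."""
    for i, (iteration, loss) in enumerate(log):
        if not math.isfinite(loss):
            return iteration
        prev = [value for _, value in log[max(0, i - history):i]]
        if len(prev) == history and loss > factor * median(prev):
            return iteration
    return None


def last_safe_checkpoint(names, diverged_at):
    """Return the checkpoint name with the largest iteration number that is
    strictly below `diverged_at`, or None if there is no such checkpoint."""
    best = None
    for name in names:
        match = re.search(r"(\d+)", name)
        if match is None:
            continue  # skip names without an iteration number
        iteration = int(match.group(1))
        if iteration < diverged_at and (best is None or iteration > best[0]):
            best = (iteration, name)
    return None if best is None else best[1]


# Illustrative (made-up) loss log and hypothetical checkpoint file names.
log = [(39500, 12.1), (39600, 11.8), (39700, 12.4),
       (39800, 11.9), (39900, 12.0), (40000, 310.5)]
checkpoints = ["gen_030000.pth", "gen_035000.pth", "gen_040000.pth"]

diverged_at = first_divergence(log)
print(diverged_at)                                      # 40000
print(last_safe_checkpoint(checkpoints, diverged_at))   # gen_035000.pth
```

Resuming from the selected checkpoint only restarts training before the spike; it does not address whatever caused it.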