
Training loss divergence

suhyeok-jang opened this issue 9 months ago · 1 comment

Hi, thank you for the great work!

I’m fine-tuning SVD from the provided checkpoint, simply changing the data and using the script you provided. Initially, the loss decreases well, but at a certain point (about 3k steps, with a fixed learning rate of 1e-5) it suddenly starts to diverge. Have you ever experienced this issue during training? Could the focal loss be the cause?

```shell
cd gcd-model/
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \
    --base=configs/train_kubric_max90.yaml \
    --name=kb_v1 --seed=1234 --num_nodes=1 --wandb=0 \
    model.base_learning_rate=2e-5 \
    model.params.optimizer_config.params.foreach=False \
    data.params.dset_root=/path/to/Kubric-4D/data \
    data.params.pcl_root=/path/to/Kubric-4D/pcl \
    data.params.frame_width=384 \
    data.params.frame_height=256 \
    data.params.trajectory=interpol_linear \
    data.params.move_time=13 \
    data.params.camera_control=spherical \
    data.params.batch_size=4 \
    data.params.num_workers=4 \
    data.params.data_gpu=0 \
    lightning.callbacks.image_logger.params.batch_frequency=50 \
    lightning.trainer.devices="1,2,3,4,5,6,7"
```

— suhyeok-jang, Mar 18 '25

Thanks for your interest! I think you are on the right track -- in the default config, the default value of focus_steps is 5000, at which point the loss fully upweights the top 10% most wrong pixels, so at 3000 steps and beyond it could indeed be playing a big role. If you still think the behavior is abnormal, could you perhaps share a screenshot of the loss curve? Also, I noticed you left the default parameter dset_root=/path/to/Kubric-4D/data, so I assume you already substituted this with your own dataloader elsewhere? Hope this helps!
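To illustrate why this kind of loss can spike mid-training, here is a minimal numpy sketch of a top-k pixel reweighting ("focal"-style) loss. The names (focus_steps, top_frac, boost) and the behavior are assumptions inferred from the reply above, not the repository's actual loss code:

```python
import numpy as np

def topk_pixel_loss(pred, target, step, focus_steps=5000, top_frac=0.10, boost=5.0):
    """Per-pixel squared error; once step >= focus_steps, upweight the
    hardest top_frac fraction of pixels by `boost`. Illustrative sketch
    only -- parameter names are hypothetical, based on the reply, and
    do not mirror the repository's implementation."""
    err = (pred - target) ** 2                # per-pixel squared error
    if step < focus_steps:
        return err.mean()                     # plain MSE before the focus kicks in
    k = max(1, int(top_frac * err.size))      # number of "hardest" pixels
    flat = np.sort(err.ravel())[::-1]         # errors sorted high to low
    hard, easy = flat[:k], flat[k:]
    # Upweighting the hardest pixels means a few bad regions can dominate
    # the gradient, which is one way the loss can jump once this activates.
    return (boost * hard.sum() + easy.sum()) / (boost * k + easy.size)
```

Because the weighted average always places extra mass on the largest errors, the reported loss value (and the gradient magnitude) increases once the focus is active, even on identical predictions, so a jump around focus_steps is not by itself evidence of divergence.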

— basilevh, Mar 19 '25