
Training loss first decrease then increase

Learningm opened this issue 3 months ago • 4 comments

Hi, thanks for this excellent work.

I am trying to finetune VGGT on my custom dataset, and I have three questions.

  1. I found that the training loss first decreases and then increases. Is that normal?

The loss curve in TensorBoard does not make the model's convergence status very clear.

Image Image
  2. With the confidence setting, the batch loss Loss/train_loss_objective can be negative. How can I tell whether training has converged well enough?

I ran inference with my finetuned checkpoint on the example data, but performance seems to degrade even though the loss decreases.

Image
  3. The GPU memory usage seems to keep increasing (in another experiment) and tends to cause OOM on an A100. How can I solve it?
INFO 2025-09-12 18:15:41,327 general.py: 117: Train Epoch: [0][    270/1000000] | Batch Time: 7.1827 (8.5635) | Data Time: 0.0378 (0.0897) | Mem (GB): 75.0000 (72.4391) | Time Elapsed: 00d 00h 39m | Loss/train_loss_objective: -0.5543 (0.0862) | Loss/train_loss_camera: 0.0073 (0.0619) | Loss/train_loss_T: 0.0033 (0.0345) | Loss/train_loss_R: 0.0011 (0.0128) | Loss/train_loss_FL: 0.0058 (0.0291) | Loss/train_loss_conf_depth: -0.5935 (-0.0643) | Loss/train_loss_reg_depth: 0.0075 (0.0834) | Loss/train_loss_grad_depth: 0.0026 (0.0142) | Grad/aggregator: 24.5968 (48.9545) | Grad/depth: 27.7247 (15.4777) | Grad/camera: 0.5437 (0.4361) | Grad/point: 113.9784 (111.4994)
INFO 2025-09-12 18:15:50,553 general.py: 117: Train Epoch: [0][    271/1000000] | Batch Time: 9.2262 (8.5659) | Data Time: 0.0345 (0.0895) | Mem (GB): 75.0000 (72.4485) | Time Elapsed: 00d 00h 39m | Loss/train_loss_objective: -0.0701 (0.0857) | Loss/train_loss_camera: 0.0496 (0.0618) | Loss/train_loss_T: 0.0314 (0.0345) | Loss/train_loss_R: 0.0024 (0.0128) | Loss/train_loss_FL: 0.0317 (0.0291) | Loss/train_loss_conf_depth: -0.0360 (-0.0646) | Loss/train_loss_reg_depth: 0.1406 (0.0834) | Loss/train_loss_grad_depth: 0.0058 (0.0142) | Grad/aggregator: 111.3372 (49.1838) | Grad/depth: 26.9166 (15.5197) | Grad/camera: 0.4601 (0.4362) | Grad/point: 249.7874 (112.0078)
INFO 2025-09-12 18:16:01,955 general.py: 117: Train Epoch: [0][    272/1000000] | Batch Time: 11.4019 (8.5763) | Data Time: 0.0344 (0.0893) | Mem (GB): 75.0000 (72.4579) | Time Elapsed: 00d 00h 39m | Loss/train_loss_objective: -0.4451 (0.0849) | Loss/train_loss_camera: 0.0118 (0.0617) | Loss/train_loss_T: 0.0063 (0.0344) | Loss/train_loss_R: 0.0013 (0.0128) | Loss/train_loss_FL: 0.0084 (0.0291) | Loss/train_loss_conf_depth: -0.4438 (-0.0652) | Loss/train_loss_reg_depth: 0.0141 (0.0832) | Loss/train_loss_grad_depth: 0.0034 (0.0142) | Grad/aggregator: 46.0385 (49.1723) | Grad/depth: 56.3996 (15.6695) | Grad/camera: 0.5516 (0.4366) | Grad/point: 353.5652 (112.8926)

Thanks for any suggestions!

Learningm avatar Sep 12 '25 10:09 Learningm

  1. The loss curve largely depends on your training set. That said, I would suggest visualizing with a higher smoothing value in TensorBoard.
  2. Yes, the loss should be negative; this is expected. Ideally the total loss converges to a value around -0.5.
  3. Your log shows the memory is steady at 75 GB, right?
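For context on why the total objective can go below zero: confidence-aware regression losses of the DUSt3R family typically take the form conf · err − α · log(conf), so once errors get small and predicted confidences get large, the −log(conf) regularizer dominates and the loss turns negative. The exact formulation in VGGT may differ; this is a minimal sketch of that general form, with all names hypothetical:

```python
import math

def conf_weighted_loss(errors, confidences, alpha=1.0):
    """Confidence-weighted regression loss: mean of conf * err - alpha * log(conf).

    With small errors and confidences above 1, the -log(conf) term dominates
    and the total loss becomes negative, which is expected behavior.
    """
    assert len(errors) == len(confidences)
    total = 0.0
    for err, conf in zip(errors, confidences):
        total += conf * err - alpha * math.log(conf)
    return total / len(errors)

# Early in training: large errors, neutral confidence -> positive loss.
early = conf_weighted_loss(errors=[0.9, 1.1, 0.8], confidences=[1.0, 1.0, 1.0])

# Late in training: small errors, high confidence -> negative loss.
late = conf_weighted_loss(errors=[0.01, 0.02, 0.015], confidences=[5.0, 4.0, 6.0])
```

Because the attainable minimum depends on the error and confidence distributions of your dataset, comparing the raw loss value across datasets is less informative than tracking the non-confidence terms (e.g. the camera R/T/FL losses) alongside held-out metrics.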

jytime avatar Sep 15 '25 12:09 jytime

I also hit the same issue: the loss decreases, but inference gives me degraded results.

SamiraJahangiri avatar Sep 17 '25 03:09 SamiraJahangiri

> 1. The loss curve largely depends on your training set. That said, I would suggest visualizing with a higher smoothing value in TensorBoard.
> 2. Yes, the loss should be negative; this is expected. Ideally the total loss converges to a value around -0.5.
> 3. Your log shows the memory is steady at 75 GB, right?

@jytime Thanks for the reply. After tuning some hyperparameters and spending more time on the experiment, the training and validation losses now decrease and converge slowly without OOM.

I want to ask another question: my camera prediction losses (R, T, and the overall camera loss) seem low in the validation step, but the predicted cameras are not accurate on my test sample.

The sample below shows four input views of the same object (front / back / left / right). As you can see, the visualized camera positions are not correct. How can I improve the camera accuracy?
For preprocessing, I remove the background from each input image, keeping only the central object, which matches my object-level dataset setting.

Thanks for any suggestions!

Image

Learningm avatar Sep 28 '25 15:09 Learningm

> 1. The loss curve largely depends on your training set. That said, I would suggest visualizing with a higher smoothing value in TensorBoard.
> 2. Yes, the loss should be negative; this is expected. Ideally the total loss converges to a value around -0.5.
> 3. Your log shows the memory is steady at 75 GB, right?

Hello, I would like to ask: should I set the smoothing value very high, like 0.99 or 0.999, to take a look?
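For reference, TensorBoard's smoothing slider applies (to my understanding) a de-biased exponential moving average to the scalar series, so a high weight like 0.99 averages over roughly the last few hundred points. A small self-contained sketch of that assumed behavior on a noisy, slowly decreasing loss:

```python
def ema_smooth(values, weight=0.99):
    """Exponential moving average with de-bias correction, similar in
    spirit to TensorBoard's scalar-smoothing slider (assumed behavior)."""
    smoothed, last = [], 0.0
    for step, v in enumerate(values, start=1):
        last = last * weight + (1.0 - weight) * v
        smoothed.append(last / (1.0 - weight ** step))  # de-bias early steps
    return smoothed

# A zigzag of +/-0.3 hides a slow downward trend in the raw values;
# heavy smoothing (0.99) cancels the zigzag and reveals the trend.
raw = [1.0 - 0.001 * i + (0.3 if i % 2 == 0 else -0.3) for i in range(1000)]
trend = ema_smooth(raw, weight=0.99)
```

Values like 0.99 (or even 0.999 for very noisy per-batch losses) are reasonable for judging the overall trend, at the cost of the smoothed curve lagging behind recent changes.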

Shexiaox avatar Oct 16 '25 02:10 Shexiaox