Training loss is not converging when training ViT-L on a large dataset
I am training ViT-L on 4 A6000 GPUs with a batch size of 256 per GPU. Gradient accumulation is left at the default value of 1, and the learning rate is 0.001. With these settings, after a few epochs the loss fluctuates between 7 and 8 and does not converge any further. Here is my log. Please help me if you can spot the problem. Any help is much appreciated, thanks!
Training: 2024-02-28 11:08:30,501-rank_id: 0
Training: 2024-02-28 11:09:46,063-: margin_list [1.0, 0.0, 0.4]
Training: 2024-02-28 11:09:46,063-: network vit_l_dp005_mask_005
Training: 2024-02-28 11:09:46,063-: resume True
Training: 2024-02-28 11:09:46,063-: save_all_states True
Training: 2024-02-28 11:09:46,063-: output
Training: 2024-02-28 11:09:46,063-: embedding_size 512
Training: 2024-02-28 11:09:46,063-: sample_rate 0.3
Training: 2024-02-28 11:09:46,063-: interclass_filtering_threshold 0
Training: 2024-02-28 11:09:46,063-: fp16 True
Training: 2024-02-28 11:09:46,064-: batch_size 1024
Training: 2024-02-28 11:09:46,064-: optimizer adamw
Training: 2024-02-28 11:09:46,064-: lr 0.001
Training: 2024-02-28 11:09:46,064-: momentum 0.9
Training: 2024-02-28 11:09:46,064-: weight_decay 0.1
Training: 2024-02-28 11:09:46,064-: verbose 3000
Training: 2024-02-28 11:09:46,064-: frequent 10
Training: 2024-02-28 11:09:46,064-: dali False
Training: 2024-02-28 11:09:46,064-: dali_aug False
Training: 2024-02-28 11:09:46,064-: gradient_acc 1
Training: 2024-02-28 11:09:46,064-: seed 2048
Training: 2024-02-28 11:09:46,064-: num_workers 2
Training: 2024-02-28 11:09:46,064-: wandb_key XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Training: 2024-02-28 11:09:46,064-: suffix_run_name None
Training: 2024-02-28 11:09:46,064-: using_wandb False
Training: 2024-02-28 11:09:46,064-: wandb_entity entity
Training: 2024-02-28 11:09:46,064-: wandb_project project
Training: 2024-02-28 11:09:46,064-: wandb_log_all True
Training: 2024-02-28 11:09:46,064-: save_artifacts False
Training: 2024-02-28 11:09:46,064-: wandb_resume False
Training: 2024-02-28 11:09:46,064-: rec
Training: 2024-02-28 11:09:46,064-: num_classes
Training: 2024-02-28 11:09:46,064-: num_image
Training: 2024-02-28 11:09:46,064-: num_epoch 40
Training: 2024-02-28 11:09:46,064-: warmup_epoch 4
Training: 2024-02-28 11:09:46,064-: val_targets ['lfw', 'agedb_30']
Training: 2024-02-28 11:09:46,064-: total_batch_size 4096
Training: 2024-02-28 11:09:46,064-: warmup_step 64044
Training: 2024-02-28 11:09:46,064-: total_step 640440
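For reference, here is a minimal sketch of how the derived values in this log appear to relate to the configured ones, assuming an insightface arcface_torch-style setup (which this log format resembles), where batch_size is the per-GPU batch size. num_image is redacted above, so it is left as an argument rather than filled in:

```python
def derived_steps(num_image: int,
                  batch_size: int = 1024,   # per-GPU batch size from the log
                  world_size: int = 4,      # number of GPUs
                  warmup_epoch: int = 4,
                  num_epoch: int = 40):
    """Reproduce the derived values printed in the log (assumed derivation)."""
    total_batch_size = batch_size * world_size      # 4096, matches the log
    steps_per_epoch = num_image // total_batch_size
    warmup_step = steps_per_epoch * warmup_epoch    # logged as 64044
    total_step = steps_per_epoch * num_epoch        # logged as 640440
    return total_batch_size, warmup_step, total_step
```

The logged numbers are self-consistent under this derivation: 640440 / 40 = 16011 steps per epoch, and 16011 * 4 = 64044 warmup steps.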
@Suvi-dha Hi, I'm wondering if you managed to solve this problem and whether you would be willing to share your trained model?
Hi, you can experiment with gradient accumulation (I increased it to 6). I also changed the sample rate along the way to observe how the loss convergence responded. You may also wait for a few more epochs to see if the loss drops.
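For concreteness, this is roughly how those two settings would be overridden in an arcface_torch-style config file; the file name, import, and values below are illustrative assumptions, not a verified recipe:

```python
# Illustrative config overrides (e.g. configs/my_vit_l.py), assuming an
# arcface_torch-style EasyDict config. Values are examples only.
from easydict import EasyDict as edict

config = edict()
config.network = "vit_l_dp005_mask_005"
config.optimizer = "adamw"
config.lr = 0.001

# Effective batch size becomes batch_size * world_size * gradient_acc.
config.batch_size = 1024      # per-GPU batch size, as in the log above
config.gradient_acc = 6       # increased from the default of 1

# Fraction of class centers sampled by partial FC; adjust and watch how
# the loss curve responds.
config.sample_rate = 0.3
```

Increasing gradient_acc raises the effective batch size without increasing per-GPU memory, which can stabilize the loss at a given learning rate.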