LAVIS
COCO finetuning of the FlanT5 model: loss converges too fast, possibly due to bfloat16
2023-03-06 00:57:00,090 [INFO] Start training epoch 0, 8855 iters per inner epoch.
Train: data epoch: [0] [ 0/8855] eta: 17:04:42 lr: 0.000000 loss: 0.7002 time: 6.9432 data: 0.0000 max mem: 15734
2023-03-06 00:57:07,036 [INFO] Reducer buckets have been rebuilt in this iteration.
Train: data epoch: [0] [ 50/8855] eta: 2:10:40 lr: 0.000001 loss: 0.6102 time: 0.7650 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 100/8855] eta: 2:00:50 lr: 0.000001 loss: 0.2055 time: 0.7601 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 150/8855] eta: 1:57:10 lr: 0.000002 loss: 0.0094 time: 0.7694 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 200/8855] eta: 1:55:06 lr: 0.000002 loss: 0.0025 time: 0.7658 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 250/8855] eta: 1:53:40 lr: 0.000003 loss: 0.0015 time: 0.7724 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 300/8855] eta: 1:52:37 lr: 0.000003 loss: 0.0007 time: 0.7819 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 350/8855] eta: 1:51:40 lr: 0.000004 loss: 0.0008 time: 0.7748 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 400/8855] eta: 1:50:35 lr: 0.000004 loss: 0.0003 time: 0.7605 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 450/8855] eta: 1:49:31 lr: 0.000005 loss: 0.0027 time: 0.7604 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 500/8855] eta: 1:48:33 lr: 0.000005 loss: 0.0005 time: 0.7578 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 550/8855] eta: 1:47:44 lr: 0.000006 loss: 0.0001 time: 0.7699 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 600/8855] eta: 1:46:54 lr: 0.000006 loss: 0.0002 time: 0.7597 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 650/8855] eta: 1:46:06 lr: 0.000007 loss: 0.0002 time: 0.7680 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 700/8855] eta: 1:45:19 lr: 0.000007 loss: 0.0002 time: 0.7581 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 750/8855] eta: 1:44:32 lr: 0.000008 loss: 0.0001 time: 0.7628 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 800/8855] eta: 1:43:47 lr: 0.000008 loss: 0.0001 time: 0.7611 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 850/8855] eta: 1:43:02 lr: 0.000009 loss: 0.0001 time: 0.7590 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 900/8855] eta: 1:42:24 lr: 0.000009 loss: 0.0003 time: 0.7755 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 950/8855] eta: 1:41:43 lr: 0.000010 loss: 0.0003 time: 0.7670 data: 0.0000 max mem: 17659
Train: data epoch: [0] [1000/8855] eta: 1:41:01 lr: 0.000010 loss: 0.0001 time: 0.7659 data: 0.0000 max mem: 17659
Train: data epoch: [0] [1050/8855] eta: 1:40:23 lr: 0.000010 loss: 0.0002 time: 0.7618 data: 0.0000 max mem: 17659
Train: data epoch: [0] [1100/8855] eta: 1:39:40 lr: 0.000010 loss: 0.0007 time: 0.7641 data: 0.0000 max mem: 17659
Since training the OPT model and inference with FlanT5 both work smoothly, I suspect this behavior comes from using bfloat16 in the training code. However, the LLM in BLIP-2 is frozen, so as I understand it this should not affect the training script. Also, the output on the COCO test set scores less than 0.11 CIDEr. In my opinion, the learning rate and batch size are not the main problem.
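For context, "frozen" here means the T5 weights receive no gradients and are kept in bfloat16, so only the Q-Former and the projection layer are trained. A minimal sketch of that setup (using a generic Hugging Face FlanT5 checkpoint as an assumption, not the exact LAVIS code):

# Minimal sketch, assuming the usual BLIP-2 FlanT5 setup (not the exact LAVIS code).
import torch
from transformers import T5ForConditionalGeneration

t5_model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

# Freeze the LLM and keep its weights in bfloat16; only the Q-Former and the
# projection layer would remain trainable (and stay in fp32).
for param in t5_model.parameters():
    param.requires_grad = False
    param.data = param.data.to(torch.bfloat16)

My own forward pass wraps the tokenization in a bfloat16 autocast context: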
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    # Note: both input_tokens and output_tokens are built from the same text here.
    input_tokens = self.t5_tokenizer(
        samples["text_input"] if modification else samples["answer"],
        padding="longest",
        truncation=True,
        max_length=self.max_txt_len,
        return_tensors="pt",
    ).to(image.device)
    output_tokens = self.t5_tokenizer(
        samples["text_input"] if modification else samples["answer"],
        padding="longest",
        truncation=True,
        max_length=self.max_txt_len,
        return_tensors="pt",
    ).to(image.device)
Is there any advice on this? My setup is 4 x A6000 GPUs.
There were errors in both the original code and my custom code. It seems to work now.
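The likely culprit in the snippet above is that input_tokens and output_tokens are built from the same text, so the decoder target equals the encoder input and the loss collapses almost immediately. A sketch of the intended separation for a T5-style loss (assuming samples["text_input"] holds the prompt and samples["answer"] the target; key names may differ from the actual LAVIS code):

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    # Encoder side: the prompt / question.
    input_tokens = self.t5_tokenizer(
        samples["text_input"],
        padding="longest",
        truncation=True,
        max_length=self.max_txt_len,
        return_tensors="pt",
    ).to(image.device)

    # Decoder side: the text the model should generate.
    output_tokens = self.t5_tokenizer(
        samples["answer"],
        padding="longest",
        truncation=True,
        max_length=self.max_txt_len,
        return_tensors="pt",
    ).to(image.device)

    # Mask padding positions out of the loss.
    targets = output_tokens.input_ids.masked_fill(
        output_tokens.input_ids == self.t5_tokenizer.pad_token_id, -100
    )

The retrained run below shows the loss decreasing at a much more plausible rate.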
Train: data epoch: [0] [ 0/8855] eta: 14:06:07 lr: 0.000000 loss: 2.7854 time: 5.7332 data: 0.0000 max mem: 14759
2023-03-06 07:42:17,881 [INFO] Reducer buckets have been rebuilt in this iteration.
Train: data epoch: [0] [ 50/8855] eta: 2:05:07 lr: 0.000001 loss: 2.6419 time: 0.7409 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 100/8855] eta: 1:56:42 lr: 0.000001 loss: 2.3730 time: 0.7486 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 150/8855] eta: 1:53:39 lr: 0.000002 loss: 2.0816 time: 0.7492 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 200/8855] eta: 1:52:11 lr: 0.000002 loss: 1.8658 time: 0.7724 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 250/8855] eta: 1:51:03 lr: 0.000003 loss: 2.1898 time: 0.7700 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 300/8855] eta: 1:50:04 lr: 0.000003 loss: 1.6943 time: 0.7716 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 350/8855] eta: 1:49:12 lr: 0.000004 loss: 1.6969 time: 0.7561 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 400/8855] eta: 1:48:20 lr: 0.000004 loss: 1.6329 time: 0.7511 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 450/8855] eta: 1:47:30 lr: 0.000005 loss: 1.5707 time: 0.7461 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 500/8855] eta: 1:46:42 lr: 0.000005 loss: 1.6463 time: 0.7647 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 550/8855] eta: 1:45:48 lr: 0.000006 loss: 1.6603 time: 0.7478 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 600/8855] eta: 1:44:58 lr: 0.000006 loss: 1.5813 time: 0.7474 data: 0.0000 max mem: 16397
@SangbumChoi What was the issue with the original code? Could you elaborate? We are keen to take a look and possibly fix it.
@dxli94 I will share or PR the code once the training process works properly! (might be tomorrow?)
Hi @SangbumChoi, may I know which precision you use for the ViT? When I used "fp16", I got the error: Attempting to unscale FP16 gradients. Thanks!
@ilovecv I didn't use fp16 for finetuning on COCO. The authors and other GitHub issues suggest using bfloat16 or float32 instead.
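For context, "Attempting to unscale FP16 gradients" is raised by torch.cuda.amp.GradScaler when the gradients it tries to unscale are themselves fp16, which typically happens when the model parameters have been cast to half precision. With bfloat16 autocast, the trainable parameters stay in fp32 and no GradScaler is needed at all. A minimal sketch (the linear layer is just a stand-in for the trainable Q-Former/projection parameters, not the LAVIS training loop):

import torch
import torch.nn as nn

# Stand-in for the trainable parameters; they stay in fp32.
model = nn.Linear(768, 768).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

x = torch.randn(8, 768, device="cuda")
target = torch.randn(8, 768, device="cuda")

# bfloat16 autocast: activations run in bf16, parameters and gradients stay fp32,
# so no GradScaler (and no "unscale FP16 gradients" error) is involved.
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), target)

loss.backward()
optimizer.step()
optimizer.zero_grad()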
@dxli94 The CIDEr score is 121 at epoch 3. Does that seem OK to you? I don't think so, since the original paper reports 144.5 at epoch 5. (However, this is a custom model with parameter-efficient tuning, such as Adapters, applied to the Q-Former.)
- Also, I made the mistake of not excluding the prompt tokens from the loss in the training stage :( (see the masking sketch after the results below)
{"val": {"Bleu_1": 0.6923466000695367, "Bleu_2": 0.5463521311000465, "Bleu_3": 0.41042605285547745, "Bleu_4": 0.3020887975006222, "METEOR": 0.29250720864393315, "ROUGE_L": 0.5434937501409649, "CIDEr": 1.1303005509394353, "SPICE": 0.21982413568233614}}
{"val": {"Bleu_1": 0.7003870597792214, "Bleu_2": 0.5601019133516009, "Bleu_3": 0.4275028649956982, "Bleu_4": 0.3203032466533605, "METEOR": 0.2980155794880764, "ROUGE_L": 0.5541685319243635, "CIDEr": 1.1913805189882731, "SPICE": 0.2251951713503432}}
{"val": {"Bleu_1": 0.7063794637183775, "Bleu_2": 0.5663198030217094, "Bleu_3": 0.4336383248623231, "Bleu_4": 0.32570599539663775, "METEOR": 0.3028365303567551, "ROUGE_L": 0.5586304700025574, "CIDEr": 1.2094081702240678, "SPICE": 0.2305197545585627}}