LAVIS
COCO finetuning of the FlanT5 model: loss converges too fast, possibly due to bfloat16
2023-03-06 00:57:00,090 [INFO] Start training epoch 0, 8855 iters per inner epoch.
Train: data epoch: [0] [ 0/8855] eta: 17:04:42 lr: 0.000000 loss: 0.7002 time: 6.9432 data: 0.0000 max mem: 15734
2023-03-06 00:57:07,036 [INFO] Reducer buckets have been rebuilt in this iteration.
Train: data epoch: [0] [ 50/8855] eta: 2:10:40 lr: 0.000001 loss: 0.6102 time: 0.7650 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 100/8855] eta: 2:00:50 lr: 0.000001 loss: 0.2055 time: 0.7601 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 150/8855] eta: 1:57:10 lr: 0.000002 loss: 0.0094 time: 0.7694 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 200/8855] eta: 1:55:06 lr: 0.000002 loss: 0.0025 time: 0.7658 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 250/8855] eta: 1:53:40 lr: 0.000003 loss: 0.0015 time: 0.7724 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 300/8855] eta: 1:52:37 lr: 0.000003 loss: 0.0007 time: 0.7819 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 350/8855] eta: 1:51:40 lr: 0.000004 loss: 0.0008 time: 0.7748 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 400/8855] eta: 1:50:35 lr: 0.000004 loss: 0.0003 time: 0.7605 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 450/8855] eta: 1:49:31 lr: 0.000005 loss: 0.0027 time: 0.7604 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 500/8855] eta: 1:48:33 lr: 0.000005 loss: 0.0005 time: 0.7578 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 550/8855] eta: 1:47:44 lr: 0.000006 loss: 0.0001 time: 0.7699 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 600/8855] eta: 1:46:54 lr: 0.000006 loss: 0.0002 time: 0.7597 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 650/8855] eta: 1:46:06 lr: 0.000007 loss: 0.0002 time: 0.7680 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 700/8855] eta: 1:45:19 lr: 0.000007 loss: 0.0002 time: 0.7581 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 750/8855] eta: 1:44:32 lr: 0.000008 loss: 0.0001 time: 0.7628 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 800/8855] eta: 1:43:47 lr: 0.000008 loss: 0.0001 time: 0.7611 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 850/8855] eta: 1:43:02 lr: 0.000009 loss: 0.0001 time: 0.7590 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 900/8855] eta: 1:42:24 lr: 0.000009 loss: 0.0003 time: 0.7755 data: 0.0000 max mem: 17659
Train: data epoch: [0] [ 950/8855] eta: 1:41:43 lr: 0.000010 loss: 0.0003 time: 0.7670 data: 0.0000 max mem: 17659
Train: data epoch: [0] [1000/8855] eta: 1:41:01 lr: 0.000010 loss: 0.0001 time: 0.7659 data: 0.0000 max mem: 17659
Train: data epoch: [0] [1050/8855] eta: 1:40:23 lr: 0.000010 loss: 0.0002 time: 0.7618 data: 0.0000 max mem: 17659
Train: data epoch: [0] [1100/8855] eta: 1:39:40 lr: 0.000010 loss: 0.0007 time: 0.7641 data: 0.0000 max mem: 17659
Since training the OPT model and inference with FlanT5 both work smoothly, I suspect this behavior comes from using bfloat16 in the training code. However, the LLM in BLIP-2 is frozen, so as I understand it this should not affect the training script. Also, the output on the COCO test set scores less than 0.11 CIDEr. In my opinion, the learning rate and batch size are not the main problem.
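For context, "frozen" here means the T5 weights receive no gradients and are kept in bfloat16, so only the Q-Former and the projection layer are trained. A minimal sketch of that setup (using a generic Hugging Face FlanT5 checkpoint as an assumption, not the exact LAVIS code):

# Minimal sketch, assuming the usual BLIP-2 FlanT5 setup (not the exact LAVIS code).
import torch
from transformers import T5ForConditionalGeneration

t5_model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

# Freeze the LLM and keep its weights in bfloat16; only the Q-Former and the
# projection layer would remain trainable (and stay in fp32).
for param in t5_model.parameters():
    param.requires_grad = False
    param.data = param.data.to(torch.bfloat16)

My own forward pass wraps the tokenization in a bfloat16 autocast context: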
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    # Note: both input_tokens and output_tokens are built from the same text here.
    input_tokens = self.t5_tokenizer(
        samples["text_input"] if modification else samples["answer"],
        padding="longest",
        truncation=True,
        max_length=self.max_txt_len,
        return_tensors="pt",
    ).to(image.device)
    output_tokens = self.t5_tokenizer(
        samples["text_input"] if modification else samples["answer"],
        padding="longest",
        truncation=True,
        max_length=self.max_txt_len,
        return_tensors="pt",
    ).to(image.device)
Is there any advice on this? My setup is 4 x A6000 GPUs.
There were errors in both the original code and my custom code. It seems to work now.
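The likely culprit in the snippet above is that input_tokens and output_tokens are built from the same text, so the decoder target equals the encoder input and the loss collapses almost immediately. A sketch of the intended separation for a T5-style loss (assuming samples["text_input"] holds the prompt and samples["answer"] the target; key names may differ from the actual LAVIS code):

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    # Encoder side: the prompt / question.
    input_tokens = self.t5_tokenizer(
        samples["text_input"],
        padding="longest",
        truncation=True,
        max_length=self.max_txt_len,
        return_tensors="pt",
    ).to(image.device)

    # Decoder side: the text the model should generate.
    output_tokens = self.t5_tokenizer(
        samples["answer"],
        padding="longest",
        truncation=True,
        max_length=self.max_txt_len,
        return_tensors="pt",
    ).to(image.device)

    # Mask padding positions out of the loss.
    targets = output_tokens.input_ids.masked_fill(
        output_tokens.input_ids == self.t5_tokenizer.pad_token_id, -100
    )

The retrained run below shows the loss decreasing at a much more plausible rate.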
Train: data epoch: [0] [ 0/8855] eta: 14:06:07 lr: 0.000000 loss: 2.7854 time: 5.7332 data: 0.0000 max mem: 14759
2023-03-06 07:42:17,881 [INFO] Reducer buckets have been rebuilt in this iteration.
Train: data epoch: [0] [ 50/8855] eta: 2:05:07 lr: 0.000001 loss: 2.6419 time: 0.7409 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 100/8855] eta: 1:56:42 lr: 0.000001 loss: 2.3730 time: 0.7486 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 150/8855] eta: 1:53:39 lr: 0.000002 loss: 2.0816 time: 0.7492 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 200/8855] eta: 1:52:11 lr: 0.000002 loss: 1.8658 time: 0.7724 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 250/8855] eta: 1:51:03 lr: 0.000003 loss: 2.1898 time: 0.7700 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 300/8855] eta: 1:50:04 lr: 0.000003 loss: 1.6943 time: 0.7716 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 350/8855] eta: 1:49:12 lr: 0.000004 loss: 1.6969 time: 0.7561 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 400/8855] eta: 1:48:20 lr: 0.000004 loss: 1.6329 time: 0.7511 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 450/8855] eta: 1:47:30 lr: 0.000005 loss: 1.5707 time: 0.7461 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 500/8855] eta: 1:46:42 lr: 0.000005 loss: 1.6463 time: 0.7647 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 550/8855] eta: 1:45:48 lr: 0.000006 loss: 1.6603 time: 0.7478 data: 0.0000 max mem: 16397
Train: data epoch: [0] [ 600/8855] eta: 1:44:58 lr: 0.000006 loss: 1.5813 time: 0.7474 data: 0.0000 max mem: 16397
@SangbumChoi What was the issue with the original code? Could you elaborate? We are keen to take a look and possibly fix it.
@dxli94 I will share or PR the code once the training process works properly! (might be tomorrow?)
Hi @SangbumChoi, may I know which precision you use for the ViT? When I used "fp16", I got the error: Attempting to unscale FP16 gradients. Thanks!
@ilovecv I didn't use fp16 for finetuning on COCO. The authors and other GitHub issues suggest using bfloat16 or float32 instead.
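For context, "Attempting to unscale FP16 gradients" is raised by torch.cuda.amp.GradScaler when the gradients it tries to unscale are themselves fp16, which typically happens when the model parameters have been cast to half precision. With bfloat16 autocast, the trainable parameters stay in fp32 and no GradScaler is needed at all. A minimal sketch (the linear layer is just a stand-in for the trainable Q-Former/projection parameters, not the LAVIS training loop):

import torch
import torch.nn as nn

# Stand-in for the trainable parameters; they stay in fp32.
model = nn.Linear(768, 768).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

x = torch.randn(8, 768, device="cuda")
target = torch.randn(8, 768, device="cuda")

# bfloat16 autocast: activations run in bf16, parameters and gradients stay fp32,
# so no GradScaler (and no "unscale FP16 gradients" error) is involved.
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), target)

loss.backward()
optimizer.step()
optimizer.zero_grad()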
@dxli94 The CIDEr score is 121 at epoch 3. Does that seem OK to you? I don't think so, since the original paper reports 144.5 at epoch 5. (However, this is a custom model with parameter-efficient tuning, such as Adapters, applied to the Q-Former.)
- Also, I made the mistake of not excluding the prompt tokens from the loss in the training stage :( (see the masking sketch after the results below)
{"val": {"Bleu_1": 0.6923466000695367, "Bleu_2": 0.5463521311000465, "Bleu_3": 0.41042605285547745, "Bleu_4": 0.3020887975006222, "METEOR": 0.29250720864393315, "ROUGE_L": 0.5434937501409649, "CIDEr": 1.1303005509394353, "SPICE": 0.21982413568233614}}
{"val": {"Bleu_1": 0.7003870597792214, "Bleu_2": 0.5601019133516009, "Bleu_3": 0.4275028649956982, "Bleu_4": 0.3203032466533605, "METEOR": 0.2980155794880764, "ROUGE_L": 0.5541685319243635, "CIDEr": 1.1913805189882731, "SPICE": 0.2251951713503432}}
{"val": {"Bleu_1": 0.7063794637183775, "Bleu_2": 0.5663198030217094, "Bleu_3": 0.4336383248623231, "Bleu_4": 0.32570599539663775, "METEOR": 0.3028365303567551, "ROUGE_L": 0.5586304700025574, "CIDEr": 1.2094081702240678, "SPICE": 0.2305197545585627}}