Inability to reproduce BLIP2 VQAv2 finetune results
I tried to reproduce the finetuning results of BLIP-2 FlanT5-XL on VQAv2, but my results are far from those in the paper: the best accuracy I reached is 76.58%, while the paper reports 81.55%. I want to figure out what is wrong with my code.
I modified the forward code according to this, and I also added the Instruct implementation (a rough sketch of that change is included after the config below). My YAML configuration is as follows:
model:
  arch: blip2_t5
  model_type: pretrain_flant5xl
  load_pretrained: True
  pretrained: '/share/datasets/blip2_pretrained_flant5xl.pth'
  vit_model: eva_clip_g

  # vit encoder
  image_size: 400
  drop_path_rate: 0
  use_grad_checkpoint: False
  vit_precision: "fp32"
  freeze_vit: False

  # Q-Former
  num_query_token: 32

datasets:
  coco_vqa:
    vis_processor:
      train:
        name: "blip_image_train"
        image_size: 400
      eval:
        name: "blip_image_eval"
        image_size: 400
      test:
        name: "blip_image_eval"
        image_size: 400
    text_processor:
      train:
        name: "blip_question"
      eval:
        name: "blip_question"
      test:
        name: "blip_question"
  vg_vqa: # name of the dataset builder
    vis_processor:
      train:
        name: "blip_image_train"
        image_size: 400
    text_processor:
      train:
        name: "blip_question"

run:
  task: vqa

  # optimizer
  lr_sched: "linear_warmup_cosine_lr"
  init_lr: 1e-5
  min_lr: 0
  warmup_steps: 1000
  warmup_lr: 1e-8
  weight_decay: 0.05
  max_epoch: 5
  batch_size_train: 8
  batch_size_eval: 32
  num_workers: 4
  accum_grad_iters: 1
  lr_layer_decay: 0.95 # layer-wise learning rate decay for the ViT

  max_len: 10
  min_len: 1
  num_beams: 5
  inference_method: "generate"
  prompt: "Question: {} Short answer:"

  seed: 42
  output_dir: "output/BLIP2_A100/flanT5_VQA"

  amp: True
  resume_ckpt_path: null

  evaluate: False
  train_splits: ["train"]
  valid_splits: ["val"]
  test_splits: ["val"]

  device: "cuda"
  world_size: 1
  dist_url: "env://"
  distributed: True
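For reference, the Instruct-style change I mentioned above is roughly the following: the VQA question is tokenized with the Q-Former's BERT tokenizer and fed into the Q-Former together with the learned query tokens, so the queries can condition on the instruction. This is only a simplified sketch assuming the LAVIS `blip2_t5` internals (`self.tokenizer`, `self.Qformer`, `self.query_tokens`, `self.t5_proj`); the exact padding and masking in my code may differ.

```python
# Simplified sketch of the Instruct-style modification (assumes LAVIS blip2_t5
# internals: self.tokenizer, self.Qformer, self.query_tokens, self.t5_proj).
import torch

def encode_with_instruction(self, image_embeds, text_input):
    image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(
        image_embeds.device
    )
    query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1)
    query_atts = torch.ones(query_tokens.size()[:-1], dtype=torch.long).to(
        image_embeds.device
    )

    # tokenize the question (instruction) with the Q-Former's BERT tokenizer
    text_tokens = self.tokenizer(
        text_input,
        padding="longest",
        truncation=True,
        max_length=32,
        return_tensors="pt",
    ).to(image_embeds.device)

    # the Q-Former attends over the learned queries plus the instruction tokens,
    # while cross-attending to the frozen ViT features
    attention_mask = torch.cat([query_atts, text_tokens.attention_mask], dim=1)
    query_output = self.Qformer.bert(
        text_tokens.input_ids,
        attention_mask=attention_mask,
        query_embeds=query_tokens,
        encoder_hidden_states=image_embeds,
        encoder_attention_mask=image_atts,
        return_dict=True,
    )

    # only the query positions are projected and passed on to the T5 encoder
    return self.t5_proj(
        query_output.last_hidden_state[:, : query_tokens.size(1), :]
    )
```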
I really appreciate your great work. Could you help me see where the problem is?
I also finetuned the model and carefully implemented all the details from the paper, but only got 76.80. I had to reduce the image size to 224 px due to computational costs, but I would still expect a better result even at that resolution.
Could you kindly upload the finetuned T5 + ViT-G model somewhere?
Thank you for all your valuable contributions to the field.