Error finetuning InstructBLIP flant5

Open evelinehong opened this issue 2 years ago • 11 comments

Hi, I'm trying to run blip2_t5_instruct.py (even though no training script is shared right now) and I encounter the following problem:

```
/gpfs/u/scratch/LMCG/LMCGnngn/LAVIS-hy/lavis/models/blip2_models/modeling_t5.py:546 in forward

   543         )  # (batch_size, n_heads, seq_length, dim_per_head)
   544
   545         # get key/value states
❱  546         key_states = project(
   547             hidden_states,
   548             self.k,
   549             key_value_states,

/gpfs/u/scratch/LMCG/LMCGnngn/LAVIS-hy/lavis/models/blip2_models/modeling_t5.py:528 in project

   525             elif past_key_value is None:
   526                 # cross-attn
   527                 # (batch_size, n_heads, seq_length, dim_per_head)
❱  528                 hidden_states = shape(proj_layer(key_value_states))
   529
   530             if past_key_value is not None:
   531                 if key_value_states is None:

/gpfs/u/scratch/LMCG/LMCGnngn/LAVIS-hy/lavis/models/blip2_models/modeling_t5.py:509 in shape

   506
   507         def shape(states):
   508             """projection"""
❱  509             return states.view(
   510                 batch_size, -1, self.n_heads, self.key_value_proj_dim
   511             ).transpose(1, 2)
   512

RuntimeError: shape '[5, -1, 32, 64]' is invalid for input of size 376832
```

The error occurs when I use batch size 4 (the model is able to predict a single word but not a whole sequence). Is there something wrong with my inputs / config? Since no example is shared here, I don't know what the correct settings should be.
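For context, here is a minimal sketch of what the failing view() call is doing, using plain PyTorch; the 376832-element size and the target shape are taken from the error message above, not from LAVIS itself:

```python
import torch

batch_size, n_heads, dim_per_head = 5, 32, 64  # values from the error message

# The cross-attention projection yields 376832 elements here, but
# 376832 / (5 * 32 * 64) = 36.8 is not an integer, so no seq_length
# can satisfy the target shape and view() raises the RuntimeError.
states = torch.randn(376832)
states.view(batch_size, -1, n_heads, dim_per_head).transpose(1, 2)
# RuntimeError: shape '[5, -1, 32, 64]' is invalid for input of size 376832
```

In other words, the error means the key/value tensor entering shape() no longer matches the batch size on the query side.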

evelinehong avatar Aug 15 '23 02:08 evelinehong

Marked.

qwqwq1445 avatar Aug 30 '23 08:08 qwqwq1445

Hello, have you solved the problem? I have encountered the same issue and hope you can share some suggestions.

Lishi905 avatar Nov 03 '23 07:11 Lishi905

I just modified the parameter --nproc_per_node to an even number and it works.

Lishi905 avatar Nov 06 '23 12:11 Lishi905

Why does this work? What is the cause of this error?

qwqwq1445 avatar Dec 19 '23 12:12 qwqwq1445

It seems the issue was that I was using 7 GPUs at first, since the eighth GPU was occupied by something else. I noticed that the invalid input size was a multiple of 7 and could not be divided evenly into the target tensor shape, so I switched to just 6 GPUs, set --nproc_per_node=6 accordingly, and it worked.
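A quick way to check this kind of mismatch (a standalone helper for illustration, not part of LAVIS) is to test whether the flattened size divides cleanly for a given batch dimension:

```python
def divides_cleanly(numel: int, batch_size: int, n_heads: int, dim_per_head: int) -> bool:
    """True iff states.view(batch_size, -1, n_heads, dim_per_head) can succeed."""
    return numel % (batch_size * n_heads * dim_per_head) == 0

# With the numbers from the original traceback: 376832 = 23 * 2**14,
# so a batch dimension of 5 (or 7) can never divide it evenly.
print(divides_cleanly(376832, 5, 32, 64))  # False -> the RuntimeError above
print(divides_cleanly(376832, 4, 32, 64))  # True  -> a valid seq_length of 46
```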

Lishi905 avatar Dec 19 '23 12:12 Lishi905

In my case, the error occurs with only one GPU. Have you ever tried it on a single GPU?

qwqwq1445 avatar Dec 19 '23 12:12 qwqwq1445

Would you mind leaving your parameter settings here? I'll give them a try soon and maybe I can find the reason :)

Lishi905 avatar Dec 19 '23 12:12 Lishi905

1 GPU, batch_train=2, img_size=490, max_len=10, min_len=1, num_beams=5. The other parameters (learning rate etc.) are the same as in the paper. I really appreciate your help, thank you very much!

qwqwq1445 avatar Dec 19 '23 12:12 qwqwq1445

Hi, I've been working on this again recently and tried your settings with Vicuna-7B InstructBLIP. Everything goes well during training. I'm not sure whether the Flan-T5 base model is the cause of the above error. Maybe you can give Vicuna a try :)

Lishi905 avatar Jan 17 '24 08:01 Lishi905

Thanks for your help. I tried Vicuna-7B too and it does work. By the way, could you please share your zero-shot VQA-Abstract inference results with Vicuna-7B BLIP-2? I'd appreciate it if you could kindly help.

qwqwq1445 avatar Jan 17 '24 08:01 qwqwq1445

Hi @Lishi905! When finetuning the VQA task of BLIP-2 T5-XXL on the COCO and VG datasets, I also encountered a similar error:

RuntimeError: shape '[10, -1, 64, 64]' is invalid for input of size 1441792

I really don't know how to solve it. When debugging, my batch_size=8 and I use 1 GPU with --nproc_per_node=1. I noticed that you solved this error by using 6 GPUs and --nproc_per_node=6. Could you please share the parameters in your config file? Is there any way to solve this by modifying the code? Looking forward to your reply, and many thanks!
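For what it's worth, running the same divisibility arithmetic on these numbers (just a diagnostic sketch, not a fix) points at a batch mismatch between the key/value states and the query states:

```python
# From the error: shape '[10, -1, 64, 64]' is invalid for input of size 1441792
numel, batch_size, n_heads, dim_per_head = 1441792, 10, 64, 64

total_tokens = numel // (n_heads * dim_per_head)  # 1441792 / 4096 = 352
print(total_tokens % batch_size)  # 2 -> 352 tokens cannot split into 10 sequences
print(total_tokens % 8)           # 0 -> but they do split into a batch of 8
```

If the key/value side really still carries the configured batch size of 8 while the query side has 10, the two batches diverge somewhere before cross-attention (for example, samples contributing multiple text inputs per image), which may be worth checking in the dataloader or in how the image features are expanded.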

dszpr avatar Jan 19 '24 11:01 dszpr