Video-LLaMA

inf value occurs during forwarding process when fine-tuning VL branch with LLAVA-150K+MiniGPT4-3.5K+webvid-instruct

Open • xuboshen opened this issue 6 months ago • 1 comment

Great work! I've run into a problem and hope someone has an idea.

When I fine-tune only the VL branch with LLaMA-2 on image/video instruction data, inf values occur, and torch.max(hidden_states) and torch.min(hidden_states) keep growing in magnitude.

Several attempts have been made:

  • [x] I have already checked the issue list.
  • [x] I have consulted the Hugging Face forum and searched Google.

Preparations:

My platform: 8×A6000 48G. The environment is set up exactly following the environment.yml in this repository. The data is prepared following LLaVA (COCO), WebVid-10M and MiniGPT-4. The 7B LLaMA-2 pretrained weights are from this repo as well.

The demo runs correctly on the remote platform, and the training process seems correct. I did not modify any code.

Problem

I found that some data samples produce 'inf' values at the last layer of LLaMA-2, that is, at decoder layer index 31 in the loop over LLaMA-2's decoder layers. The error does not occur immediately. Instead, torch.max(hidden_states) grows larger and larger, while torch.min(hidden_states) grows more and more negative.

[Image: -inf values in hidden_states during training]
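
For anyone trying to reproduce this, the per-layer check described above (watching torch.max and torch.min of hidden_states until inf appears) could be sketched with forward hooks roughly as follows. This is only a minimal sketch, not code from this repository: `attach_range_hooks`, `first_non_finite` and the toy `Scale` module are hypothetical names made up for illustration, and the toy layers simply multiply their input so that float32 overflows at a known layer.

```python
import torch
import torch.nn as nn


def attach_range_hooks(layers, log):
    """Register forward hooks that record min/max and finiteness of each layer's output.

    `layers` is any iterable of nn.Module (hypothetical helper, not part of the repo).
    """
    handles = []
    for idx, layer in enumerate(layers):
        def hook(module, inputs, output, idx=idx):
            # Some decoder layers return a tuple whose first item is the hidden states.
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden.detach().float()
            log.append({
                "layer": idx,
                "max": hidden.max().item(),
                "min": hidden.min().item(),
                "finite": bool(torch.isfinite(hidden).all()),
            })
        handles.append(layer.register_forward_hook(hook))
    return handles


def first_non_finite(log):
    """Return the index of the first layer whose output had inf/NaN, or None."""
    for record in log:
        if not record["finite"]:
            return record["layer"]
    return None


class Scale(nn.Module):
    """Toy stand-in for a decoder layer: multiplies its input by a constant."""

    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    def forward(self, x):
        return x * self.factor


# Toy run: float32 overflows once the running product exceeds about 3.4e38.
toy_layers = nn.ModuleList([Scale(1e20) for _ in range(4)])
log = []
handles = attach_range_hooks(toy_layers, log)

x = torch.ones(1, 2, 8)
for layer in toy_layers:
    x = layer(x)

for handle in handles:
    handle.remove()  # always detach hooks after debugging

for record in log:
    print(record)
print("first non-finite layer:", first_non_finite(log))
```

On a real model, the same helper would be pointed at whatever ModuleList holds the decoder blocks. In Hugging Face's LlamaForCausalLM that is typically `model.model.layers`; how that object is reached from the training wrapper in this repo is an assumption to verify. Logging the batch index alongside each record would show which samples trigger the growth.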

Do you or anyone have any ideas on why this problem occurs, and how to solve it? I appreciate anyone's time and help in advance.

xuboshen • Jan 04 '24 12:01

I tried setting batchsize=1 and training proceeds as expected, while batchsize=4 produces inf values and training fails.

Could anyone explain this phenomenon?

xuboshen • Jan 04 '24 16:01