[Question] Fine-Tuning LLaVA v1.5-7B LoRA on a Custom Dataset and RuntimeError in Model Evaluation
Question
Hello LLaVA Team,
I've been working on fine-tuning the LLaVA v1.5-7B model on a custom dataset using the provided finetune_task_lora.sh script. Here is the configuration I used:
bash scripts/v1_5/finetune_task_lora.sh
--lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
--deepspeed ./scripts/zero3_offload.json \
--model_name_or_path liuhaotian/llava-v1.5-7b \
--version v1 \
--data_path /workspace/Dataset/train.json \
--image_folder ./ \
--vision_tower openai/clip-vit-large-patch14-336 \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
--output_dir ./checkpoints/llava-v1.5-7b-task-lora \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
After training, these were the results:
{'train_runtime': 25078.2556, 'train_samples_per_second': 1.595, 'train_steps_per_second': 0.1, 'train_loss': 0.16062020410320182, 'epoch': 1.0}
When attempting to evaluate the model using model_vqa.py, I encountered a runtime error. The model loads correctly, but during evaluation I receive a RuntimeError: probability tensor contains either `inf`, `nan` or element < 0.
python llava/eval/model_vqa.py --model-path checkpoints/llava-v1.5-7b-task-lora/ --model-base checkpoints/llava-v1.5-7b/ --question-file Dataset/eval_ques.jsonl --image-folder ./ --answers-file /workspace/Dataset/eval_answer.jsonl
Here's the traceback:
Loading LLaVA from base model...
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.14s/it]
Loading additional LLaVA weights...
Loading LoRA weights...
Merging LoRA weights...
Model is loaded...
0%| | 0/2108 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/workspace/llava/eval/model_vqa.py", line 125, in <module>
eval_model(args)
File "/workspace/llava/eval/model_vqa.py", line 66, in eval_model
output_ids = model.generate(
File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py", line 1588, in generate
return self.sample(
File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py", line 2678, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
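For reference, this error is raised by torch.multinomial itself whenever the probability tensor handed to it contains `inf`, `nan`, or negative entries, which is exactly what happens when the logits coming out of the model are NaN. A minimal standalone illustration (plain PyTorch, not LLaVA code):

```python
import torch

# During sampling, generate() applies softmax to the logits and then draws
# the next token with torch.multinomial. If the logits are NaN, the
# resulting probabilities are NaN too and multinomial raises the same error.
logits = torch.full((1, 32000), float("nan"))
probs = torch.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
# RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```

So the sampling call is only where the problem surfaces; the NaNs are produced earlier in the forward pass.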
It seems that the model's hidden-state outputs are all NaN.
BaseModelOutputWithPast(last_hidden_state=tensor([[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]]], device='cuda:0',
dtype=torch.float16), past_key_values=((tensor([[[[-1.7842, 0.6445, 0.9375, ..., -1.8057, 1.9531, -2.0898],
[-0.0104, -0.5850, 0.0938, ..., 0.6738, -0.7227, 0.6948],
[ 1.1318, 3.6055, -0.5405, ..., -2.7910, 4.2617, -2.5684],
...,
[ 0.8984, 3.0488, 0.4187, ..., -2.5312, 3.5684, -2.3633],
[ 0.6392, 1.5225, -0.4216, ..., -0.7524, 1.3262, -0.6392],
[-0.0755, 0.3811, 0.0687, ..., 0.4375, -0.8335, 0.4001]],
[[-0.4683, 1.1523, 0.1193, ..., -0.9561, 1.3672, -0.9609],
[ 0.0293, 0.4148, -0.1000, ..., -0.3264, -0.1302, -0.1819],
[ 2.3184, -2.0352, 1.4316, ..., -1.8232, 2.5410, -1.8711],
...,
[ 1.9541, -1.8486, -1.5303, ..., -1.3906, 2.0664, -1.4053],
[ 0.7222, -0.9507, -0.5615, ..., -0.4160, 0.9175, -0.4568],
[ 0.4136, -1.2432, 0.7637, ..., 1.0225, -0.6924, 1.0254]],
[[-0.9062, -2.0078, -0.9204, ..., -0.7695, -0.4683, -0.3967],
[-0.0662, 0.4402, 0.1777, ..., 1.8486, 1.9346, 1.8145],
[ 1.2881, 0.0416, 0.6675, ..., -2.3965, -2.6621, -2.6934],
...,
[ 0.1807, -1.1299, -0.6143, ..., -2.4707, -2.8301, -2.9219],
[-1.0068, 0.0765, -0.5195, ..., -1.5000, -1.7148, -1.7324],
[ 0.2966, -0.6782, 0.0466, ..., 1.0801, 1.2441, 1.1377]],
...,
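To narrow down where the NaNs first appear, one option is to register forward hooks on every submodule and flag the first non-finite output. This is a generic PyTorch sketch rather than anything LLaVA-specific; `model` here is the merged model and the arguments are whatever inputs you would normally pass to it:

```python
import torch

def report_first_nan(model, *args, **kwargs):
    """Run one forward pass and print the first submodule whose output is non-finite.
    Hooks fire in execution order (children before parents), so the first hit is
    the earliest module that produces NaN/Inf."""
    found = []

    def make_hook(name):
        def hook(module, inputs, output):
            if found:
                return
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and t.is_floating_point() and not torch.isfinite(t).all():
                    found.append(name)
                    print(f"First non-finite output from: {name} ({type(module).__name__})")
                    break
        return hook

    handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
    try:
        with torch.no_grad():
            model(*args, **kwargs)
    finally:
        for h in handles:
            h.remove()
    if not found:
        print("No NaN/Inf produced in this forward pass.")
```

It is also worth checking whether the merged weights themselves already contain NaNs, e.g. any(torch.isnan(p).any() for p in model.parameters()); if they do, the problem is in the checkpoint or the merge rather than in the generation settings.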
Could you help me understand what might be causing this issue and how to resolve it? Thank you very much!
@rorubyy I have run the same scripts as you, but I've encountered new issues.
How should I fix the maximum model sequence length, and how can I fix the problem with uploading many images? Here is my test JSON:
@rorubyy What size is your custom dataset? I'm curious about its performance with smaller datasets.
Hi @rorubyy,
Were you able to figure out why the hidden states are NaN? I'm facing the same issue.
Hello @rorubyy @chanangad
I am also facing the same issue. Does anyone have a solution or any ideas on how to fix it?
I encountered the same issue while running model_vqa.py with a fine-tuned 7B model.
I used to have the same issue, and I figured out it was because I was using Hugging Face's "llava-hf/llava-1.5-7b-hf" as the base model. I switched the base to "liuhaotian/llava-v1.5-7b" and it resolved the NaN issue. Plus, the training performance got much better.
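One way to catch this kind of base-model mismatch early: if your LoRA output directory contains an adapter_config.json (PEFT writes one when the adapter is saved), its base_model_name_or_path field records which base checkpoint the adapter was trained against, and it should match what you pass as --model-base. A small sketch, with the path as an example:

```python
import json
from pathlib import Path

# Example path: replace with your own --model-path (the LoRA checkpoint dir).
lora_dir = Path("checkpoints/llava-v1.5-7b-task-lora")
cfg = json.loads((lora_dir / "adapter_config.json").read_text())

# Should match the --model-base passed to model_vqa.py,
# e.g. liuhaotian/llava-v1.5-7b rather than llava-hf/llava-1.5-7b-hf.
print("Adapter was trained on base model:", cfg.get("base_model_name_or_path"))
```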