
[Usage] LoRA fine-tuned weights provided for vicuna-13b-v1.3 give NaN/inf error when performing inference on COCO-2014 questions after merging LoRA weights

Open DefUs3r opened this issue 1 year ago • 8 comments

Describe the issue

Issue:

We are trying to perform inference on the LoRA weights provided for vicuna-13b-v1.3 here. As mentioned by @haotian-liu in issue #245, we perform the merging step on the LoRA weights using the following command:

python merge_lora_weights.py \
    --model-path hf_checkpoints/llava-v1-0719-336px-lora-vicuna-13b-v1.3 \
    --model-base LLaVA/checkpoints/fastchat_llama-vicuna-v1-3-13b \
    --save-model-path hf_checkpoints/llava-v1-0719-336px-lora-vicuna-13b-v1.3-MERGE
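
For context, the core of such a merge, sketched with plain PEFT, looks roughly like the code below. This is not the actual merge_lora_weights.py, which goes through LLaVA's own model loader and also restores the non-LoRA trainables (e.g. the mm_projector); it only illustrates the LoRA-folding step.

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Minimal sketch of the LoRA-folding step only: load the base model,
# attach the adapter, fold it in, and save the merged weights.
base = AutoModelForCausalLM.from_pretrained(
    "LLaVA/checkpoints/fastchat_llama-vicuna-v1-3-13b",
    torch_dtype=torch.float16,
)
lora = PeftModel.from_pretrained(
    base, "hf_checkpoints/llava-v1-0719-336px-lora-vicuna-13b-v1.3"
)
merged = lora.merge_and_unload()  # folds scaling * (B @ A) into each base weight
merged.save_pretrained(
    "hf_checkpoints/llava-v1-0719-336px-lora-vicuna-13b-v1.3-MERGE"
)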

After this, we perform inference on the 90 COCO-2014 samples mentioned in the paper using:

python -m llava.eval.model_vqa \
    --model-path hf_checkpoints/llava-v1-0719-336px-lora-vicuna-13b-v1.3-MERGE \
    --question-file \
    LLaVA/playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
    --image-folder \
    LLaVA/coco/coco_dataset/val2014 \
    --answers-file \
    LLaVA/model_inference_testing/coco/coco_val2014_answers-HF-vicuna-v1-3-13b-prompt-v1-test-merge.jsonl
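
Before launching the full 90-question run, a cheap sanity check is to scan the merged checkpoint for non-finite weights. A minimal sketch in plain PyTorch, assuming the merge saved standard pytorch_model-*.bin shards:

import glob
import torch

# Report every tensor in the merged checkpoint that contains NaN or inf.
merge_dir = "hf_checkpoints/llava-v1-0719-336px-lora-vicuna-13b-v1.3-MERGE"
for shard in sorted(glob.glob(f"{merge_dir}/pytorch_model-*.bin")):
    state_dict = torch.load(shard, map_location="cpu")
    for name, tensor in state_dict.items():
        if tensor.is_floating_point() and not torch.isfinite(tensor).all():
            print(f"non-finite values in {name} ({shard})")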

This inference produces the following error log:

  0%|                                                                                                                                         | 0/90 [00:00<?, ?it/s]/home/anaconda3/envs/llavacuda6/lib/python3.10/site-packages/transformers/generation/utils.py:1270: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation )
  warnings.warn(
  0%|                                                                                                                                         | 0/90 [00:33<?, ?it/s]
Traceback (most recent call last):
  File "/home/anaconda3/envs/llavacuda6/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/anaconda3/envs/llavacuda6/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/workspace/cgy/LLAVA/LLaVA/llava/eval/model_vqa.py", line 112, in <module>
    eval_model(args)
  File "/home/workspace/cgy/LLAVA/LLaVA/llava/eval/model_vqa.py", line 66, in eval_model
    output_ids = model.generate(
  File "/home/anaconda3/envs/llavacuda6/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/anaconda3/envs/llavacuda6/lib/python3.10/site-packages/transformers/generation/utils.py", line 1588, in generate
    return self.sample(
  File "/home/anaconda3/envs/llavacuda6/lib/python3.10/site-packages/transformers/generation/utils.py", line 2678, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
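
For context, this error is raised by torch.multinomial itself, which rejects any probability row containing NaN, inf, or a negative entry, so the logits are already corrupted before sampling. The check is easy to reproduce in isolation:

import torch

# torch.multinomial refuses distributions with NaN/inf/negative entries --
# this is exactly the error in the log above.
probs = torch.tensor([[0.5, float("nan"), 0.5]])
torch.multinomial(probs, num_samples=1)
# RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Passing do_sample=False to generate() sidesteps torch.multinomial, but if the merged weights produce NaN logits the greedy output will still be meaningless, so that only masks the symptom.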

The command we use to generate the model-base passed to merge_lora_weights.py is as follows:

python -m fastchat.model.apply_delta \
    --base huggyllama/llama-13b \
    --target checkpoints/fastchat_llama-vicuna-v1-3-13b \
    --delta lmsys/vicuna-13b-v1.3
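
To rule out the recovered base model itself, a quick text-only smoke test can confirm that the delta application produced finite logits (a sketch, reusing the checkpoint path from the command above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# If the delta-applied Vicuna base already emits NaN logits, the problem
# predates the LoRA merge entirely.
path = "checkpoints/fastchat_llama-vicuna-v1-3-13b"
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.float16, device_map="auto"
)
inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
print("logits finite:", torch.isfinite(logits).all().item())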

Interestingly, the same evaluation procedure, when run on the provided LoRA-merged weights, returns:

all : 76.3
complex : 90.0
conv : 75.4
detail : 63.4

implying that either merge_lora_weights.py has an issue, the provided LoRA weights have an issue, or the model-base is faulty.
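
If the saved weights scan clean, the NaN may appear only at run time (for example, an fp16 overflow in a single layer). One way to localize that is to register forward hooks that flag any module emitting non-finite activations; a sketch in plain PyTorch, where model is assumed to be the already-loaded merged checkpoint:

import torch

# Hooks that report any module whose output contains NaN/inf during a
# forward pass; `model` is assumed to be the loaded merged checkpoint.
def make_hook(name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        if torch.is_tensor(out) and not torch.isfinite(out).all():
            print(f"non-finite output from module: {name}")
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if n]
# ... run model.generate(...) as in model_vqa.py, then detach the hooks:
for h in handles:
    h.remove()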

Kindly suggest a fix for whatever is causing this error.

DefUs3r avatar Aug 30 '23 18:08 DefUs3r

I got the same error following the preview LoRA inference steps (link). Screenshot: 2023-09-13 09 32 28

wanghao-cst avatar Sep 13 '23 01:09 wanghao-cst

I also got the same error when running inference with my own fine-tuned model.

Cubism-star avatar Sep 15 '23 01:09 Cubism-star

> [the original issue, quoted in full]

Hi, have you fixed the issue?

wanghao-cst avatar Sep 21 '23 02:09 wanghao-cst

> [the original issue, quoted in full]
>
> Hi, have you fixed the issue?

No, this is not yet fixed.

DefUs3r avatar Oct 07 '23 18:10 DefUs3r

How did you download the dataset coco/coco_dataset/val2014?

terminator123 avatar Dec 14 '23 08:12 terminator123

> How did you download the dataset coco/coco_dataset/val2014?

Do you know how to download coco_val2014 now?

kuaileqipaoshui avatar Jan 06 '24 15:01 kuaileqipaoshui
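
For reference, the val2014 images are distributed from the official COCO site; assuming the standard mirror and the image-folder layout used in the commands above, the download is:

wget http://images.cocodataset.org/zips/val2014.zip
unzip val2014.zip -d LLaVA/coco/coco_dataset/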

@Cubism-star

> I also got the same error when running inference with my own fine-tuned model.

Me too. Did you fix it?

Ryosuke0104 avatar Jan 20 '24 06:01 Ryosuke0104

Any update? I also face the same issue: after fine-tuning, I am not able to merge.

Kamleshpaul avatar Feb 27 '24 05:02 Kamleshpaul

Why has nobody fixed this?

ChenRan2000 avatar Apr 19 '24 02:04 ChenRan2000