VILA
Inference not working - Keyword tensor should have 2 or 3 dimensions, got 1
I get the following error while running llava/eval/run_vila.py on an H100 GPU:
root@7513903dd8b0:/src/VILA# python -W ignore llava/eval/run_vila.py --model-path Efficient-Large-Model/VILA1.5-3b --conv-mode vicuna_v1 --query "<video>\n Please describe this video." --video-file "tjx1PPFsa6A-Scene-049.mp4"
Fetching 17 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 203142.93it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.47it/s]
no <image> tag found in input. Automatically append one at the beginning of text.
input: <image>
<image>
<image>
<image>
<image>
<image>
<video>\n Please describe this video.
[WARNING] the auto inferred conversation mode is llava_v0, while `--conv-mode` is vicuna_v1, using vicuna_v1
torch.Size([6, 3, 384, 384])
Traceback (most recent call last):
File "/src/VILA/llava/eval/run_vila.py", line 154, in <module>
eval_model(args)
File "/src/VILA/llava/eval/run_vila.py", line 116, in eval_model
output_ids = model.generate(
^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/src/VILA/llava/model/language_model/llava_llama.py", line 171, in generate
outputs = self.llm.generate(
^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/generation/utils.py", line 1764, in generate
return self.sample(
^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/generation/utils.py", line 2924, in sample
if stopping_criteria(input_ids, scores):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/generation/stopping_criteria.py", line 132, in __call__
return any(criteria(input_ids, scores) for criteria in self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/generation/stopping_criteria.py", line 132, in <genexpr>
return any(criteria(input_ids, scores) for criteria in self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/src/VILA/llava/mm_utils.py", line 298, in __call__
outputs.append(self.call_for_batch(output_ids[i].unsqueeze(0), scores))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/src/VILA/llava/mm_utils.py", line 279, in call_for_batch
raise ValueError(
ValueError: Keyword tensor should have 2 or 3 dimensions, got 1
The torch version is 2.0.1+cu118 and flash-attention is 2.4.2.
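For context, the ValueError is raised by the keyword stopping criterion implemented in llava/mm_utils.py, which expects a batched output_ids tensor. The following is a minimal, self-contained sketch of that kind of check (illustrative only, not VILA's actual code; the function name ends_with_keyword is made up), showing how an unbatched 1-D tensor trips the same error:

import torch

def ends_with_keyword(output_ids: torch.Tensor, keyword_ids: torch.Tensor) -> bool:
    # Expect output_ids shaped (batch, seq_len) or (batch, beams, seq_len).
    if output_ids.dim() not in (2, 3):
        raise ValueError(f"Keyword tensor should have 2 or 3 dimensions, got {output_ids.dim()}")
    if output_ids.dim() == 3:
        # Flatten beams into the batch dimension before comparing.
        output_ids = output_ids.reshape(-1, output_ids.shape[-1])
    # Compare the last len(keyword_ids) tokens of each sequence against the keyword.
    tail = output_ids[:, -keyword_ids.shape[-1]:]
    return bool((tail == keyword_ids).all(dim=-1).any())

print(ends_with_keyword(torch.tensor([[1, 2, 3, 4]]), torch.tensor([3, 4])))  # 2-D input: True
try:
    ends_with_keyword(torch.tensor([1, 2, 3, 4]), torch.tensor([3, 4]))       # 1-D input
except ValueError as e:
    print(e)  # Keyword tensor should have 2 or 3 dimensions, got 1

In other words, somewhere in the generate call chain the ids are reaching the stopping criterion without a batch dimension.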
Sorry, we merged a PR yesterday that turned out to be problematic. We just rolled it back. Could you pull the latest code and try again?
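(For example, from the checkout shown in the original report: cd /src/VILA && git pull, then rerun the same run_vila.py command.)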
I'm also using torch 2.0.1+cu118 and flash-attention 2.4.2, and I get this error:
Setting pad_token_id to eos_token_id:128001 for open-end generation.
Traceback (most recent call last):
File "/MarineAI/Nvidia-VILA/VILA/llava/eval/run_vila.py", line 154, in
Are you using Llama 3? If so, you need to pass --conv-mode=llama_3.
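For example, with a Llama-3-based VILA checkpoint (the model path and video file below are placeholders, not taken from this thread):
python -W ignore llava/eval/run_vila.py --model-path <your-llama3-VILA-checkpoint> --conv-mode llama_3 --query "<video>\n Please describe this video." --video-file <your-video.mp4>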
Sorry, I did not pay attention to this parameter... it works now. Thanks a lot!
@Efficient-Large-Language-Model Pulling the latest code worked for me. Thank you!