Error when running NVILA-8B-Video
When I evaluated NVILA-8B-Video on lmms-longvideobench with this script:
```bash
#!/bin/bash
set -e

MODEL_NAMES=(
    "NVILA-8B-Video"
)
SELECTED_TASKS=(
    "lmms-longvideobench_val_v"
)

TASK_STR=$(
    IFS=,
    echo "${SELECTED_TASKS[*]}"
)
echo "TASK_STR: $TASK_STR"

START_TIME=$(date +%s)
echo "START_TIME: $(date -d @"$START_TIME")"

for MODEL_NAME in "${MODEL_NAMES[@]}"; do
    MODEL_ID="../my-models/models/$MODEL_NAME"
    vila-eval \
        --model-name "$MODEL_NAME" \
        --model-path "$MODEL_ID" \
        --conv-mode auto \
        --tags-include local \
        --nproc-per-node 2 \
        --tasks "$TASK_STR" \
        --output-dir "./runs/run-eval-20250112/$MODEL_NAME"
done

END_TIME=$(date +%s)
echo "END_TIME: $(date -d @"$END_TIME")"
echo "TIME_TAKEN: $((END_TIME - START_TIME)) seconds"
```
I encountered this error:
```
2025-01-16 23:23:30.871 | WARNING | llava.utils.media:_load_video:59 - Failed to read frame 8014 from video '/data/yy/.cache/huggingface/longvideobench/videos/BktEeBeA7a8.mp4'. Skipped.
2025-01-16 23:23:30.875 | WARNING | llava.utils.media:_load_video:59 - Failed to read frame 8014 from video '/data/yy/.cache/huggingface/longvideobench/videos/BktEeBeA7a8.mp4'. Skipped.
Traceback (most recent call last):
  File "/data/yy/anaconda3/envs/vila/lib/python3.10/site-packages/lmms_eval/__main__.py", line 329, in cli_evaluate
    results, samples = cli_evaluate_single(args)
  File "/data/yy/anaconda3/envs/vila/lib/python3.10/site-packages/lmms_eval/__main__.py", line 470, in cli_evaluate_single
    results = evaluator.simple_evaluate(
  File "/data/yy/anaconda3/envs/vila/lib/python3.10/site-packages/lmms_eval/utils.py", line 533, in _wrapper
    return fn(*args, **kwargs)
  File "/data/yy/anaconda3/envs/vila/lib/python3.10/site-packages/lmms_eval/evaluator.py", line 243, in simple_evaluate
    results = evaluate(
  File "/data/yy/anaconda3/envs/vila/lib/python3.10/site-packages/lmms_eval/utils.py", line 533, in _wrapper
    return fn(*args, **kwargs)
  File "/data/yy/anaconda3/envs/vila/lib/python3.10/site-packages/lmms_eval/evaluator.py", line 457, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)  # Choiszt run generate until
  File "/data/yy/anker/nvila/VILA/llava/eval/lmms/models/vila_internal.py", line 106, in generate_until
    response = self.model.generate_content(prompt, generation_config=generation_config)
  File "/data/yy/anaconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/yy/anker/nvila/VILA/llava/model/llava_arch.py", line 834, in generate_content
    output_ids = self.generate(
  File "/data/yy/anaconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/yy/anker/nvila/VILA/llava/model/llava_arch.py", line 783, in generate
    inputs_embeds, _, attention_mask = self._embed(input_ids, media, media_config, None, attention_mask)
  File "/data/yy/anker/nvila/VILA/llava/model/llava_arch.py", line 415, in _embed
    media_embeds = self.__embed_media_tokens(media, media_config)
  File "/data/yy/anker/nvila/VILA/llava/model/llava_arch.py", line 488, in __embed_media_tokens
    embeds[name] = deque(self.encoders[name](media[name], media_config[name]))
  File "/data/yy/anaconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/yy/anaconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/yy/anker/nvila/VILA/llava/model/encoders/video/tsp.py", line 64, in forward
    return [process_features(f) for f in features]
  File "/data/yy/anker/nvila/VILA/llava/model/encoders/video/tsp.py", line 64, in <listcomp>
    return [process_features(f) for f in features]
  File "/data/yy/anker/nvila/VILA/llava/model/encoders/video/tsp.py", line 41, in _process_features
    features = pool(features, p, dim=dim)
  File "/data/yy/anker/nvila/VILA/llava/model/encoders/video/tsp.py", line 12, in pool
    return x.view(x.shape[:dim] + (-1, size) + x.shape[dim + 1 :]).mean(dim + 1)
RuntimeError: shape '[-1, 8, 16, 16, 3584]' is invalid for input of size 6422528
2025-01-16 23:23:31.233 | ERROR | __main__:cli_evaluate:348 - Error during evaluation: shape '[-1, 8, 16, 16, 3584]' is invalid for input of size 6422528. Please set `--verbosity=DEBUG` to get more information.
```
What is this TSPVideoEncoder, and how can I avoid this error?
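Judging from the traceback, TSPVideoEncoder (llava/model/encoders/video/tsp.py) appears to mean-pool the per-frame vision features in fixed-size temporal groups (TSP presumably stands for temporal-spatial pooling), and the reshape fails because the number of decoded frames is not a multiple of the pool size. The arithmetic in the error message gives it away: 6422528 = 7 × 16 × 16 × 3584, so only 7 frames survived decoding (the "Failed to read frame 8014 ... Skipped." warnings dropped at least one), while the view expects groups of 8. A minimal sketch reproducing the failure, with all shapes inferred from the error message rather than read from the model code:

```python
import torch

# Shapes inferred from the RuntimeError (assumptions, not taken from VILA's code):
# 7 successfully decoded frames, 16x16 visual tokens per frame, hidden size 3584,
# temporal pool size 8.
frames, size = 7, 8
x = torch.zeros(frames, 16, 16, 3584)
print(x.numel())  # 6422528 -- the "input of size" in the error

# The view() in tsp.py (pooling along dim=0) requires frames % size == 0:
try:
    x.view(-1, size, 16, 16, 3584).mean(1)
except RuntimeError as e:
    print(e)  # shape '[-1, 8, 16, 16, 3584]' is invalid for input of size 6422528
```

So the root cause is most likely the corrupted (or shorter-than-indexed) video BktEeBeA7a8.mp4: the loader requested a full set of frames, one read failed and was skipped, and the encoder received 7 frames instead of a multiple of 8.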
Hello, I encountered the same problem. Did you solve it? As for TSPVideoEncoder, I modified video_encoder.target in config.json to point to the TSPVideoEncoder class with the correct module path (llava.model.encoders.tsp.TSPVideoEncoder). Even after making this change, I still hit the same error.
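Changing the config target would not help here, since the error is triggered by the frame count, not by how the encoder class is resolved. Until the underlying video is fixed (re-downloading BktEeBeA7a8.mp4 so every frame read succeeds would be the cleanest route), one possible workaround is to make the pooling tolerant of a short frame count. Below is a hedged sketch of such a patch, modeled on the pool() shown at the bottom of the traceback (llava/model/encoders/video/tsp.py); the padding strategy is my own workaround, not upstream behavior:

```python
import torch

def pool(x: torch.Tensor, size: int, dim: int = 0) -> torch.Tensor:
    """Mean-pool x in non-overlapping groups of `size` along `dim`.

    Unlike the original one-liner, this pads a short temporal dimension by
    repeating the last frame, so the view() cannot fail when a frame is
    skipped during decoding (e.g. 7 frames with size=8 becomes 8).
    """
    remainder = x.shape[dim] % size
    if remainder:
        # Duplicate the last slice along `dim` until the length is a
        # multiple of the pool size.
        last = x.narrow(dim, x.shape[dim] - 1, 1)
        pad = last.repeat_interleave(size - remainder, dim=dim)
        x = torch.cat([x, pad], dim=dim)
    return x.view(x.shape[:dim] + (-1, size) + x.shape[dim + 1:]).mean(dim + 1)
```

Padding (rather than truncating) matters for exactly this failure: with 7 frames and a pool size of 8, truncation would leave zero frames, while padding yields one complete group.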