Getting N/A when evaluating lmms-lab/llava-onevision-qwen2-7b-ov on mathvista_testmini, docvqa_test, and infovqa_test
- mathvista_testmini:

```json
{
    "results": {
        "mathvista_testmini": {
            " ": " ",
            "alias": "mathvista_testmini"
        },
        "mathvista_testmini_cot": {
            "alias": " - mathvista_testmini_cot",
            "gpt_eval_score,none": 29.2,
            "gpt_eval_score_stderr,none": "N/A",
            "submission,none": [],
            "submission_stderr,none": []
        },
        "mathvista_testmini_format": {
            "alias": " - mathvista_testmini_format",
            "gpt_eval_score,none": 39.0,
            "gpt_eval_score_stderr,none": "N/A",
            "submission,none": [],
            "submission_stderr,none": []
        },
        "mathvista_testmini_solution": {
            "alias": " - mathvista_testmini_solution",
            "gpt_eval_score,none": 35.7,
            "gpt_eval_score_stderr,none": "N/A",
            "submission,none": [],
            "submission_stderr,none": []
        }
    },
```

- docvqa_test:

```json
    "results": {
        "docvqa_test": {
            "alias": "docvqa_test",
            "anls,none": [],
            "anls_stderr,none": [],
            "submission,none": null,
            "submission_stderr,none": "N/A"
        }
    },
    "group_subtasks": {
        "docvqa_test": []
    },
```

- infovqa_test:

```json
    "results": {
        "infovqa_test": {
            "alias": "infovqa_test",
            "submission,none": null,
            "submission_stderr,none": "N/A"
        }
    },
    "group_subtasks": {
        "infovqa_test": []
    },
```
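For context (my understanding, not verified against the lmms-eval source): docvqa_test and infovqa_test are answer-withheld test splits, so lmms-eval can only produce a submission file for the benchmark's evaluation server rather than a local score, which is why those metrics come back as N/A; mathvista_testmini does score locally via gpt_eval_score, and only its stderr is N/A. A minimal sketch for finding whatever submission artifacts a run wrote, assuming the files carry "submission" in their names:

```bash
# Sketch: look for submission artifacts under the lmms-eval output dir.
# The "*submission*" name pattern is an assumption; adjust it to whatever
# your run actually wrote.
find ./logs -iname "*submission*" -print
```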
@viyjy Can you teach me how to evaluate? I want to evaluate lmms-lab/LLaVA-Video-7B-Qwen2.
Please follow this repo to do eval: https://github.com/EvolvingLMMs-Lab/lmms-eval
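For a first run, a minimal sketch of the flow from that README (double-check `--tasks list` against `--help` in your installed version):

```bash
pip install lmms-eval
python -m lmms_eval --tasks list   # sanity check: print the available tasks
```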
Did I do it right?
I tried to use:
```bash
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
    --model llava_vid \
    --model_args pretrained=lmms-lab/LLaVA-Video-7B-Qwen2,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=18,mm_spatial_pool_mode=average,mm_newline_position=grid,mm_resampler_location=after \
    --tasks videomme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_vid_32B \
    --output_path ./logs/
```
Result:
Model Responding: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2700/2700 [3:35:54<00:00, 4.80s/it]
Postprocessing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2700/2700 [00:00<00:00, 6466.64it/s]
2025-02-24 19:26:12.473 | INFO | utils:videomme_aggregate_results:314 - Evaluation on video Type: short: 71.7%
2025-02-24 19:26:12.474 | INFO | utils:videomme_aggregate_results:314 - Evaluation on video Type: medium: 56.9%
2025-02-24 19:26:12.474 | INFO | utils:videomme_aggregate_results:314 - Evaluation on video Type: long: 49.4%
2025-02-24 19:26:12.475 | INFO | utils:videomme_aggregate_results:323 - Evaluation on Categories: Knowledge: 58.9%
2025-02-24 19:26:12.475 | INFO | utils:videomme_aggregate_results:323 - Evaluation on Categories: Film & Television: 63.3%
2025-02-24 19:26:12.476 | INFO | utils:videomme_aggregate_results:323 - Evaluation on Categories: Sports Competition: 58.2%
2025-02-24 19:26:12.476 | INFO | utils:videomme_aggregate_results:323 - Evaluation on Categories: Artistic Performance: 60.6%
2025-02-24 19:26:12.477 | INFO | utils:videomme_aggregate_results:323 - Evaluation on Categories: Life Record: 57.9%
2025-02-24 19:26:12.477 | INFO | utils:videomme_aggregate_results:323 - Evaluation on Categories: Multilingual: 57.8%
2025-02-24 19:26:12.478 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Humanity & History: 35.6%
2025-02-24 19:26:12.478 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Literature & Art: 54.4%
2025-02-24 19:26:12.479 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Biology & Medicine: 64.4%
2025-02-24 19:26:12.479 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Finance & Commerce: 64.4%
2025-02-24 19:26:12.480 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Astronomy: 67.8%
2025-02-24 19:26:12.480 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Geography: 50.0%
2025-02-24 19:26:12.481 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Law: 58.9%
2025-02-24 19:26:12.481 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Life Tip: 68.9%
2025-02-24 19:26:12.482 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Technology: 65.6%
2025-02-24 19:26:12.482 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Animation: 58.9%
2025-02-24 19:26:12.483 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Movie & TV Show: 62.2%
2025-02-24 19:26:12.483 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Documentary: 61.1%
2025-02-24 19:26:12.483 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: News Report: 71.1%
2025-02-24 19:26:12.484 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Esports: 53.3%
2025-02-24 19:26:12.484 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Basketball: 52.2%
2025-02-24 19:26:12.485 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Football: 63.3%
2025-02-24 19:26:12.485 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Athletics: 56.7%
2025-02-24 19:26:12.486 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Other Sports: 65.6%
2025-02-24 19:26:12.486 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Stage Play: 78.9%
2025-02-24 19:26:12.487 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Magic Show: 51.1%
2025-02-24 19:26:12.487 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Variety Show: 44.4%
2025-02-24 19:26:12.488 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Acrobatics: 67.8%
2025-02-24 19:26:12.488 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Handicraft: 66.7%
2025-02-24 19:26:12.488 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Food: 47.8%
2025-02-24 19:26:12.489 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Fashion: 55.6%
2025-02-24 19:26:12.489 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Daily Life: 56.7%
2025-02-24 19:26:12.490 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Travel: 56.7%
2025-02-24 19:26:12.490 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Pet & Animal: 73.3%
2025-02-24 19:26:12.491 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Exercise: 48.9%
2025-02-24 19:26:12.491 | INFO | utils:videomme_aggregate_results:332 - Evaluation on Video Sub Categories: Multilingual: 57.8%
2025-02-24 19:26:12.492 | INFO | utils:videomme_aggregate_results:341 - Evaluation on Task Categories: Temporal Perception: 70.9%
2025-02-24 19:26:12.492 | INFO | utils:videomme_aggregate_results:341 - Evaluation on Task Categories: Spatial Perception: 64.8%
2025-02-24 19:26:12.493 | INFO | utils:videomme_aggregate_results:341 - Evaluation on Task Categories: Attribute Perception: 70.3%
2025-02-24 19:26:12.493 | INFO | utils:videomme_aggregate_results:341 - Evaluation on Task Categories: Action Recognition: 60.4%
2025-02-24 19:26:12.494 | INFO | utils:videomme_aggregate_results:341 - Evaluation on Task Categories: Object Recognition: 64.1%
2025-02-24 19:26:12.494 | INFO | utils:videomme_aggregate_results:341 - Evaluation on Task Categories: OCR Problems: 55.4%
2025-02-24 19:26:12.495 | INFO | utils:videomme_aggregate_results:341 - Evaluation on Task Categories: Counting Problem: 39.6%
2025-02-24 19:26:12.496 | INFO | utils:videomme_aggregate_results:341 - Evaluation on Task Categories: Temporal Reasoning: 42.9%
2025-02-24 19:26:12.496 | INFO | utils:videomme_aggregate_results:341 - Evaluation on Task Categories: Spatial Reasoning: 82.1%
2025-02-24 19:26:12.497 | INFO | utils:videomme_aggregate_results:341 - Evaluation on Task Categories: Action Reasoning: 52.6%
2025-02-24 19:26:12.497 | INFO | utils:videomme_aggregate_results:341 - Evaluation on Task Categories: Object Reasoning: 56.6%
2025-02-24 19:26:12.498 | INFO | utils:videomme_aggregate_results:341 - Evaluation on Task Categories: Information Synopsis: 75.5%
2025-02-24 19:26:12.498 | INFO | utils:videomme_aggregate_results:348 - Overall Performance: 59.3%
2025-02-24 19:26:12.636 | INFO | lmms_eval.loggers.evaluation_tracker:save_results_aggregated:188 - Saving results aggregated
2025-02-24 19:26:12.646 | INFO | lmms_eval.loggers.evaluation_tracker:save_results_samples:255 - Saving per-sample results for: videomme
llava_vid (pretrained=lmms-lab/LLaVA-Video-7B-Qwen2,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=18,mm_spatial_pool_mode=average,mm_newline_position=grid,mm_resampler_location=after), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 1
| Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|
|--------|-------|------|-----:|-------------------------|---|------:|---|------|
|videomme|Yaml |none | 0|videomme_perception_score|↑ |59.3333|± | N/A|
Looks great.
reference: https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_Video_1003.md#evaluating-llava-video-on-multiple-datasets
@ZhangYuanhan-AI While I'm doing the eval, I get the message `mmco: unref short failure`. Does it have any negative effect on the eval?
No
When you do eval, do you get the same warning message as me?
@ixn3rd3mxn @ZhangYuanhan-AI What does your environment look like? I am having trouble setting up lmms-eval. I have tried the following three ways:
- When I try to install lmms-eval using

```bash
cd lmms-eval
uv sync  # creates/updates the environment from uv.lock
```

this does not correctly sync my environment, and I need to manually install a number of libraries.
- When I restart and instead run `uv pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git`, transformers (upgraded to 4.47) becomes incompatible with torch (2.1). I upgraded torch to 2.2 and flash-attn accordingly, but then lmms-eval lists no tasks as available.
- When I force-install an earlier version, lmms-eval 0.2.0, the tasks are listed correctly, but when I run the command line from the docs:

```bash
accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained=lmms-lab/llama3-llava-next-8b,conv_template=llava_llama_3 --tasks gqa,ai2d,chartqa,docvqa_val,mme,mmbench_en_dev --batch_size 1 --log_samples --log_samples_suffix llava_next --output_path /LLaVA-NeXT/logs/
```

flash-attn fails to import (a quick sanity check is sketched below).
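Not a fix for the root cause, but a minimal sanity check worth running first (plain torch/flash-attn introspection, nothing project-specific) to confirm the installed wheels actually match:

```bash
# Confirm torch, its CUDA build, and flash-attn all import and report
# consistent versions before launching lmms-eval.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"  # ImportError here means the wheel doesn't match torch/CUDA/Python
```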
@SStoica12 Are you already using conda?
@ixn3rd3mxn Thank you for your reply; yes, I am using conda with CUDA driver 12.2.
@SStoica12 You can try following my setup. My spec: Python 3.10.16, PyTorch 2.1.2, PyTorch CUDA 12.1, CUDA driver 12.6, GPU: A30.
Script:

```bash
conda create -n YOUR_NAME_ENVIRONMENT python=3.10 -y
conda activate YOUR_NAME_ENVIRONMENT
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install --upgrade pip
nvidia-smi       # check CUDA
pip install -e .
pip install -e ".[train]"
conda install -c nvidia cuda-compiler
nvcc --version   # check the CUDA compiler
pip install flash-attn==2.5.7
pip install lmms-eval
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
    --model llava_vid \
    --model_args pretrained=XXX,conv_template=XXX,video_decode_backend=decord,max_frames_num=XXX,mm_spatial_pool_mode=average,mm_newline_position=grid,mm_resampler_location=after \
    --tasks videomme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_vid_32B \
    --output_path ./logs/
```
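For example, filling the XXX placeholders with the values used earlier in this thread would give pretrained=lmms-lab/LLaVA-Video-7B-Qwen2, conv_template=qwen_1_5, and max_frames_num=18.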
@ixn3rd3mxn Thank you; following now. My only concern is that `pip install lmms-eval` will install the newest version of lmms-eval, which upgrades transformers to 4.47, which is incompatible with torch 2.1. What version of lmms-eval do you have? Or could you just run `pip list` and share the environment with me? Or, can you run:

```bash
conda activate <their_environment_name>
conda env export > environment.yml
```

and share the environment.yml file with me?
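(Even just the output of `pip list --format=freeze` from the activated environment would help.)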
@SStoica12 Oh, I forgot to tell you about my requirements. Sorry about that. Haha.
accelerate==1.3.0
aiofiles==24.1.0
aiohappyeyeballs==2.4.4
aiohttp==3.11.11
aiosignal==1.3.2
altair==5.5.0
anyio==4.8.0
async-timeout==5.0.1
attrs==25.1.0
av==14.1.0
bitsandbytes==0.41.0
certifi==2024.12.14
charset-normalizer==3.4.1
click==8.1.8
contourpy==1.3.1
cycler==0.12.1
datasets==2.16.1
decord==0.6.0
deepspeed==0.14.4
dill==0.3.7
docker-pycreds==0.4.0
docstring_parser==0.16
einops==0.6.1
einops-exts==0.0.4
exceptiongroup==1.2.2
fastapi==0.115.7
ffmpy==0.5.0
filelock==3.17.0
flash-attn==2.5.7
fonttools==4.55.7
frozenlist==1.5.0
fsspec==2023.10.0
ftfy==6.3.1
gitdb==4.0.12
GitPython==3.1.44
gradio==3.35.2
gradio_client==0.2.9
h11==0.14.0
hf_transfer==0.1.9
hjson==3.1.0
httpcore==0.17.3
httpx==0.24.0
huggingface-hub==0.28.0
idna==3.10
Jinja2==3.1.5
joblib==1.4.2
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
kiwisolver==1.4.8
latex2mathml==3.77.0
linkify-it-py==2.0.3
-e git+https://github.com/LLaVA-VL/LLaVA-NeXT@79ef45a6d8b89b92d7a8525f077c3a3a9894a87d#egg=llava
markdown-it-py==2.2.0
markdown2==2.5.3
MarkupSafe==3.0.2
matplotlib==3.10.0
mdit-py-plugins==0.3.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.15
narwhals==1.24.1
networkx==3.4.2
ninja==1.11.1.3
numpy==1.26.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.570.86
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.8.61
nvidia-nvtx-cu12==12.1.105
open_clip_torch==2.30.0
opencv-python==4.11.0.86
orjson==3.10.15
packaging==24.2
pandas==2.2.3
peft==0.4.0
pillow==11.1.0
platformdirs==4.3.6
propcache==0.2.1
protobuf==5.29.3
psutil==6.1.1
py-cpuinfo==9.0.0
pyarrow==19.0.0
pyarrow-hotfix==0.6
pydantic==1.10.8
pydub==0.25.1
Pygments==2.19.1
pyparsing==3.2.1
python-dateutil==2.9.0.post0
python-multipart==0.0.20
pytz==2024.2
PyYAML==6.0.2
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rpds-py==0.22.3
safetensors==0.5.2
scikit-learn==1.2.2
scipy==1.15.1
semantic-version==2.10.0
sentencepiece==0.1.99
sentry-sdk==2.20.0
setproctitle==1.3.4
shortuuid==1.0.13
shtab==1.7.1
six==1.17.0
smmap==5.0.2
sniffio==1.3.1
starlette==0.45.3
svgwrite==1.4.3
sympy==1.13.3
threadpoolctl==3.5.0
timm==1.0.14
tokenizers==0.15.2
torch==2.1.2
torchvision==0.16.2
tqdm==4.67.1
transformers @ git+https://github.com/huggingface/transformers.git@1c39974a4c4036fd641bc1191cc32799f85715a4
triton==2.1.0
typeguard==4.4.1
typing_extensions==4.12.2
tyro==0.9.13
tzdata==2025.1
uc-micro-py==1.0.3
urllib3==1.26.20
uvicorn==0.34.0
wandb==0.18.7
wavedrom==2.0.3.post3
wcwidth==0.2.13
websockets==14.2
xxhash==3.5.0
yarl==1.18.3
I probably won't be able to provide the other things you asked for, because this project is on a server at the company I used to work for, and since I've resigned I can't retrieve any more details about the library versions. :D
@ixn3rd3mxn No worries and thank you :D
- I don't see lmms-eval in the list. Do you happen to remember which version you were using? Was it 0.3.0, 0.2.0, or 0.2.1?
- For flash-attn, did you use the wheel flash_attn-2.5.7+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl?
Hope you are enjoying your new job!
@SStoica12
1. Since the requirements.txt file I saved was created before I installed the lmms-eval library, it doesn't include it.
2. I'm not really sure about this part; I just installed flash-attn==2.5.7 directly with pip install, nothing special. The installation went fine; it just took a really long time to download. (https://github.com/Dao-AILab/flash-attention/issues/1038#issuecomment-2563443121)
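In case the long source build is a problem for anyone else: a prebuilt wheel matching the stack above should also work. This is a sketch based on the flash-attention release naming scheme, using the wheel filename mentioned above; double-check the exact filename on the v2.5.7 release page:

```bash
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.7/flash_attn-2.5.7+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```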