feat: vlm bench warehouse tasks
Purpose
- Extend the VLM benchmark with images from the warehouse simulation and with new task types.
Proposed Changes
- Added Multiple Choice tasks and Quantity tasks
- Added more tasks with images from warehouse simulation
- Added creation of results summaries:
  - per task, across all repeats within a model (`tasks_summary.csv`)
  - per model, across all repeats and all tasks (`model_summary.csv`)
  - across all models (`benchmark_summary.csv`)
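As an illustration of how these three aggregation levels relate (the task names and scores below are made up and do not reflect the benchmark's actual schema), the per-model rate is just the mean over every task and repeat:

```python
import statistics

# Hypothetical per-task success scores for one model across 3 repeats
# (task names and values are illustrative only, not real benchmark data).
task_results = {
    "count_boxes": [1.0, 1.0, 0.0],
    "identify_shelf": [1.0, 0.0, 0.0],
}

# Per-task level (tasks_summary.csv): mean over the repeats of one task.
tasks_summary = {
    task: statistics.mean(scores) for task, scores in task_results.items()
}

# Per-model level (model_summary.csv): mean over all tasks and all repeats.
model_success_rate = statistics.mean(
    score for scores in task_results.values() for score in scores
)
```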
Testing
If you want to use Langfuse tracing, set:

```shell
export LANGFUSE_MAX_EVENT_SIZE_BYTES=20000000
```

because some tasks take more than 1 MB of space as a tracing item in Langfuse.
To test a single model:

```shell
cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/vlm_benchmark.py --model-name gemma3:4b --vendor ollama
```
To test many models:

```python
from rai_bench import (
    VLMBenchmarkConfig,
    test_models,
)

if __name__ == "__main__":
    # Define models you want to benchmark
    model_names = ["gpt-4o", "gpt-4o-mini", "gemma3:4b", "gemma3:12b", "llava:7b", "llava:13b", "minicpm-v", "llama3.2-vision:11b", "llava-llama3:8b", "qwen2.5vl:3b", "qwen2.5vl:7b", "moondream:1.8b", "granite3.2-vision", "bakllava:7b", "llava-phi3:3.8b"]
    vendors = ["openai", "openai", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama"]

    vlm_bench_conf = VLMBenchmarkConfig(repeats=3)
    out_dir = "src/rai_bench/rai_bench/experiments"

    test_models(
        model_names=model_names,
        vendors=vendors,
        benchmark_configs=[vlm_bench_conf],
        out_dir=out_dir,
    )
```
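Since `model_names` and `vendors` are parallel lists, a length mismatch would silently pair a model with the wrong vendor. A quick guard (a hypothetical sketch, not part of `rai_bench`) can catch this before the run starts:

```python
# Illustrative subset of the lists above.
model_names = ["gpt-4o", "gemma3:4b"]
vendors = ["openai", "ollama"]

# Guard against mismatched parallel lists before running the benchmark.
assert len(model_names) == len(vendors), "one vendor per model is required"

# Pairing them explicitly makes the model-to-vendor mapping easy to audit.
pairs = list(zip(model_names, vendors))
print(pairs)  # [('gpt-4o', 'openai'), ('gemma3:4b', 'ollama')]
```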
Results
Results were collected from VLM models available through Ollama that are smaller than 14B, with gpt-4o as a reference. Results are below (count is the number of tries per task; bakllava:7b has a lower total_tasks because it gets stuck on some tasks; I ran it 3 times and this issue occurred every time).
Results summary:
- qwen2.5vl:7b has the best average success rate, but also a high average latency (avg_time)
- qwen2.5vl:3b has, in my opinion, the best trade-off between latency and success rate
- minicpm-v:8b and gemma3:4b also offer a quite good trade-off
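One way to make the latency/success trade-off comparable across models is a simple success-per-second ratio. This is my own illustrative metric, and the numbers below are placeholders, not the actual benchmark results:

```python
# Placeholder figures only; real values come from benchmark_summary.csv.
models = {
    "model_a": {"success_rate": 0.8, "avg_time": 20.0},  # accurate but slow
    "model_b": {"success_rate": 0.7, "avg_time": 5.0},   # slightly less accurate, much faster
}

# Rank models by success rate achieved per second of latency.
ranked = sorted(
    models.items(),
    key=lambda kv: kv[1]["success_rate"] / kv[1]["avg_time"],
    reverse=True,
)
print([name for name, _ in ranked])  # ['model_b', 'model_a']
```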
Tracing has a bug: every proper trace is followed by a random empty trace.
Todo(mm): Make sure images are lfs pointers.