
feat: vlm bench warehouse tasks

MagdalenaKotynia opened this issue 4 months ago • 2 comments

Purpose

  • To extend the VLM benchmark with images from the warehouse simulation and with tasks of different types.

Proposed Changes

  • Added Multiple Choice tasks and Quantity tasks
  • Added more tasks with images from warehouse simulation
  • Added creation of results summaries
    • Per task for all repeats within a model (tasks_summary.csv)
    • Per model for all repeats and all tasks (model_summary.csv)
    • For all models (benchmark_summary.csv)
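The three summary levels above can be sketched as a simple aggregation. This is only an illustration, assuming per-repeat records with hypothetical `model`, `task`, and `success` keys; the actual column names and logic in rai_bench may differ:

```python
from collections import defaultdict
from statistics import mean

def summarize(records):
    """Aggregate per-repeat results into per-task and per-model summaries.

    `records` is a list of dicts with hypothetical keys:
    "model", "task", "success" (1.0 or 0.0 per repeat).
    """
    tasks = defaultdict(list)   # (model, task) -> successes across repeats
    models = defaultdict(list)  # model -> successes across all repeats/tasks
    for r in records:
        tasks[(r["model"], r["task"])].append(r["success"])
        models[r["model"]].append(r["success"])

    # tasks_summary.csv: one row per task within a model
    tasks_summary = {k: mean(v) for k, v in tasks.items()}
    # model_summary.csv: one row per model, over all repeats and tasks
    model_summary = {k: mean(v) for k, v in models.items()}
    # benchmark_summary.csv: the per-model rows collected across all models
    benchmark_summary = dict(model_summary)
    return tasks_summary, model_summary, benchmark_summary
```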

Testing

If you want to use Langfuse tracing, run `export LANGFUSE_MAX_EVENT_SIZE_BYTES=20000000`, because some tasks take more than 1 MB as a tracing item in Langfuse.

To test single model:

cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/vlm_benchmark.py --model-name gemma3:4b --vendor ollama

To test many models:

from rai_bench import (
    VLMBenchmarkConfig,
    test_models,
)

if __name__ == "__main__":
    # Define models you want to benchmark
    model_names = [
        "gpt-4o",
        "gpt-4o-mini",
        "gemma3:4b",
        "gemma3:12b",
        "llava:7b",
        "llava:13b",
        "minicpm-v",
        "llama3.2-vision:11b",
        "llava-llama3:8b",
        "qwen2.5vl:3b",
        "qwen2.5vl:7b",
        "moondream:1.8b",
        "granite3.2-vision",
        "bakllava:7b",
        "llava-phi3:3.8b",
    ]
    # One vendor per model, in the same order: the first two are OpenAI,
    # the remaining thirteen are served through Ollama.
    vendors = ["openai", "openai"] + ["ollama"] * 13

    vlm_bench_conf = VLMBenchmarkConfig(repeats=3)

    out_dir = "src/rai_bench/rai_bench/experiments"
    test_models(
        model_names=model_names,
        vendors=vendors,
        benchmark_configs=[vlm_bench_conf],
        out_dir=out_dir,
    )
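Since `test_models` pairs `model_names` and `vendors` positionally, keeping them as two parallel lists is easy to get out of sync. A small convenience sketch (not part of the rai_bench API) derives both lists from a single mapping:

```python
# Hypothetical helper: build the parallel lists test_models expects
# from one model -> vendor mapping, so they cannot drift apart.
MODEL_VENDORS = {
    "gpt-4o": "openai",
    "gpt-4o-mini": "openai",
    "gemma3:4b": "ollama",
    "qwen2.5vl:3b": "ollama",
    "qwen2.5vl:7b": "ollama",
}

model_names = list(MODEL_VENDORS.keys())
vendors = [MODEL_VENDORS[m] for m in model_names]
```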

Results

Results were collected for the VLM models available through Ollama that are smaller than 14b, with gpt-4o as a reference. Results are below (count is the number of tries per task; bakllava:7b has fewer total_tasks because it gets stuck on some tasks: I ran it 3 times and this issue occurred every time).

merged_results_summary.csv

Results summary:

  • qwen2.5vl:7b has the best average success rate, but also a high average latency (avg_time)
  • qwen2.5vl:3b has, in my opinion, the best trade-off between latency and success rate
  • minicpm-v:8b and gemma3:4b also offer quite a good trade-off
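One way to make the trade-off claim concrete is to rank models by success rate per second of latency. The figures below are hypothetical placeholders, not the measured results; the real values live in merged_results_summary.csv, and the column semantics (success rate, avg_time) are assumptions based on the summary above:

```python
# Hypothetical per-model figures: (success_rate, avg_time in seconds).
results = {
    "qwen2.5vl:7b": (0.80, 12.0),
    "qwen2.5vl:3b": (0.72, 4.0),
    "gemma3:4b": (0.62, 3.8),
}

# Simple trade-off score: success rate divided by latency,
# so fast-and-accurate models rank first.
ranked = sorted(results, key=lambda m: results[m][0] / results[m][1], reverse=True)
```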

MagdalenaKotynia avatar Aug 12 '25 08:08 MagdalenaKotynia

Tracing has a bug: every proper trace is followed by a random empty trace.

jmatejcz avatar Sep 03 '25 09:09 jmatejcz

Todo(mm): Make sure images are lfs pointers.

maciejmajek avatar Sep 22 '25 10:09 maciejmajek