A Better Evaluation Entry for OpenAI-Style API Models
In practice, deployment and evaluation are typically kept separate. This avoids the dependency bloat that comes with supporting many different models (the more models supported, the more complex the dependencies become) and also makes it easier to evaluate extremely large models. Moreover, a single run usually evaluates just one API endpoint; evaluating multiple models simply means running the command multiple times.
I noticed that the framework supports OpenAI-style API models for evaluation through the GPT4V class. However, the user experience still needs improvement. Specifically:
- To test a model, you need to modify config.py and register a model_name (see the sketch after this list).
- Diverse parameters (e.g., temperature, timeout) require manual adjustments in code.
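For context, the current registration step looks roughly like the sketch below. The exact structure of config.py and the keyword arguments accepted by GPT4V are assumptions made for illustration, not copied from the code base:

# config.py -- illustrative sketch of today's workflow, not the real file
from functools import partial
from vlmeval.api import GPT4V   # the OpenAI-style wrapper mentioned above

supported_VLM = {
    # every new endpoint needs a hand-edited entry like this, and generation
    # parameters end up hard-coded instead of being passed on the command line
    'vllm_qwen_2.5-7b': partial(
        GPT4V,
        model='qwen2.5-7b-instruct',                           # assumed served model name
        api_base='http://localhost:8000/v1/chat/completions',  # assumed vLLM endpoint
        temperature=0.0,
        timeout=60,
    ),
}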
Could you provide an interface like this?
python run_api.py \
--model-name "vllm_qwen_2.5-7b" \
--base-url "xxxxxxxx" \
--api-key "xxxxx" \
--max-token-out 16000 \
--min-pixels 3k \
--max-pixels 100w \
--temperature 0.1 \
--top-p 0.9 \
--data MME \
--work-dir ./outputs
With such an entry point, models like the one in VLMEvalKit#1093 (https://github.com/open-compass/VLMEvalKit/pull/1093) could be supported automatically, with no extra per-model code. And since vLLM, SGLang, and LMDeploy keep adding model support, anything they can serve behind an OpenAI-compatible endpoint could then be evaluated directly.
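For reference, here is a minimal sketch of what run_api.py could look like, assuming the existing GPT4V wrapper can be constructed directly from CLI flags; the keyword names passed to GPT4V and the way the model object is handed to the evaluation loop are assumptions, not the current VLMEvalKit API:

# run_api.py -- minimal sketch, not an actual implementation
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description='Evaluate any OpenAI-compatible API endpoint without touching config.py')
    parser.add_argument('--model-name', required=True)
    parser.add_argument('--base-url', required=True)
    parser.add_argument('--api-key', default='EMPTY')
    parser.add_argument('--max-token-out', type=int, default=4096)
    # shorthand such as "3k" or "100w" would need a small parser; plain ints assumed here
    parser.add_argument('--min-pixels', type=int, default=None)
    parser.add_argument('--max-pixels', type=int, default=None)
    parser.add_argument('--temperature', type=float, default=0.0)
    parser.add_argument('--top-p', type=float, default=1.0)
    parser.add_argument('--data', nargs='+', required=True)
    parser.add_argument('--work-dir', default='./outputs')
    return parser.parse_args()

def main():
    args = parse_args()
    # Build a single API wrapper from CLI flags instead of a config.py entry.
    # GPT4V exists in the framework, but these keyword names are assumptions.
    from vlmeval.api import GPT4V
    model = GPT4V(
        model=args.model_name,
        api_base=args.base_url,
        key=args.api_key,
        max_tokens=args.max_token_out,
        temperature=args.temperature,
    )
    # ... then hand `model` to the regular inference/evaluation loop for each
    # dataset in args.data, writing results under args.work_dir.

if __name__ == '__main__':
    main()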
Also, the environment variables used by GPT4V get mixed together with the environment variables of the judge_model, which causes unnecessary errors and failures.
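One possible fix, sketched under the assumption that dedicated variable names are introduced (EVAL_* and JUDGE_* below are hypothetical, not existing VLMEvalKit variables), is to resolve the credentials of the model under test and of the judge separately, falling back to the shared OPENAI_API_KEY only when no dedicated variable is set:

import os

def resolve_credentials(role):
    """role is 'EVAL' for the model under test or 'JUDGE' for the judge model."""
    key = os.environ.get(f'{role}_API_KEY') or os.environ.get('OPENAI_API_KEY')
    base = os.environ.get(f'{role}_API_BASE') or os.environ.get('OPENAI_API_BASE')
    return key, base

eval_key, eval_base = resolve_credentials('EVAL')     # model being evaluated
judge_key, judge_base = resolve_credentials('JUDGE')  # judge / scoring model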
I think it would be helpful to add a --provider parameter to make things more flexible when switching between APIs. Something like this:
python run_api.py \
--model-name "vllm_qwen_2.5-7b" \
--provider "OpenAI/Qwen/Claude/etc." \
--api-key "xxxxx"
As a large-model developer, I mostly need to test the models I am developing, and I test them repeatedly; for released models, I would rely on official results or run them only once.
For OpenAI-style models, the differences between providers mainly lie in the model_name, base_url, and api_key, so a --provider option would mostly just select which environment variables to read. Of course, different models may require slightly different parameters: for example, OpenAI uses "detail" for image inputs, while Qwen uses parameters like min_pixels and max_pixels.
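A rough sketch of how --provider could work as nothing more than a table of defaults; the endpoint placeholders, environment-variable names, and pixel values below are illustrative assumptions:

import os

# Hypothetical provider registry: --provider only selects defaults for the
# base URL, the environment variable holding the key, and the image-parameter
# style; any explicit CLI flag still overrides the table.
PROVIDERS = {
    'OpenAI': {
        'base_url': 'https://api.openai.com/v1',
        'key_env': 'OPENAI_API_KEY',
        'image_params': {'detail': 'high'},        # OpenAI-style image control
    },
    'Qwen': {
        'base_url': '<dashscope-or-self-hosted-endpoint>',
        'key_env': 'DASHSCOPE_API_KEY',
        'image_params': {'min_pixels': 3136, 'max_pixels': 1003520},  # Qwen-style, illustrative values
    },
    'Claude': {
        'base_url': '<anthropic-compatible-endpoint>',
        'key_env': 'ANTHROPIC_API_KEY',
        'image_params': {},
    },
}

def resolve_provider(name, base_url=None, api_key=None):
    """Fill in CLI options the user did not set from the provider defaults."""
    cfg = PROVIDERS[name]
    return {
        'base_url': base_url or cfg['base_url'],
        'api_key': api_key or os.environ.get(cfg['key_env']),
        'image_params': cfg['image_params'],
    }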
Any progress on this? I am not a developer, but I am searching for a good model / quantization for home use. I wanted to use this project to evaluate models running with llama.cpp, which offers an OpenAI-compatible API. This feature would help me.