Provide faster evaluation for vlmevalkit
The current model evaluation speed is severely limited, especially in the multi-GPU vLLM evaluation path. The current vLLM implementation only launches the model in a tensor-parallel way, while the evaluation itself is essentially carried out serially (since requests can only be issued one at a time from Python). The multi-process data splitting under torchrun is not effective at all (samples are still processed one by one), which completely fails to exploit the strong concurrent serving performance of vLLM. This is a real limitation given the growing number of R1-paradigm reasoning models.
I think we should follow the practice used in MLLM RL training: provide a new evaluation mode (or reuse parameters compatible with the existing one) that deploys the model under evaluation as a standalone inference server and performs the entire evaluation through API calls. The concurrency of those API calls can then be controlled via the api_nproc parameter. This enables fast testing, and the approach is easily compatible with other inference servers such as SGLang.
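To make the idea concrete, here is a minimal sketch of what the API-based evaluation loop could look like. It assumes a vLLM OpenAI-compatible server is already running (e.g. `vllm serve Qwen/Qwen2.5-VL-7B-Instruct --tensor-parallel-size 4`); the endpoint URL, model name, and the `API_NPROC` value are placeholders, not part of the current codebase:

```python
# Minimal sketch: concurrent evaluation against an OpenAI-compatible vLLM server.
# API_URL, MODEL, and API_NPROC are placeholder values for illustration.
import base64
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # vLLM OpenAI-compatible endpoint
MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"                   # name the server registers the model under
API_NPROC = 16                                          # request concurrency, analogous to api_nproc


def encode_image(path: str) -> str:
    """Read a local image and encode it as a base64 data URL for the OpenAI-style API."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def query_one(sample: dict) -> str:
    """Send one evaluation sample (image + question) as a single chat-completion request."""
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": encode_image(sample["image_path"])}},
                {"type": "text", "text": sample["question"]},
            ],
        }],
        "temperature": 0.0,
        "max_tokens": 2048,
    }
    resp = requests.post(API_URL, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


def evaluate(samples: list[dict]) -> list[str]:
    """Fan requests out with a thread pool so the server can batch them,
    instead of looping over samples serially on the client side."""
    with ThreadPoolExecutor(max_workers=API_NPROC) as pool:
        return list(pool.map(query_one, samples))
```

Because the client only speaks the OpenAI chat-completion protocol, the same loop works unchanged against SGLang or any other OpenAI-compatible server.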
In fact, I have already implemented a simple version of this idea myself, supporting Qwen2.5-VL-7B and a reasoning model that we have not yet released. If the authors are interested, I can open a PR and try to merge it into the repository. However, fully supporting all current models is a sizable amount of work, since it requires refactoring a common interface for every model that processes its prompt and packages it as a standard OpenAI API request.
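The shape of that refactored interface might look something like the sketch below. The class and method names are hypothetical, and it assumes VLMEvalKit's interleaved prompt format (a list of dicts with `type` and `value` keys); the point is only that each model wrapper would translate its own prompt into standard OpenAI-style messages, so the runner never touches model-specific code:

```python
# Hypothetical interface sketch; class and method names are illustrative only.
import base64
from abc import ABC, abstractmethod


class APICompatibleModel(ABC):
    """Each model wrapper converts its own prompt format into a standard
    OpenAI chat-completion message list, so the evaluation runner only
    needs to speak the API."""

    @abstractmethod
    def build_messages(self, prompt: list[dict]) -> list[dict]:
        """Map an interleaved prompt (dicts with 'type' and 'value')
        to OpenAI-style 'messages'."""


class QwenVLAdapter(APICompatibleModel):
    """Example adapter, assuming 'value' holds text or a local image path."""

    def build_messages(self, prompt: list[dict]) -> list[dict]:
        content = []
        for item in prompt:
            if item["type"] == "text":
                content.append({"type": "text", "text": item["value"]})
            elif item["type"] == "image":
                with open(item["value"], "rb") as f:
                    url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()
                content.append({"type": "image_url", "image_url": {"url": url}})
        return [{"role": "user", "content": content}]
```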
I would also welcome anyone who wants to work on the refactoring together; the specific implementation plan may need further discussion.
Hi! I'm also trying to use vLLM with an API deployment to speed up evaluation. Would you mind sharing the code modifications you have made? That would be extremely helpful! Thank you!