verl
Startup times are slow
Starting a training run using veRL with vLLM takes quite a long time.
By my estimate, it takes 3.7 minutes to start up veRL, excluding the time to spin up Ray on multiple nodes.
Yeah, this is a known problem. What model size are you using?
This is with Qwen 2 7B
Today, with 4 nodes and Qwen 2.5 7B, I logged 6.7 minutes until step 1. @vermouth1992 Do you have any idea which process in the init is taking so long, and whether we can optimize it?
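One way to narrow this down might be to wrap each init phase in a small timer and compare the totals. The following is only a rough sketch: `timed` and `report` are helpers made up for illustration, and the phase names `load_weights` and `build_rollout_engine` are hypothetical placeholders, not veRL's actual functions. The `time.sleep` calls stand in for the real work.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per phase name.
timings = {}


@contextmanager
def timed(name, store=timings):
    """Add the wall-clock time spent inside the `with` block to store[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        store[name] = store.get(name, 0.0) + time.perf_counter() - start


def report(store=timings):
    """Return one line per phase, slowest first, with its share of the total."""
    total = sum(store.values())
    lines = []
    for name, secs in sorted(store.items(), key=lambda kv: kv[1], reverse=True):
        share = 100.0 * secs / total if total else 0.0
        lines.append(f"{name}: {secs:.2f}s ({share:.0f}%)")
    return lines


if __name__ == "__main__":
    # Hypothetical phase names; replace the sleeps with the real init calls.
    with timed("load_weights"):
        time.sleep(0.02)
    with timed("build_rollout_engine"):
        time.sleep(0.01)
    print("\n".join(report()))
```

This only measures time in the process where it runs, so work done inside remote Ray workers would need the same wrapper on the worker side.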
same
I still have the same problem with the GRPO example script: https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen2_5_vl-7b.sh