
Startup times are slow

Open casper-hansen opened this issue 9 months ago • 3 comments

Starting up a training run using veRL with vLLM takes a long time.

By my estimate, it takes 3.7 minutes to start up veRL, excluding the time to spin up Ray on multiple nodes.

casper-hansen avatar Feb 25 '25 14:02 casper-hansen

Yeah, this is a known problem. What model size are you using?

vermouth1992 avatar Feb 26 '25 01:02 vermouth1992

This is with Qwen 2 7B

casper-hansen avatar Feb 26 '25 07:02 casper-hansen

Today, with 4 nodes and Qwen 2.5 7B, I logged 6.7 minutes until step 1. @vermouth1992 Do you have any idea which part of the init is taking so long, and whether we can optimize it?
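
One way to narrow this down is to wrap each startup phase in a wall-clock timer and compare their shares. The following is only a minimal sketch: `timed_phase`, `report`, and the phase names are hypothetical helpers for illustration, not part of veRL, vLLM, or Ray, and the `time.sleep` calls stand in for whatever real initialization steps are being measured.

```python
import time
from contextlib import contextmanager

# Hypothetical helper (not a veRL API): accumulated wall-clock seconds per phase.
phase_times = {}


@contextmanager
def timed_phase(name):
    """Measure the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        phase_times[name] = phase_times.get(name, 0.0) + elapsed


def report(times):
    """Return one line per phase, slowest first, with its share of the total."""
    total = sum(times.values())
    lines = []
    for name, secs in sorted(times.items(), key=lambda kv: kv[1], reverse=True):
        share = 100.0 * secs / total if total else 0.0
        lines.append(f"{name}: {secs:.2f}s ({share:.0f}%)")
    return lines


if __name__ == "__main__":
    # Placeholder phases; replace the sleeps with the real init calls.
    with timed_phase("load_weights"):
        time.sleep(0.02)
    with timed_phase("init_rollout_engine"):
        time.sleep(0.01)
    print("\n".join(report(phase_times)))
```

Logging this on the driver process would at least show whether most of the time before step 1 goes to one phase or is spread across several.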

casper-hansen avatar Feb 27 '25 09:02 casper-hansen

same

dddraxxx avatar Apr 14 '25 09:04 dddraxxx

I still have the same problem with the GRPO example scripts: https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen2_5_vl-7b.sh

xk-huang avatar Jun 05 '25 10:06 xk-huang