About using vLLM for generation
I have some thoughts about using vLLM for generation. Feel free to correct me if I'm wrong.
- Batching
It seems that prompts are still passed to the vLLM engines in micro rollout batches during `make_experience`. However, passing all prompts to the vLLM engines at once is very likely to improve generation throughput, since vLLM's continuous-batching scheduler can then pack requests across the whole set itself; see the sketch below.
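For illustration, here is a minimal sketch of that change against vLLM's offline `LLM.generate` API. The model name, `all_prompts`, and the commented-out micro-batch loop are placeholders (OpenRLHF actually drives the engines through Ray actors, so the real change would live in `make_experience`):

```python
from vllm import LLM, SamplingParams

# Placeholder engine; in OpenRLHF the engine lives behind a Ray actor.
llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=2)
sampling_params = SamplingParams(temperature=1.0, max_tokens=1024)
all_prompts = ["prompt 0", "prompt 1", "prompt 2"]  # the full rollout set

# Instead of generating per micro rollout batch, roughly:
#   for micro_batch in chunks(all_prompts, micro_rollout_batch_size):
#       outputs.extend(llm.generate(micro_batch, sampling_params))
# submit every prompt in one call and let vLLM's continuous-batching
# scheduler keep the GPUs saturated:
outputs = llm.generate(all_prompts, sampling_params)
```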
- Placement
The device placement of the vLLM engines seems quite random. For example, this is what happens when running `examples/scripts/train_ppo_llama_ray_70b.sh`:

run 1, master node:
```
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 780546 C ray::CriticModelRayActor 2806MiB |
| 1 N/A N/A 780769 C ray::CriticModelRayActor 2970MiB |
| 2 N/A N/A 780770 C ray::CriticModelRayActor 2990MiB |
| 3 N/A N/A 780771 C ray::CriticModelRayActor 2798MiB |
| 4 N/A N/A 781017 C ray::RewardModelRayActor 2530MiB |
| 5 N/A N/A 781612 C ray::RewardModelRayActor 2526MiB |
| 6 N/A N/A 787426 C ray::RayWorkerVllm 74344MiB |
| 7 N/A N/A 787427 C ray::RayWorkerVllm 74264MiB |
+-----------------------------------------------------------------------------+
```
run 2, master node:
```
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 531162 C ray::ActorModelRayActor.fit 2822MiB |
| 1 N/A N/A 531384 C ray::ActorModelRayActor.fit 3014MiB |
| 2 N/A N/A 531385 C ray::ActorModelRayActor.fit 3014MiB |
| 3 N/A N/A 531386 C ray::ActorModelRayActor.fit 2822MiB |
| 4 N/A N/A 531387 C ray::CriticModelRayActor 2824MiB |
| 5 N/A N/A 532043 C ray::CriticModelRayActor 3016MiB |
| 6 N/A N/A 532044 C ray::CriticModelRayActor 3016MiB |
| 7 N/A N/A 532045 C ray::CriticModelRayActor 2824MiB |
+-----------------------------------------------------------------------------+
```
To reduce the communication overhead of broadcasting parameters from the actor models to the vLLM engines, the vLLM engines and the actor models should be placed as close to each other as possible (e.g. on the same node). The ideal device placement for this training task would be:
```
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 531162 C ray::ActorModelRayActor.fit 2822MiB |
| 1 N/A N/A 531384 C ray::ActorModelRayActor.fit 3014MiB |
| 2 N/A N/A 531385 C ray::ActorModelRayActor.fit 3014MiB |
| 3 N/A N/A 531386 C ray::ActorModelRayActor.fit 2822MiB |
| 4 N/A N/A 787426 C ray::RayWorkerVllm 74344MiB |
| 5 N/A N/A 787427 C ray::RayWorkerVllm 74264MiB |
| 6 N/A N/A 787426 C ray::RayWorkerVllm 74344MiB |
| 7 N/A N/A 787427 C ray::RayWorkerVllm 74264MiB |
+-----------------------------------------------------------------------------+
```
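One way to enforce that co-location could be a Ray placement group with a PACK strategy. The sketch below is a minimal illustration under assumptions, not OpenRLHF's actual API: `ActorModelRayActor` here is a stand-in for the real worker class, and the bundle sizes mirror this 4-GPU-actor / 4-GPU-vLLM example:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# One bundle for the actor models, one for the vLLM engines. PACK asks Ray
# to place the bundles on as few nodes as possible, so both groups end up
# on the same node whenever it has enough free GPUs.
pg = placement_group(bundles=[{"GPU": 4}, {"GPU": 4}], strategy="PACK")
ray.get(pg.ready())

@ray.remote(num_gpus=1)
class ActorModelRayActor:  # stand-in for OpenRLHF's actor worker class
    def ping(self):
        return "ok"

actor_workers = [
    ActorModelRayActor.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
            placement_group_bundle_index=0,  # actor-model bundle
        )
    ).remote()
    for _ in range(4)
]
# The vLLM engine workers would be scheduled the same way with
# placement_group_bundle_index=1, landing next to the actor models.
```

PACK is best-effort; using STRICT_PACK instead would make Ray refuse to schedule the group at all unless both bundles fit on one node, which may be preferable to silently spreading them apart.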