
[Question] How to train with rl_games using multiple GPUs on one machine?

Open Privilger opened this issue 3 months ago • 8 comments

Question

Hi,

The tutorial provides a script showing how to train an RL policy using the rl_games framework: source/standalone/workflows/rl_games/train.py

But the example only seems to work on a single GPU.

I saw that rl_games (https://github.com/Denys88/rl_games/tree/master) can use torchrun to leverage multiple GPUs, but how can I use torchrun here (in the Orbit framework)? I tried running the command torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py, but it does not seem to work.
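
As far as I understand, torchrun only launches one copy of the script per GPU and sets environment variables such as LOCAL_RANK, RANK, and WORLD_SIZE; the training script itself still has to bind each process to its own GPU. A minimal sketch of what I would expect each launched process to do (this is only my assumption of what is missing, not what train.py currently does):

```python
import os

import torch

# torchrun sets these environment variables for every process it launches
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

# each rank binds its simulation and training to its own GPU
device = f"cuda:{local_rank}"
torch.cuda.set_device(local_rank)

print(f"rank {local_rank} of {world_size} using {device}")
```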

Privilger avatar Mar 08 '24 08:03 Privilger

You have to use SLURM and design your code to leverage multiple GPUs; not all RL frameworks let you do it.

renanmb avatar Mar 08 '24 15:03 renanmb

I think this is their example of how to use Singularity containers. Docker containers use the OCI format, while Singularity uses SIF; they are different container standards: https://docs.sylabs.io/guides/3.5/user-guide/introduction.html

It is a little different and has its challenges; I hope this information helps you.

docker/cluster https://github.com/NVIDIA-Omniverse/orbit/tree/main/docker/cluster

renanmb avatar Mar 08 '24 15:03 renanmb

I think the methods you mentioned are about using a cluster and containers.

What about multiple GPUs in one workstation? torchrun works well with the rl_games framework on its own, but it does not run well within the Orbit framework. So I think this is more about configuration, i.e., setting things up so that Orbit works well with multiple GPUs.
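
From what I can tell, rl_games has a multi_gpu flag in the agent configuration, and on the Orbit side each rank's environment and agent would have to be pointed at that rank's GPU. A rough illustration of the kind of per-rank overrides I mean (the dictionary keys below only mirror a typical rl_games config layout; they are my assumption, not the actual Orbit configuration):

```python
import os

# rank assigned by torchrun
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
rank_device = f"cuda:{local_rank}"

# hypothetical per-rank overrides for an rl_games-style agent config
agent_cfg_overrides = {
    "params": {
        "config": {
            "multi_gpu": True,      # let rl_games average gradients across ranks
            "device": rank_device,  # training device for this rank
        }
    }
}

# hypothetical override for the simulation/environment device
env_device = rank_device
```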

Privilger avatar Mar 08 '24 15:03 Privilger

This is about computer architecture, OS architecture, and parallel computing architecture. Multi-GPU and even multi-core CPU setups are parallel computing regardless of how you choose to orchestrate them. Because of the particular way NVIDIA Omniverse Isaac Sim is designed and how Orbit builds on it, you should use SLURM to achieve maximum performance.

renanmb avatar Mar 09 '24 02:03 renanmb

For multiple GPUs across multiple machines, using containers and SLURM might be a solution.

However, for multiple GPUs in one machine, I do not believe SLURM is the way to do it.

It has nothing to do with the "multi-GPU and even multi-core CPU" point you mentioned.

torchrun should solve the problem in this case; I just have not figured out how yet.

This repo gives an example of using torchrun with Isaac Sim: https://github.com/NVIDIA-Omniverse/OmniIsaacGymEnvs. The generic torch.distributed setup that such a launch relies on is sketched below.
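
For context, a torchrun launch like that ultimately relies on the standard torch.distributed bootstrapping; a generic sketch (not specific to rl_games or Orbit) looks like this:

```python
import os

import torch
import torch.distributed as dist


def init_distributed() -> bool:
    """Initialize the default process group if launched via torchrun."""
    if int(os.environ.get("WORLD_SIZE", "1")) > 1:
        # torchrun also provides MASTER_ADDR, MASTER_PORT, and RANK
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        return True
    return False


if __name__ == "__main__":
    distributed = init_distributed()
    print("distributed:", distributed, "rank:", os.environ.get("RANK", "0"))
```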

Privilger avatar Mar 11 '24 09:03 Privilger

(screenshot) python source/standalone/workflows/rl_games/train.py --task Isaac-Reach-UR10-v0 --headless

Nimingez avatar Mar 11 '24 13:03 Nimingez

Try the task Isaac-Ant-v0.

Privilger avatar Mar 11 '24 14:03 Privilger

My task is the robot one; Ant is OK.

Nimingez avatar Mar 12 '24 00:03 Nimingez