use-ray-with-slurm
A brief tutorial on how to use Ray/RLlib/Tune in a Slurm cluster.
How to use Ray with a Slurm cluster?
PENG Zhenghao
December 5, 2020
Quick Start
git clone https://github.com/pengzhenghao/use-ray-with-slurm.git
cd use-ray-with-slurm
# Please make sure you have installed Ray first!
python launch.py --exp-name test --command "echo 1"
# or an RLlib task:
python launch.py --exp-name test --command "rllib train --run PPO --env CartPole-v0"
The above command launches a Ray cluster inside the Slurm cluster with 1 computing node.
Concretely, launch.py does the following things:
- It automatically writes your requirements, e.g. the number of CPUs and GPUs per node, the number of nodes, and so on, into an sbatch script named {exp-name}_{date}-{time}.sh. In the above example, it is test_1205-1132.sh. Your command (--command) to launch your own job is also written into the sbatch script.
- Then it submits the sbatch script to the Slurm manager via a new process.
- Finally, the Python process terminates itself and leaves a log file named {exp-name}_{date}-{time}.log that records the progress of your submitted command (you can follow it with the ordinary Slurm/shell commands shown below).
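After launch.py has exited, everything runs under Slurm. With the file names from the example above, you can check on the job with ordinary Slurm/shell commands (these are generic follow-up commands, not something launch.py runs for you):

squeue -u $USER                 # check that the submitted job is queued or running
tail -f test_1205-1132.log      # follow the progress of your command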
Specify the number of computing nodes
If you want to utilize multiple computing nodes in Slurm and let Ray recognize them, please use:
python launch.py --exp-name test --command "python your_file.py" --num-nodes 3
Specify computing nodes
If you want to specify the computing nodes, just use the same node names as returned by the sinfo command:
python launch.py --exp-name test --command "python your_file.py" --num-nodes 3 --node chpc-cn[003-005]
The list of all options
- --exp-name: The experiment name. Will generate {exp-name}_{date}-{time}.sh and {exp-name}_{date}-{time}.log.
- --command: The command you wish to run. For example: rllib train XXX or python XXX.py.
- --num-gpus: The number of GPUs you wish to use in each computing node. Default: 0.
- --node (-w): The specific nodes you wish to use, in the same form as the output of sinfo. Automatically assigned if not specified.
- --num-nodes (-n): The number of nodes you wish to use. Default: 1.
- --partition (-p): The partition you wish to use. Default: "chpc" (the CUHK cluster partition name; change it to yours!).
- --load-env: The command to set up your environment. For example: module load cuda/10.1. Default: "".
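For example, putting several of these options together (the values here are only illustrative):

python launch.py --exp-name test --command "python your_file.py" --num-nodes 2 --num-gpus 1 --partition chpc --load-env "module load cuda/10.1"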
The procedure
The sbatch script does the following things:
- It fetches the list of computing nodes and their IP addresses.
- It launches the head Ray process on one of the nodes and gets the address of the head node.
- It launches Ray processes on the (n-1) worker nodes and connects them to the head node by providing the head node's address.
- It submits the user-specified task to Ray.
Since all n nodes have launched their own Ray processes, and they are all connected to the head node's Ray process, the Ray cluster performs resource allocation just as it does on any other cluster.
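A condensed sketch of what such a generated sbatch script looks like (simplified for illustration; the job name, node count, and your_file.py are placeholders taken from the examples above, and the real script in the repository additionally handles IPv6 addresses and the --load-env setup):

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=3
#SBATCH --tasks-per-node=1

# 1. Fetch the list of computing nodes allocated to this job
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)

# 2. Launch the head ray process on the first node and record its address
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
ip_head=$head_node_ip:6379
srun --nodes=1 --ntasks=1 -w "$head_node" ray start --head --node-ip-address="$head_node_ip" --port=6379 --block &
sleep 10

# 3. Launch ray processes on the (n-1) worker nodes and point them to the head node
for ((i = 1; i < ${#nodes_array[@]}; i++)); do
    srun --nodes=1 --ntasks=1 -w "${nodes_array[$i]}" ray start --address "$ip_head" --block &
    sleep 5
done

# 4. Run the user-specified command; it attaches to the already-running ray cluster
python your_file.py

In your own script, you would typically call ray.init(address="auto") so that it attaches to this existing cluster instead of starting a new local one.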
Misc
- It works well with Ray 1.0.0; feel free to open an issue if you find it doesn't work.
- Feel free to copy the script to your own projects.
- This script is compatible with both IPv4 and IPv6 addresses of the computing nodes.
- This project is inspired by Yet Another Slurm Python Interface and Ray sbatch submission scripts used at NERSC.