Launching train/train.py directly without Slurm
Hi,
I am trying to launch the dinov2/train/train.py script directly, without the Slurm scheduler. I use the following command to launch the training:
export CUDA_VISIBLE_DEVICES=0,1 && python dinov2/train/train.py --config_file myconfig.yaml --output-dir my_outputdir
However, I can't seem to get it to work for training on multiple GPUs. I also tried using torchrun but haven't found the right combination of arguments.
I'm looking for a minimal example of launching train/train.py with multi-GPU FSDP training enabled, without going through run/train.py.
just use
export CUDA_VISIBLE_DEVICES=0,1
export PYTHONPATH=absolute/workspace/directory
python -m torch.distributed.launch --nproc_per_node=2 dinov2/train/train.py --config-file=myconfig.yaml --output-dir=my_outputdir --use-env
@usryokousha Thanks!
I managed to get it running without the --use-env flag:
export CUDA_VISIBLE_DEVICES=0,1 && python -m torch.distributed.launch --nproc_per_node=2 dinov2/train/train.py --config-file=myconfig.yaml --output-dir=my_outputdir
In fact, passing --use-env resulted in an error, as it is an unrecognised argument to the script; I guess one could add it to the argument parser.
By the way, I had to add the following to dinov2/train/train.py:
parser.add_argument("--local-rank", default=0, type=int, help="Variable for distributed computing.")
Multi-GPU training definitely works, but oddly it shows current_batch_size: 128.0000, which is my per-GPU batch size. I would have expected it to show 256 (= 128 * 2 GPUs).
It's just a logging issue: it displays the per-GPU batch size. Maybe we can give it a clearer name.
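For what it's worth, a hedged one-liner for logging the global value instead, assuming the helpers in dinov2.distributed expose the world size via get_global_size() (name taken from the current repo layout; adjust if it differs):

from dinov2 import distributed

batch_size_per_gpu = 128  # e.g. cfg.train.batch_size_per_gpu
# The global batch size is simply the per-GPU batch size times the number of processes.
global_batch_size = batch_size_per_gpu * distributed.get_global_size()
print(f"per-GPU batch size: {batch_size_per_gpu}, global batch size: {global_batch_size}")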
@patricklabatut and @usryokousha Any reason to use python -m torch.distributed.launch over torchrun? According to the PyTorch documentation, torchrun offers more fault tolerance, among other things.
Look into the code: in dinov2/distributed/__init__.py, in the method _set_from_azure_env, simply change
self.local_rank = int(os.environ["LOCAL_RANK"])
to
self.local_world_size = torch.cuda.device_count()
and then it works (a rough sketch is at the end of this comment).
Or as vladchimescu mentioned, add one argument:
parser.add_argument("--local-rank", default=0, type=int, help="Variable for distributed computing.")
and start your training with:
export CUDA_VISIBLE_DEVICES=xx,xx
Hope it helps :)
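To make the first workaround above concrete, here is a rough, hedged sketch; the method and attribute names follow the comment (they may differ from the upstream dinov2/distributed/__init__.py), and the point is simply to derive the local world size from the visible GPUs rather than from an environment variable that torch.distributed.launch does not set.

import os
import torch

class _TorchDistributedEnvironment:
    # Sketch of the environment-parsing helper described above; names mirror the
    # comment, not necessarily the upstream source.
    def _set_from_azure_env(self):
        self.rank = int(os.environ["RANK"])
        self.world_size = int(os.environ["WORLD_SIZE"])
        self.local_rank = int(os.environ.get("LOCAL_RANK", 0))
        # Workaround: LOCAL_WORLD_SIZE is not set by every launcher, so use the
        # number of GPUs visible on this node instead of reading the variable.
        self.local_world_size = torch.cuda.device_count()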
Both launchers work, but python -m torch.distributed.launch is (or will be) deprecated in favour of torchrun.
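For reference, the torchrun equivalent of the launch command used earlier in this thread should look something like the following (same config and output paths assumed); torchrun exports LOCAL_RANK, RANK and WORLD_SIZE as environment variables, so no --use-env flag is needed:
torchrun --nproc_per_node=2 dinov2/train/train.py --config-file=myconfig.yaml --output-dir=my_outputdir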
Hi, I tried to launch train/train.py directly without Slurm using your methods, but I found that training on multiple GPUs is much slower than training on just one GPU. It's so weird. Did you encounter the same problem?
Could you share a few more details about the changes needed for multi-GPU training? After trying both approaches, my code still only trains on a single GPU.
Maybe you need to set the sampler type from INFINITE to DISTRIBUTED
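For anyone wanting to try this, a hedged sketch of what switching the sampler could look like where the training data loader is built; it assumes the SamplerType enum and make_data_loader helper exported by dinov2.data (as in the current repo layout) and must be run under a launcher such as torchrun so the process group is initialized.

import torch
from torch.utils.data import TensorDataset
from dinov2.data import SamplerType, make_data_loader

# Dummy dataset purely for illustration; use your real training dataset here.
dataset = TensorDataset(torch.randn(1024, 3, 224, 224))

data_loader = make_data_loader(
    dataset=dataset,
    batch_size=128,  # per-GPU batch size, e.g. cfg.train.batch_size_per_gpu
    num_workers=4,
    shuffle=True,
    sampler_type=SamplerType.DISTRIBUTED,  # instead of the infinite sampler variants
    drop_last=True,
)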
I had the same issue for multi-node training, but not for multi-GPU within one node.
It depends on your inter-node connectivity.
I have InfiniBand for the inter-node connection, but I checked the whole training process and the InfiniBand is not really used. I wondered, if I don't have Slurm on the cluster, how can I enable distributed training at the same (or at least comparable) speed? Thank you.
If InfiniBand is not used, maybe there is a problem with the cluster configuration? Are you able to run nccl-tests, and does it give the performance that it should? https://github.com/NVIDIA/nccl-tests
I think you could also copy-paste the PyTorch distributed initialization functions from a setup that you are sure works on your cluster.
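If building nccl-tests is inconvenient, here is a much cruder, hedged sanity check in plain PyTorch (a sketch only, not a replacement for nccl-tests); it just confirms that NCCL all-reduce across all processes works and gives a rough throughput number. The file name nccl_check.py is hypothetical.

import os
import time
import torch
import torch.distributed as dist

# Launch with e.g.: torchrun --nnodes=2 --nproc_per_node=8 ... nccl_check.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.randn(64 * 1024 * 1024, device="cuda")  # ~256 MB of float32
dist.all_reduce(x)  # warm-up
torch.cuda.synchronize()

start = time.time()
iters = 10
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = time.time() - start

if dist.get_rank() == 0:
    gbytes = iters * x.numel() * 4 / 1e9
    print(f"all_reduce throughput: {gbytes / elapsed:.1f} GB/s (rough, not bus-bandwidth corrected)")
dist.destroy_process_group()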
It seems strange to me that it would be REALLY slow. It may not be a bad idea to use DISTRIBUTED instead of INFINITE, due to some slowdown per process with INFINITE; I wouldn't expect a major difference though. For a single node, you could just launch through SLURM, however.
Thank you! I have solved the issue; the cluster was indeed not set up properly.
I was able to run train.py directly without SLURM thanks to this thread. However, I am now faced with the challenge of trying to use 2 GPUs to train the model on my dataset. When running the training, my reports show that the 2nd GPU isn't being used at all, or only very little. My question: is there any other change I need to make to the training script to ensure it uses more than 1 GPU? Thanks
Same issue here @adipill04. I'm using torchrun:
torchrun --nproc_per_node=2 dinov2/train/train.py --config-file=<PATH_TO_YAML> --output-dir=<PATH_TO_OUTPUT>
It seems to only train on a single GPU. Did you find a solution for this?