Launching train/train.py directly without Slurm
Hi,
I am trying to launch the dinov2/train/train.py script directly, without the Slurm scheduler. I use the following command to launch the training:
export CUDA_VISIBLE_DEVICES=0,1 && python dinov2/train/train.py --config_file myconfig.yaml --output-dir my_outputdir
However, I can't seem to get it to work for training on multiple GPUs. I also tried using torchrun but haven't found the right combination of arguments.
I'm looking for a minimal example of launching train/train.py with multi-GPU FSDP training enabled, without going through run/train.py.
just use
export CUDA_VISIBLE_DEVICES=0,1
export PYTHONPATH=absolute/workspace/directory
python -m torch.distributed.launch --nproc_per_node=2 dinov2/train/train.py --config-file=myconfig.yaml --output-dir=my_outputdir --use-env
@usryokousha Thanks!
I managed to get it running without the --use-env flag:
export CUDA_VISIBLE_DEVICES=0,1 && python -m torch.distributed.launch --nproc_per_node=2 dinov2/train/train.py --config-file=myconfig.yaml --output-dir=my_outputdir
In fact, passing --use-env resulted in an error, as it is an unrecognised argument to the script; I guess one could add it to the argument parser.
By the way, I had to add the following to dinov2/train/train.py:
parser.add_argument("--local-rank", default=0, type=int, help="Variable for distributed computing.")
Multi-GPU training definitely works, but oddly it shows current_batch_size: 128.0000, which is my per-GPU batch size. I would have expected it to show 256 (= 128 * 2 GPUs).
It's just a logging issue: it displays the per-GPU batch size. Maybe we can give it a clearer name.
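For what it's worth, a hedged one-liner for logging the global value instead, assuming the helpers in dinov2.distributed expose the world size via get_global_size() (name taken from the current repo layout; adjust if it differs):

from dinov2 import distributed

batch_size_per_gpu = 128  # e.g. cfg.train.batch_size_per_gpu
# The global batch size is simply the per-GPU batch size times the number of processes.
global_batch_size = batch_size_per_gpu * distributed.get_global_size()
print(f"per-GPU batch size: {batch_size_per_gpu}, global batch size: {global_batch_size}")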
@patricklabatut and @usryokousha Any reason to use python -m torch.distributed.launch over torchrun? According to the PyTorch documentation, torchrun offers more fault tolerance, among other things.
Look into the code: in dinov2/distributed/__init__.py, in the method _set_from_azure_env, simply change
self.local_rank = int(os.environ["LOCAL_RANK"])
to
self.local_world_size = torch.cuda.device_count()
and then it works (a rough sketch is at the end of this comment).
Or as vladchimescu mentioned, add one argument:
parser.add_argument("--local-rank", default=0, type=int, help="Variable for distributed computing.")
and start your training with:
export CUDA_VISIBLE_DEVICES=xx,xx
Hope it helps :)
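To make the first workaround above concrete, here is a rough, hedged sketch; the method and attribute names follow the comment (they may differ from the upstream dinov2/distributed/__init__.py), and the point is simply to derive the local world size from the visible GPUs rather than from an environment variable that torch.distributed.launch does not set.

import os
import torch

class _TorchDistributedEnvironment:
    # Sketch of the environment-parsing helper described above; names mirror the
    # comment, not necessarily the upstream source.
    def _set_from_azure_env(self):
        self.rank = int(os.environ["RANK"])
        self.world_size = int(os.environ["WORLD_SIZE"])
        self.local_rank = int(os.environ.get("LOCAL_RANK", 0))
        # Workaround: LOCAL_WORLD_SIZE is not set by every launcher, so use the
        # number of GPUs visible on this node instead of reading the variable.
        self.local_world_size = torch.cuda.device_count()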
Both launchers work, but python -m torch.distributed.launch is (or will be) deprecated in favour of torchrun.
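For reference, the torchrun equivalent of the launch command used earlier in this thread should look something like the following (same config and output paths assumed); torchrun exports LOCAL_RANK, RANK and WORLD_SIZE as environment variables, so no --use-env flag is needed:
torchrun --nproc_per_node=2 dinov2/train/train.py --config-file=myconfig.yaml --output-dir=my_outputdir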
Hi, I tried to launch train/train.py directly without Slurm using your methods, but I found that training on multiple GPUs is much slower than training on just one GPU. It's so weird. Did you encounter the same problem?
Could you share a few more details about the changes needed for multi-GPU training? After trying both approaches, my code still only trains on a single GPU.
Maybe you need to set the sampler type from INFINITE to DISTRIBUTED
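For anyone wanting to try this, a hedged sketch of what switching the sampler could look like where the training data loader is built; it assumes the SamplerType enum and make_data_loader helper exported by dinov2.data (as in the current repo layout) and must be run under a launcher such as torchrun so the process group is initialized.

import torch
from torch.utils.data import TensorDataset
from dinov2.data import SamplerType, make_data_loader

# Dummy dataset purely for illustration; use your real training dataset here.
dataset = TensorDataset(torch.randn(1024, 3, 224, 224))

data_loader = make_data_loader(
    dataset=dataset,
    batch_size=128,  # per-GPU batch size, e.g. cfg.train.batch_size_per_gpu
    num_workers=4,
    shuffle=True,
    sampler_type=SamplerType.DISTRIBUTED,  # instead of the infinite sampler variants
    drop_last=True,
)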
I had the same issue for multi-node training, but not for multi-GPU within one node.
It depends on your inter-node connectivity.
I have InfiniBand for the inter-node connection, but I checked the whole training process and the InfiniBand is not really used. I wondered, if I don't have Slurm on the cluster, how can I enable distributed training at the same (or at least comparable) speed? Thank you.
If InfiniBand is not used, maybe there is a problem with the cluster configuration? Are you able to run nccl-tests, and does it give the performance that it should? https://github.com/NVIDIA/nccl-tests
I think you could also copy-paste the PyTorch distributed initialization functions from a setup that you are sure works on your cluster.
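If building nccl-tests is inconvenient, here is a much cruder, hedged sanity check in plain PyTorch (a sketch only, not a replacement for nccl-tests); it just confirms that NCCL all-reduce across all processes works and gives a rough throughput number. The file name nccl_check.py is hypothetical.

import os
import time
import torch
import torch.distributed as dist

# Launch with e.g.: torchrun --nnodes=2 --nproc_per_node=8 ... nccl_check.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.randn(64 * 1024 * 1024, device="cuda")  # ~256 MB of float32
dist.all_reduce(x)  # warm-up
torch.cuda.synchronize()

start = time.time()
iters = 10
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = time.time() - start

if dist.get_rank() == 0:
    gbytes = iters * x.numel() * 4 / 1e9
    print(f"all_reduce throughput: {gbytes / elapsed:.1f} GB/s (rough, not bus-bandwidth corrected)")
dist.destroy_process_group()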
It seems strange to me that it would be REALLY slow. It may not be a bad idea to use DISTRIBUTED instead of INFINITE, due to some slowdown per process with INFINITE; I wouldn't expect a major difference though. For a single node, you could just launch through SLURM, however.
Thank you! I have solved the issue; the cluster was indeed not set up properly.
I was able to run train.py directly without SLURM thanks to this thread. However, I am now faced with the challenge of trying to use 2 GPUs to train the model on my dataset. When running the training, my reports show that the 2nd GPU isn't being used at all, or only very little. My question: is there any other change I need to make to the training script to ensure it uses more than 1 GPU? Thanks
Same issue here @adipill04. I'm using torchrun:
torchrun --nproc_per_node=2 dinov2/train/train.py --config-file=<PATH_TO_YAML> --output-dir=<PATH_TO_OUTPUT>
It seems to only train on a single GPU. Did you find a solution for this?