GravityZL

6 comments by GravityZL

Look into the code in dinov2/distributed/__init__.py: simply change `self.local_rank = int(os.environ["LOCAL_RANK"])` in the method `def _set_from_azure_env(self):` to `self.local_world_size = torch.cuda.device_count()`, and it works. Or, as [vladchimescu](https://github.com/vladchimescu) mentioned, add one argument: `parser.add_argument("--local-rank",...
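For reference, a minimal sketch of that argparse workaround (the exact flag spelling is an assumption: `torch.distributed.launch` injects `--local_rank` on older PyTorch versions and `--local-rank` on newer ones, so both are registered here):

```python
# Hypothetical sketch: accept the flag that torch.distributed.launch
# injects into each worker's argv, so argparse does not crash on an
# unknown argument. Both spellings map to the same destination.
import argparse

parser = argparse.ArgumentParser(description="DINOv2 training")
parser.add_argument("--local-rank", "--local_rank", dest="local_rank",
                    type=int, default=0,
                    help="local rank injected by torch.distributed.launch")
args = parser.parse_args()
```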

> @patricklabatut and @usryokousha Any reason to use `python -m torch.distributed.launch` over `torchrun`? At least according to the [pytorch documentation](https://pytorch.org/docs/stable/elastic/run.html), torchrun offers more fault tolerance, etc. Both work, but `python...
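For context, both launchers hand the same rendezvous information to each worker via environment variables, so rank-discovery code written against those variables runs under either one (a minimal sketch; note that `torch.distributed.launch` only exports `LOCAL_RANK` when run with `--use_env` on older PyTorch versions):

```python
# Minimal sketch of launcher-agnostic rank discovery: torchrun (and
# torch.distributed.launch with --use_env) export these variables to
# every worker process before the training script starts.
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
print(f"rank {rank}/{world_size} on local GPU {local_rank}")
dist.destroy_process_group()
```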

> Could you share a few more details of the multi-GPU training modifications? After trying both approaches, my code still trains on only a single GPU. > > > Hi, I tried using your method to launch train/train.py directly instead of using Slurm, but I found that multi-GPU training is much slower than training with just one GPU. Very strange. Did you run into the same problem?

> Hi, I tried to launch train/train.py directly without Slurm using your method, but I found that multi-GPU training is much slower than using just one...

> It depends on your inter-node connectivity

I have InfiniBand for the inter-node connection, but I checked the whole training process and InfiniBand is not really used. I wondered if...
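One quick way to see which transport NCCL actually picked is its debug log (a sketch, not DINOv2-specific; NCCL reports `NET/IB` when InfiniBand is selected and `NET/Socket` when it falls back to plain TCP):

```python
# Hypothetical sketch: enable NCCL's debug output to see whether it
# selected InfiniBand ("NET/IB") or fell back to TCP ("NET/Socket").
# The variables must be set before the first collective initializes NCCL.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")
os.environ.setdefault("NCCL_IB_DISABLE", "0")  # make sure IB is not disabled

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # first collective; NCCL logs its transport choice here
dist.destroy_process_group()
```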

> If InfiniBand is not used, maybe there is a problem with the cluster configuration? Are you able to run nccl-tests, and does it give the perf that it...
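If building nccl-tests is inconvenient, a rough all-reduce throughput check can also be done directly in PyTorch (a minimal sketch meant to be launched with `torchrun --nproc_per_node=<gpus>`; the number it prints is only indicative, not a substitute for `all_reduce_perf`):

```python
# Minimal all-reduce throughput check, a rough stand-in for nccl-tests'
# all_reduce_perf. Launch with: torchrun --nproc_per_node=<gpus> bench.py
import os
import time

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

x = torch.ones(64 * 1024 * 1024, device="cuda")  # 64M floats = 256 MB
for _ in range(5):                                # warm-up iterations
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()

if dist.get_rank() == 0:
    gb = x.numel() * x.element_size() * iters / 1e9
    print(f"~{gb / (time.time() - t0):.1f} GB/s algorithm throughput")
dist.destroy_process_group()
```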