GravityZL

6 comments by GravityZL

Look into the code in dinov2/distributed/__init__.py: simply change `self.local_rank = int(os.environ["LOCAL_RANK"])` in the method `def _set_from_azure_env(self):` to `self.local_world_size = torch.cuda.device_count()`, and it works. Or, as [vladchimescu](https://github.com/vladchimescu) mentioned, add one argument: `parser.add_argument("--local-rank",...
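For reference, a minimal sketch of that argparse workaround (the exact flag spelling is an assumption: `torch.distributed.launch` injects `--local_rank` on older PyTorch versions and `--local-rank` on newer ones, so both are registered here):

```python
# Hypothetical sketch: accept the flag that torch.distributed.launch
# injects into each worker's argv, so argparse does not crash on an
# unknown argument. Both spellings map to the same destination.
import argparse

parser = argparse.ArgumentParser(description="DINOv2 training")
parser.add_argument("--local-rank", "--local_rank", dest="local_rank",
                    type=int, default=0,
                    help="local rank injected by torch.distributed.launch")
args = parser.parse_args()
```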

> @patricklabatut and @usryokousha Any reason to use `python -m torch.distributed.launch` over `torchrun`? At least according to the [pytorch documentation](https://pytorch.org/docs/stable/elastic/run.html), torchrun offers more fault tolerance, etc. Both work, but `python...
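For context, both launchers hand the same rendezvous information to each worker via environment variables, so rank-discovery code written against those variables runs under either one (a minimal sketch; note that `torch.distributed.launch` only exports `LOCAL_RANK` when run with `--use_env` on older PyTorch versions):

```python
# Minimal sketch of launcher-agnostic rank discovery: torchrun (and
# torch.distributed.launch with --use_env) export these variables to
# every worker process before the training script starts.
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
print(f"rank {rank}/{world_size} on local GPU {local_rank}")
dist.destroy_process_group()
```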

> Could you share a few more details of the multi-GPU training modifications? After trying both approaches, my code still trains on only a single GPU. > > > Hi, I tried using your method to launch train/train.py directly instead of using Slurm, but I found that multi-GPU training is much slower than training with just one GPU. Very strange. Did you run into the same problem?

> Hi, I tried to launch train/train.py directly without Slurm using your method, but I found that multi-GPU training is much slower than using just one...

> It depends on your inter-node connectivity

I have InfiniBand for the inter-node connection, but I checked the whole training process and InfiniBand is not really used. I wondered if...
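One quick way to see which transport NCCL actually picked is its debug log (a sketch, not DINOv2-specific; NCCL reports `NET/IB` when InfiniBand is selected and `NET/Socket` when it falls back to plain TCP):

```python
# Hypothetical sketch: enable NCCL's debug output to see whether it
# selected InfiniBand ("NET/IB") or fell back to TCP ("NET/Socket").
# The variables must be set before the first collective initializes NCCL.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")
os.environ.setdefault("NCCL_IB_DISABLE", "0")  # make sure IB is not disabled

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # first collective; NCCL logs its transport choice here
dist.destroy_process_group()
```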

> If InfiniBand is not used, maybe there is a problem with the cluster configuration? Are you able to run nccl-tests, and does it give the perf that it...
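If building nccl-tests is inconvenient, a rough all-reduce throughput check can also be done directly in PyTorch (a minimal sketch meant to be launched with `torchrun --nproc_per_node=<gpus>`; the number it prints is only indicative, not a substitute for `all_reduce_perf`):

```python
# Minimal all-reduce throughput check, a rough stand-in for nccl-tests'
# all_reduce_perf. Launch with: torchrun --nproc_per_node=<gpus> bench.py
import os
import time

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

x = torch.ones(64 * 1024 * 1024, device="cuda")  # 64M floats = 256 MB
for _ in range(5):                                # warm-up iterations
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()

if dist.get_rank() == 0:
    gb = x.numel() * x.element_size() * iters / 1e9
    print(f"~{gb / (time.time() - t0):.1f} GB/s algorithm throughput")
dist.destroy_process_group()
```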