pytorch-distributed

A quickstart and benchmark for pytorch distributed training.

14 pytorch-distributed issues

Is it the case that the NCCL backend cannot be used on Windows? If so, how can multi-GPU training be done on Windows? Thanks!
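For reference, DDP can still be used on Windows through the gloo backend instead of NCCL. A minimal sketch, assuming a single Windows machine with multiple GPUs; the TCP address/port below is a placeholder, not a value from this repo:

```
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    dist.init_process_group(
        backend="gloo",                       # NCCL is Linux-only; gloo is available on Windows
        init_method="tcp://127.0.0.1:23456",  # placeholder local address/port
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(10, 10).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])
    # ... training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```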

Bumps [horovod](https://github.com/horovod/horovod) from 0.18.2 to 0.24.0. Release notes (sourced from horovod's releases): elastic mode improvements, MXNet async dependency engine, fixes for the latest PyTorch and TensorFlow versions. Added Ray: Added elastic...

dependencies

Running your script as-is, it keeps erroring out and I can't find the cause.
```
root@pai-worker1:/home/Data/exports/pytorch-distributed# srun -N1 -n2 --gres gpu:2 python distributed_slurm_main.py --dist-file dist_file
Traceback (most recent call last):
  File "distributed_slurm_main.py", line 420, in <module>
    main()
  File "distributed_slurm_main.py", line 131, in main
    mp.spawn(main_worker,...
```

I want to modify it so that it can run on multiple machines, but I don't know how to do it.
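As a starting point, here is a hedged sketch of the usual multi-machine setup with the `torch.distributed.launch` helper; the script name `train.py`, the node count, and `NODE0_IP` are placeholders rather than this repo's actual values:

```
# Each node runs the same command with its own --node_rank, and all nodes
# point at the rank-0 node via --master_addr/--master_port:
#
#   node 0:  python -m torch.distributed.launch --nproc_per_node=4 \
#              --nnodes=2 --node_rank=0 --master_addr=NODE0_IP --master_port=23456 train.py
#   node 1:  python -m torch.distributed.launch --nproc_per_node=4 \
#              --nnodes=2 --node_rank=1 --master_addr=NODE0_IP --master_port=23456 train.py
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

dist.init_process_group(backend="nccl", init_method="env://")  # reads the env set by the launcher
torch.cuda.set_device(args.local_rank)

model = torch.nn.Linear(10, 10).cuda(args.local_rank)
ddp_model = DDP(model, device_ids=[args.local_rank])
# ... build a DistributedSampler-backed DataLoader and train as in single-node DDP ...
```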

When I run distributed.py, GPU memory usage is unbalanced: the main GPU uses 10GB while the other three GPUs use 8GB each. How can this be fixed?
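One common cause of this pattern (an assumption, not necessarily what happens in this repo) is every process touching `cuda:0`, for example through `torch.load` defaulting there or tensors being created before the device is pinned. A hedged sketch of the usual fix; `local_rank` and the checkpoint path are placeholders:

```
import os
import torch

local_rank = 0  # placeholder; real code gets this from the launcher / environment
torch.cuda.set_device(local_rank)        # pin this process to its own GPU early
device = torch.device("cuda", local_rank)

ckpt_path = "checkpoint.pth"             # hypothetical path
if os.path.exists(ckpt_path):
    # map_location keeps the load off cuda:0, a common source of the extra memory on the main GPU
    state = torch.load(ckpt_path, map_location=device)
```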

Hi, why does the code call loss.backward() rather than reduce_loss.backward() when computing the gradients?
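For context on what this issue asks: in DistributedDataParallel the gradient averaging happens inside `backward()` itself, so each process backpropagates its local `loss`; the all-reduced loss is typically computed only for logging and is detached from the graph. A hedged sketch of that pattern (not necessarily this repo's exact code):

```
import torch.distributed as dist

def reduce_mean(tensor, world_size):
    # Average a scalar across all processes, purely for reporting/metrics.
    rt = tensor.clone().detach()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)
    rt /= world_size
    return rt

# inside the training loop:
# loss = criterion(output, target)
# reduced_loss = reduce_mean(loss, dist.get_world_size())  # for printing only
# optimizer.zero_grad()
# loss.backward()        # DDP all-reduces the gradients during this call
# optimizer.step()
```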

For example, I am training on an 8-GPU node and want to use only 4 of the GPUs. Training works if I use GPUs 0,1,2,3, but any other combination of GPU IDs fails. I changed the GPU ID for each process following https://github.com/PyTorchLightning/pytorch-lightning/issues/2407, but I get `RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59`. My code:
```
import torch
import torch.nn as nn
import torch.distributed as...
```
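A common workaround (an assumption about the cause, not a confirmed fix for this report) is to restrict `CUDA_VISIBLE_DEVICES` before any CUDA call, so the chosen physical GPUs are renumbered from 0 and rank-based device IDs stay valid. A sketch, using GPUs 4-7 as an example:

```
import os
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "4,5,6,7")  # must be set before torch touches CUDA

import torch

local_rank = 0  # placeholder; comes from the launcher in real code
torch.cuda.set_device(local_rank)  # device IDs 0..3 now map onto physical GPUs 4..7
```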

I implemented a distributed training run following your approach and found that single-machine single-GPU training and multi-machine multi-GPU training take about the same time to finish the same number of epochs, hence this question.
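One thing worth checking in such a comparison (an assumption, not a diagnosis of this specific report): without a `DistributedSampler`, every process iterates the full dataset, so an epoch costs roughly single-GPU wall time. A minimal sketch with placeholder data and rank values:

```
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

world_size, rank = 2, 0   # placeholders; real values come from torch.distributed
dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)  # splits data across processes
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(10):
    sampler.set_epoch(epoch)   # reshuffle differently each epoch
    for x, y in loader:
        pass                   # each process now sees only 1/world_size of the batches
```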