
CUDA error running on multiple GPUs: torch.cuda.nccl.NcclError: System Error (2)

Open · adrianalbert opened this issue 7 years ago · 0 comments

Hi,

I've been trying to run the example code (on the maps dataset):

```
python main.py --dataset=maps --num_gpu=4
```

I get the NCCL-related error below. I'm running this on 4 K80 GPUs.

Any suggestions on what could be causing this and what a solution could be?
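In case it helps narrow things down, here is a minimal sketch (not code from this repo; it assumes a PyTorch build with NCCL support, and the tensor size is arbitrary) that exercises the same `torch.cuda.nccl.broadcast` call the traceback below ends in, outside of DiscoGAN:

```python
# Minimal NCCL smoke test (a sketch, not from DiscoGAN-pytorch): calls
# torch.cuda.nccl.broadcast directly, which is the same collective the
# DataParallel replicate path uses under the hood.
import torch
import torch.cuda.nccl as nccl

num_gpu = 4  # same device count as the failing run
# One identically-sized tensor per GPU; broadcast copies GPU 0's data to the rest.
tensors = [torch.zeros(16).cuda(i) for i in range(num_gpu)]
tensors[0].fill_(1)
nccl.broadcast(tensors)  # raises NcclError here if ncclCommInitAll fails
print([float(t[0]) for t in tensors])  # expect [1.0, 1.0, 1.0, 1.0]
```

If this snippet raises the same `NcclError`, the problem would be the NCCL installation or GPU topology on the machine rather than anything in `main.py`. Output from the actual training run follows.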

```
pix2pix processing: 100%|#######################| 1096/1096 [00:00<00:00, 178591.97it/s]
pix2pix processing: 100%|#######################| 1096/1096 [00:00<00:00, 213732.43it/s]
[*] MODEL dir: logs/maps_2017-10-26_20-36-34
[*] PARAM path: logs/maps_2017-10-26_20-36-34/params.json
  0%|          | 0/500000 [00:00<?, ?it/s]
```

```
Traceback (most recent call last):
  File "main.py", line 41, in <module>
    main(config)
  File "main.py", line 33, in main
    trainer.train()
  File "/home/nbserver/DiscoGAN-pytorch/trainer.py", line 193, in train
    x_AB = self.G_AB(x_A).detach()
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 59, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 64, in replicate
    return replicate(module, device_ids)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast(devices)(*params)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/_functions.py", line 19, in forward
    outputs = comm.broadcast_coalesced(inputs, self.target_gpus)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/comm.py", line 54, in broadcast_coalesced
    results = broadcast(_flatten_tensors(chunk), devices)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/comm.py", line 24, in broadcast
    nccl.broadcast(tensors)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 182, in broadcast
    comm = communicator(inputs)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 133, in communicator
    _communicators[key] = NcclCommList(devices)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 106, in __init__
    check_error(lib.ncclCommInitAll(self, len(devices), int_array(devices)))
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 118, in check_error
    raise NcclError(status)
torch.cuda.nccl.NcclError: System Error (2)
```
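Note that the failure happens inside `ncclCommInitAll`, i.e. while creating the NCCL communicator, before any data is exchanged. NCCL's own diagnostics may help here: rerunning with the standard NCCL environment variable set, e.g. `NCCL_DEBUG=INFO python main.py --dataset=maps --num_gpu=4`, prints details about communicator initialization. As a sanity check, running with `--num_gpu=1` should avoid the multi-GPU broadcast path entirely and at least confirm the model itself trains.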

adrianalbert · Oct 26 '17 20:10