
How did this error happen when I ran the ResNet example?

lengien opened this issue 2 years ago · 0 comments

```
root@6496cf66be1e:/workspace/PiPPy/examples/resnet# python pippy_resnet.py -s=1F1B
[PiPPy] World size: 5, DP group size: 1, PP group size: 5
rank = 4 host/pid/device = 6496cf66be1e/2823/cuda:4
[W socket.cpp:601] [c10d] The client socket has failed to connect to [localhost]:29500 (errno: 99 - Cannot assign requested address).
rank = 1 host/pid/device = 6496cf66be1e/2820/cuda:1
[W socket.cpp:601] [c10d] The client socket has failed to connect to [localhost]:29500 (errno: 99 - Cannot assign requested address).
rank = 0 host/pid/device = 6496cf66be1e/2819/cuda:0
rank = 2 host/pid/device = 6496cf66be1e/2821/cuda:2
rank = 3 host/pid/device = 6496cf66be1e/2822/cuda:3
REPLICATE config: 0 -> MultiUseParameterConfig.TRANSMIT
Using schedule: 1F1B
Using device: cuda:0
Files already downloaded and verified
Epoch: 1
100%|██████████| 1250/1250 [02:38<00:00, 7.89it/s]
Loader: train. Accuracy: 0.3269
Epoch: 2
 65%|██████▉   | 815/1250 [01:39<00:51, 8.43it/s]
[W CUDAGuardImpl.h:124] Warning: CUDA warning: device-side assert triggered (function destroyEvent)
[W CUDAGuardImpl.h:124] Warning: CUDA warning: device-side assert triggered (function destroyEvent)
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
[W tensorpipe_agent.cpp:678] RPC agent for worker4 encountered error when sending response to request #97107 to worker3: device-side assert triggered (this error originated at tensorpipe/common/cuda_loop.cc:117)
[W tensorpipe_agent.cpp:725] RPC agent for worker2 encountered error when reading incoming request from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker1 encountered error when reading incoming request from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker3 encountered error when reading incoming request from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker0 encountered error when reading incoming request from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker1 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker1 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker1 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker1 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker1 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker1 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker3 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker3 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker3 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker3 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker1 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker2 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker3 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker3 encountered error when reading incoming request from worker1: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker2 encountered error when reading incoming request from worker1: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker3 encountered error when reading incoming request from worker2: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
Traceback (most recent call last):
  File "/workspace/PiPPy/examples/resnet/pippy_resnet.py", line 163, in <module>
    run_pippy(run_master, args)
  File "/workspace/PiPPy/pippy/utils.py", line 156, in run_pippy
    mp.spawn(
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 4 terminated with signal SIGABRT
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
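For context, the root failure in the log is the kernel assertion `t >= 0 && t < n_classes` from the NLL-loss backward pass: it fires when a target label fed to the loss falls outside the model's output range, and the later TensorPipe/RPC errors and the SIGABRT are just the fallout from that device-side assert. A minimal sketch of the invariant the kernel enforces, using placeholder label tensors rather than the example's real dataloader:

```python
import torch

def check_targets(targets: torch.Tensor, n_classes: int) -> bool:
    """True iff every label satisfies 0 <= t < n_classes,
    the same condition asserted by nll_loss's CUDA kernel."""
    return bool(((targets >= 0) & (targets < n_classes)).all())

n_classes = 10  # CIFAR-10, as in the ResNet example

print(check_targets(torch.tensor([3, 7, 9, 0]), n_classes))  # → True
print(check_targets(torch.tensor([3, 10, 0]), n_classes))    # → False (10 is out of range)
```

Running such a check over the dataset (or rerunning with `CUDA_LAUNCH_BLOCKING=1` to get a synchronous stack trace) is a common way to localize this class of device-side assert.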

lengien · Jun 07 '23 12:06