PiPPy
How did this error happen when I ran the ResNet example?
root@6496cf66be1e:/workspace/PiPPy/examples/resnet# python pippy_resnet.py -s=1F1B
[PiPPy] World size: 5, DP group size: 1, PP group size: 5
rank = 4 host/pid/device = 6496cf66be1e/2823/cuda:4
[W socket.cpp:601] [c10d] The client socket has failed to connect to [localhost]:29500 (errno: 99 - Cannot assign requested address).
rank = 1 host/pid/device = 6496cf66be1e/2820/cuda:1
[W socket.cpp:601] [c10d] The client socket has failed to connect to [localhost]:29500 (errno: 99 - Cannot assign requested address).
rank = 0 host/pid/device = 6496cf66be1e/2819/cuda:0
rank = 2 host/pid/device = 6496cf66be1e/2821/cuda:2
rank = 3 host/pid/device = 6496cf66be1e/2822/cuda:3
REPLICATE config: 0 -> MultiUseParameterConfig.TRANSMIT
Using schedule: 1F1B
Using device: cuda:0
Files already downloaded and verified
Epoch: 1
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1250/1250 [02:38<00:00, 7.89it/s]
Loader: train. Accuracy: 0.3269
Epoch: 2
65%|████████████████████████████████████████████████████████████████████████████▉ | 815/1250 [01:39<00:51, 8.43it/s][W CUDAGuardImpl.h:124] Warning: CUDA warning: device-side assert triggered (function destroyEvent)
[W CUDAGuardImpl.h:124] Warning: CUDA warning: device-side assert triggered (function destroyEvent)
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [4,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [5,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [8,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1678411187366/work/aten/src/ATen/native/cuda/Loss.cu:460: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [9,0,0] Assertion t >= 0 && t < n_classes failed.
[W tensorpipe_agent.cpp:678] RPC agent for worker4 encountered error when sending response to request #97107 to worker3: device-side assert triggered (this error originated at tensorpipe/common/cuda_loop.cc:117)
[W tensorpipe_agent.cpp:725] RPC agent for worker2 encountered error when reading incoming request from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker1 encountered error when reading incoming request from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker3 encountered error when reading incoming request from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker0 encountered error when reading incoming request from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker1 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker1 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker1 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker1 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker1 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker1 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker3 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker3 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker3 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:939] RPC agent for worker3 encountered error when reading incoming response from worker4: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker1 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker2 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker3 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker3 encountered error when reading incoming request from worker1: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker2 encountered error when reading incoming request from worker1: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:725] RPC agent for worker3 encountered error when reading incoming request from worker2: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
Traceback (most recent call last):
File "/workspace/PiPPy/examples/resnet/pippy_resnet.py", line 163, in