PiPPy
Pippy ddp2pipe example doesn't work for pipeline
Hi, I'm using pippy for PP + DP. I ran the following example: pytorch/tau/blob/main/examples/ddp2pipe/ddp2pipe.py
I set DIMS, DP_LAYERS, and PP_LAYERS like this:
DIMS = [28 * 28, 300, 100, 30, 10]
DP_LAYERS = 2
PP_LAYERS = 2
and ran the command torchrun --nproc_per_node=4 ddp2pipe.py, but it doesn't work. I got the following error:
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
(...)
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [28,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [29,0,0] Assertion `t >= 0 && t < n_classes` failed.
On WorkerInfo(id=3, name=worker3):
RuntimeError('CUDA error: device-side assert triggered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.')
Traceback (most recent call last):
  File "/data/dd/anaconda3/envs/torch113/lib/python3.10/site-packages/torch/distributed/rpc/internal.py", line 206, in _run_function
    result = python_udf.func(*python_udf.args, **python_udf.kwargs)
  File "/data/dd/anaconda3/envs/torch113/lib/python3.10/site-packages/torch/distributed/rpc/rref_proxy.py", line 11, in _local_invoke
    return getattr(rref.local_value(), func_name)(*args, **kwargs)
  File "/data/dd/anaconda3/envs/torch113/lib/python3.10/site-packages/pippy-0.1.0a0+80f91a4-py3.10.egg/pippy/PipelineDriver.py", line 945, in get_value
    value = refcounted_future.future.wait()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(...)
[W tensorpipe_agent.cpp:940] RPC agent for worker0 encountered error when reading incoming response from worker2: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
But when I set PP_LAYERS=1 (with any DP_LAYERS), it works as pure data parallel. I think this problem is related to the loss or the classes of the dataset; there is no loss phase in the DDP part of the model, which is why the DP-only case works. How can I solve this problem? I'm also wondering whether I can use this script for multi-node runs. Thank you.
I'm using Python 3.8 / PyTorch 1.13 / CUDA 11.6 / NVIDIA TITAN Xp. (I already created a topic for this problem on the PyTorch forum: https://discuss.pytorch.org/t/pippy-ddp2pipe-doesnt-work-for-pipeline/171608)
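For reference, that assertion is the standard out-of-range-target check in nll_loss / cross_entropy. Here is a minimal sketch, unrelated to pippy and assuming a CUDA device, that reproduces the same message when a label falls outside [0, n_classes):

```python
import torch
import torch.nn.functional as F

# n_classes = 10, but one target label is 30, i.e. outside [0, 10)
logits = torch.randn(4, 10, device="cuda")
targets = torch.tensor([0, 3, 9, 30], device="cuda")

# Triggers: Assertion `t >= 0 && t < n_classes` failed
# followed by "RuntimeError: CUDA error: device-side assert triggered"
loss = F.cross_entropy(logits, targets)
```

So my guess is that, in the pipelined run, the target tensor that reaches the loss stage no longer contains valid class indices, whereas in the DP-only run the loss never goes through the pipeline.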
I observed similar errors with the official examples, using Python 3.8 / PyTorch 1.13 / CUDA 11.6 / NVIDIA A100. Did you manage to solve this problem or find any candidate cause of the error?
According to what the developer said in the following link, ddp2pipe is not stable for now: it only works with the parameters specified in the example, and it may work on CPU only.
https://discuss.pytorch.org/t/pippy-i-cant-see-backward-pass/170630/8
Hi, thanks for trying out this example.
The ddp2pipe example is still a work in progress. It works in CPU mode, but GPU mode currently requires setting TORCH_DISTRIBUTED_DEBUG=DETAIL to avoid a hang.
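(With the launch command above, that would look something like TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nproc_per_node=4 ddp2pipe.py, or exporting the variable before launching.)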
I'd like to mention that ddp2pipe is an example of using DDP to wrap part of a model, followed by pipelining the rest of the model. It is not an example of data + pipeline parallelism in the 2D sense.
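To make that concrete, here is a minimal conceptual sketch of "DDP on part of a model, pipeline the rest" (an illustration only, with assumed module shapes; it is not the actual ddp2pipe code and does not use pippy APIs):

```python
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes launch via torchrun, which sets RANK / WORLD_SIZE / MASTER_ADDR etc.
dist.init_process_group(backend="gloo")

# Front part of the model: replicated on every rank and wrapped in DDP,
# so its gradients are all-reduced across ranks.
front = DDP(nn.Sequential(nn.Linear(28 * 28, 300), nn.ReLU()))

# Back part of the model: this is the part an example like ddp2pipe hands to
# pippy, which splits it into stages and runs them as a pipeline.
back = nn.Sequential(nn.Linear(300, 100), nn.ReLU(), nn.Linear(100, 10))
```

In a 2D data + pipeline parallel setup, by contrast, each pipeline stage is itself replicated across data-parallel ranks, which is the pattern in the example linked just below.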
A 2D data + pipeline parallel example is at: https://github.com/pytorch/tau/blob/main/test/local_test_ddp.py
If you are interested, I can also create a 2D example with a realistic model (like T5).
@kwen2501
Thank you for letting me know. I thought it was DP + PP. I'll try the example you mentioned. Now I'm wondering what ddp2pipe is for (the reason you wrap part of the model in DDP and pipeline the rest).