PiPPy
Pipeline Parallelism for PyTorch
Hi, I’m using PiPPy for PP+DP. I ran the following code: [pytorch/tau/blob/main/examples/ddp2pipe/ddp2pipe.py](https://github.com/pytorch/tau/blob/main/examples/ddp2pipe/ddp2pipe.py). I set DIMS, PP LAYERS, and DP LAYERS like this: DIMS = [28 * 28, 300, 100, 30,...
This error is thrown when using torch.nn.CrossEntropyLoss() with SPMD API.
Currently the MNIST benchmark fails due to unsupported convolution ops in the DTensor registry. Error: NotImplementedError: Operator aten.convolution.default does not have a DistributedTensor rule registered.
PRs for tau are failing due to an unrelated missing rule for DTensor: "Operator aten.fill.Scalar does not have a DistributedTensor rule registered." Details: Traceback (most recent call last): File "/__w/tau/tau/test/spmd/tensor/test_dtensor_ops.py",...
### What is the issue: PiPPy's HF model [inference](https://github.com/pytorch/tau/tree/main/examples/inference) examples use the FX tracer under the hood. Seq2Seq models such as T5, or decoder models such as OPT and BLOOM, that...
Subtask of https://github.com/pytorch/PiPPy/issues/299
Currently we run fusion based on an integer policy, where the integer maps to the total number of comm calls to fuse. Need to add a bucket-size policy handler to set up...
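A bucket-size policy could look like the sketch below: instead of fusing a fixed count of comm calls, accumulate consecutive calls until a byte budget is reached, then start a new bucket. This is a hypothetical plain-Python illustration; `CommCall` and `bucket_by_size` are made-up names, not part of the PiPPy API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CommCall:
    """Stand-in for one comm call (e.g. an all-reduce) captured in the graph."""
    name: str
    nbytes: int  # payload size of the gradient being communicated

def bucket_by_size(calls: List[CommCall], bucket_cap_bytes: int) -> List[List[CommCall]]:
    """Greedily pack consecutive comm calls into buckets of at most bucket_cap_bytes.

    A single call larger than the cap still gets its own bucket, since it
    cannot be split here.
    """
    buckets: List[List[CommCall]] = []
    current: List[CommCall] = []
    current_bytes = 0
    for call in calls:
        if current and current_bytes + call.nbytes > bucket_cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(call)
        current_bytes += call.nbytes
    if current:
        buckets.append(current)
    return buckets
```

This mirrors the greedy bucketing DDP uses for its `bucket_cap_mb` setting: order is preserved so that gradients ready earliest can be flushed earliest.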
Currently we assume all comm calls can be fused with any other comm call (i.e. all use the default process group). This is usually correct, but we need to implement a check of...
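The check could amount to partitioning comm calls by their process group before any bucketing, since only calls on the same group can legally share a fused collective. A minimal hypothetical sketch (`partition_by_group` is illustrative, not PiPPy code):

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def partition_by_group(calls: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Split comm calls into per-process-group lists, preserving order.

    `calls` is an iterable of (call_name, process_group_id) pairs; each
    returned list can then be bucketed and fused independently.
    """
    by_group: Dict[str, List[str]] = defaultdict(list)
    for name, group in calls:
        by_group[group].append(name)
    return dict(by_group)
```

Fusing across groups would produce a single collective with mismatched participants, so the safe default is to fuse only within each partition.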
Currently we default to FP32 for the fusion buffer, but that is not correct for mixed-precision cases. Thus, we need to check the shape-prop metadata and build the buffer with the correct dtype.
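One way to sketch the dtype check: read the dtype recorded per tensor (e.g. from shape-prop metadata) and require that every gradient sharing a buffer agrees on it; under mixed precision, disagreeing tensors must go into separate buffers. A hypothetical helper, using dtype strings in place of real `torch.dtype` objects:

```python
from typing import Iterable

def buffer_dtype(tensor_dtypes: Iterable[str]) -> str:
    """Return the single dtype for one fusion buffer.

    Raises if the bucketed tensors disagree on dtype, signalling that the
    bucketing pass should have split them into per-dtype buffers instead of
    silently upcasting everything to FP32.
    """
    dtypes = set(tensor_dtypes)
    if len(dtypes) != 1:
        raise ValueError(
            f"cannot share one fusion buffer across dtypes: {sorted(dtypes)}"
        )
    return dtypes.pop()
```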
After PR https://github.com/pytorch/tau/pull/631 lands, add unit testing. Simple tests would involve running fusion under a set policy, verifying output gradients, and inspecting the graph.
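The gradient-verification part of such a test can be illustrated without torch: fused communication (flatten all gradients into one buffer, reduce once, split back) must produce exactly the same values as reducing each gradient separately. A hypothetical plain-Python stand-in:

```python
from typing import List

def allreduce(xs_per_rank: List[List[float]]) -> List[float]:
    """Stand-in for an all-reduce: element-wise sum across 'ranks'."""
    return [sum(vals) for vals in zip(*xs_per_rank)]

def fused_allreduce(grads_per_rank: List[List[List[float]]]) -> List[List[float]]:
    """Flatten each rank's gradient list into one buffer, reduce once, split back."""
    sizes = [len(g) for g in grads_per_rank[0]]
    flat = [[x for grad in rank for x in grad] for rank in grads_per_rank]
    reduced = allreduce(flat)
    out, i = [], 0
    for s in sizes:
        out.append(reduced[i:i + s])
        i += s
    return out
```

A real test would do the same comparison on the traced graph's outputs, plus assert that the expected number of fused comm nodes appears in the graph.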