PiPPy
Pipeline Parallelism for PyTorch
Per Alisson - we can reduce memory overhead by creating the global buffer at first use (i.e., just before the first fusion) rather than the current instantiation at the...
[spmd] incorrect aten.expand call with nn.linear (expanded size must match existing size at dim 0)
This is to track/investigate the issue reported by Rich Zhu, where using permute to generate a transposed tensor for nn.linear results in an incorrect aten.expand call. I've found two potential...
CI failure caused by HF changes. ``` test/hf_test.py:637: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _...
After expansion of DTensor communication operations, fx inserts a clone operation to clone the gradient tensor. This operation hurts performance and adds memory overhead, but is technically...
While running the pytests for a recent PR, I was allocated a 3-GPU server rather than a 4-GPU one (presumably a bad GPU on a 4-GPU server, but unclear...
Implement a graph using torch.cat and convert it via SPMD. Received: `NotImplementedError: Operator aten.cat.default does not have a DistributedTensor rule registered.` Code location: File "/home/ubuntu/graph/spmd/api.py", line 110, in _get_dtensor_dispatch_graph...
**What the problem is:** Both single-node and sharded `TensorParallelMultiheadAttention` (#477) modules diverge (the forward output becomes `-inf` after fewer than 10 iterations). They also produce different forward outputs, of which the...
**What the problem is:** - The sharded `TensorParallelMultiheadAttention` (#477) module fails to update the `proj.bias` parameter even though the back-propagated **gradient is correct**. - Also, this error doesn't occur on rank 0. **How to...
Passing a DTensor into `spmd.distribute_tensor`, or more specifically into `DeviceMesh`, will cause issues - in `device_mesh.broadcast`, it causes an assert to fail deep inside torch code - in...