PyTorch native 2D LLaMA inference
Current status: Working
# PP = 2, TP = 4
$ torchrun --nproc-per-node 8 pippy_llama.py
['make', 'think', 'you', 'be', 'getting', 'great', 'favorite', 'right']
['make', 'think', 'you', 'be', 'getting', 'great', 'favorite', 'right']
['make', 'think', 'you', 'be', 'getting', 'great', 'favorite', 'right']
['make', 'think', 'you', 'be', 'getting', 'great', 'favorite', 'right']
Previous issues:
TP self-attention was hitting the following issue:
view = l__self___model_layers_0_self_attn_q_proj.view(4, 4, 32, 128); l__self___model_layers_0_self_attn_q_proj = None
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[4, 4, 32, 128]' is invalid for input of size 16384
4 * 4 * 32 * 128 = 65536, and 65536 / 4 = 16384 (4 is my TP size), so that explains it: the colwise-sharded q_proj output is 4 times smaller per rank, but the view still asks for the full 32 heads. User code:
xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
Cc: @fduwjj @wanchaol @HamidShojanazeri @wconstab
Can you shed some light here? @fduwjj mentioned that we would need to modify self.n_local_heads to be 4 times smaller -- whether in the eager case or the traced case. In the traced case, I can modify the view node to change its arg, for example 32 -> 8. That's slightly better than asking the user to modify model code. But is there a better way?
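For the traced case, here is a minimal sketch of what that node rewrite could look like (the names shrink_view_heads and traced_module, and the hard-coded n_heads=32, are assumptions for illustration, not PiPPy APIs):

import torch.fx as fx

def shrink_view_heads(traced_module: fx.GraphModule, tp_size: int, n_heads: int = 32):
    # Rewrite xq.view(bsz, seqlen, n_heads, head_dim) nodes so the head count
    # matches the per-rank TP shard, e.g. 32 -> 8 when tp_size = 4.
    for node in traced_module.graph.nodes:
        if node.op == "call_method" and node.target == "view":
            node.args = (node.args[0],) + tuple(
                a // tp_size if isinstance(a, int) and a == n_heads else a
                for a in node.args[1:]
            )
    traced_module.graph.lint()
    traced_module.recompile()
    return traced_module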
I still think the cleanest fix here is to make PP tracing + the unflattener work; otherwise we should probably wait for DTensor to support the scaled dot product attention op instead. It surprised me that the current use_local_output path works; I think the only reason is that LLaMA 7B does not use scaled_dot_product_attention.
If they are all plain tensors, scaled_dot_product_attention should work as long as we pass in the correct sizes?
Documenting my discussion with @wanchaol w.r.t. DTensor and scaled_dot_product_attention:
@kwen2501: Should we (1) do to_local as soon as we finish the colwise linear, (2) do to_local only when we hit an op like scaled dot product, or (3) have scaled dot product support DTensor directly? Maybe 2 and 3 are the same thing, meaning the DTensor dispatcher performs a to_local before calling the actual scaled dot product.
@wanchaol: The current way is to do to_local as soon as we leave the linear layer computation; this is the easiest thing to do with module forward hooks. If we instead do to_local when we hit an op like scaled dot product attention, that is technically like implementing the scaled dot product attention op already, i.e. when implementing a DTensor op we just figure out the sharding and then call the op on the local tensor.
@kwen2501: Now, in this case, the view ops sit between the colwise linear and scaled dot product, so it seems the delayed route would work better. But I do agree that, without the view ops, the early route would be easier. This means the delayed route is a user choice (likely non-default), and we patch that route with DTensor support for scaled dot product.
@wanchaol: Yeah, I think we should support both routes via use_local_output=False/True. The delayed route requires us to implement scaled dot product attention for DTensor, I think, but it shouldn't be too hard to enable.
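To make the two routes concrete, a rough conceptual sketch (hypothetical names; xq_dt, xk_dt, xv_dt stand for DTensors produced by the colwise projections with use_local_output=False, already viewed/transposed to the usual (bsz, heads, seqlen, head_dim) layout):

import torch.nn.functional as F

# Early route (use_local_output=True, current behavior): q/k/v are already
# local shards, so user code must view with n_local_heads = n_heads // tp_size
# before calling attention.

# Delayed route (use_local_output=False): keep DTensors through the view ops,
# then either rely on a DTensor sharding rule for scaled_dot_product_attention
# or drop to local tensors right before the op, e.g.:
def delayed_route_attention(xq_dt, xk_dt, xv_dt):
    xq, xk, xv = (t.to_local() for t in (xq_dt, xk_dt, xv_dt))
    return F.scaled_dot_product_attention(xq, xk, xv)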