PyTorch native 2D LLaMA inference
Current status: Working
# PP = 2, TP = 4
$ torchrun --nproc-per-node 8 pippy_llama.py
['make', 'think', 'you', 'be', 'getting', 'great', 'favorite', 'right']
['make', 'think', 'you', 'be', 'getting', 'great', 'favorite', 'right']
['make', 'think', 'you', 'be', 'getting', 'great', 'favorite', 'right']
['make', 'think', 'you', 'be', 'getting', 'great', 'favorite', 'right']
Previous issues:
TP self-attention was hitting the following issue:
view = l__self___model_layers_0_self_attn_q_proj.view(4, 4, 32, 128); l__self___model_layers_0_self_attn_q_proj = None
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[4, 4, 32, 128]' is invalid for input of size 16384
4 * 4 * 32 * 128 = 65536, and 65536 / 4 = 16384 (4 is my TP size), so that explains it: the colwise-sharded q_proj output is 4 times smaller per rank, but the view still asks for the full 32 heads. User code:
xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
Cc: @fduwjj @wanchaol @HamidShojanazeri @wconstab
Can you shed some light here? @fduwjj mentioned that we would need to modify self.n_local_heads to be 4 times smaller -- whether in the eager case or the traced case. In the traced case, I can modify the view node to change its arg, for example 32 -> 8. That's slightly better than asking the user to modify model code. But is there a better way?
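For the traced case, here is a minimal sketch of what that node rewrite could look like (the names shrink_view_heads and traced_module, and the hard-coded n_heads=32, are assumptions for illustration, not PiPPy APIs):

import torch.fx as fx

def shrink_view_heads(traced_module: fx.GraphModule, tp_size: int, n_heads: int = 32):
    # Rewrite xq.view(bsz, seqlen, n_heads, head_dim) nodes so the head count
    # matches the per-rank TP shard, e.g. 32 -> 8 when tp_size = 4.
    for node in traced_module.graph.nodes:
        if node.op == "call_method" and node.target == "view":
            node.args = (node.args[0],) + tuple(
                a // tp_size if isinstance(a, int) and a == n_heads else a
                for a in node.args[1:]
            )
    traced_module.graph.lint()
    traced_module.recompile()
    return traced_module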
I still think the cleanest fix here is to make PP tracing + the unflattener work; otherwise we should probably wait for DTensor to support the scaled dot product attention op instead. It surprised me that the current use_local_output path works; I think the only reason is that LLaMA 7B does not use scaled_dot_product_attention.
If they are all plain tensors, scaled_dot_product_attention should work as long as we pass in the correct sizes?
Documenting my discussion with @wanchaol w.r.t. DTensor and scaled_dot_product_attention:
@kwen2501: Should we (1) do to_local as soon as we finish the colwise linear, (2) do to_local only when we hit an op like scaled dot product, or (3) have scaled dot product support DTensor directly? Maybe 2 and 3 are the same thing, meaning the DTensor dispatcher performs a to_local before calling the actual scaled dot product.
@wanchaol: The current way is to do to_local as soon as we leave the linear layer computation; this is the easiest thing to do with module forward hooks. If we instead do to_local when we hit an op like scaled dot product attention, that is technically like implementing the scaled dot product attention op already, i.e. when implementing a DTensor op we just figure out the sharding and then call the op on the local tensor.
@kwen2501: Now, in this case, the view ops sit between the colwise linear and scaled dot product, so it seems the delayed route would work better. But I do agree that, without the view ops, the early route would be easier. This means the delayed route is a user choice (likely non-default), and we patch that route with DTensor support for scaled dot product.
@wanchaol: Yeah, I think we should support both routes via use_local_output=False/True. The delayed route requires us to implement scaled dot product attention for DTensor, I think, but it shouldn't be too hard to enable.
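To make the two routes concrete, a rough conceptual sketch (hypothetical names; xq_dt, xk_dt, xv_dt stand for DTensors produced by the colwise projections with use_local_output=False, already viewed/transposed to the usual (bsz, heads, seqlen, head_dim) layout):

import torch.nn.functional as F

# Early route (use_local_output=True, current behavior): q/k/v are already
# local shards, so user code must view with n_local_heads = n_heads // tp_size
# before calling attention.

# Delayed route (use_local_output=False): keep DTensors through the view ops,
# then either rely on a DTensor sharding rule for scaled_dot_product_attention
# or drop to local tensors right before the op, e.g.:
def delayed_route_attention(xq_dt, xk_dt, xv_dt):
    xq, xk, xv = (t.to_local() for t in (xq_dt, xk_dt, xv_dt))
    return F.scaled_dot_product_attention(xq, xk, xv)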