Will Constable
> For send/recv, yes, kind of. There are other, more complicated cases, though; for example, broadcast. OK, these look like the same thing to me. Basically, if we added support...
> @H-Huang @wconstab do you have any idea if the output logits being fp32 is a hard requirement for PP? is there any way we can leave them as bf16? sorry didn't see...
The root issue here is that a recv operation needs to know its size before starting. So every stage besides 0 needs to know the size of the new microbatch...
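To make that dependency concrete, here is a minimal sketch (not the actual pipelining implementation) of the kind of metadata exchange a stage would otherwise need before it can post the real recv: the sender first sends a fixed-size shape tensor, and the receiver uses it to allocate the buffer. `send_with_shape`, `recv_with_shape`, and `MAX_DIMS` are made-up names for illustration.

```python
import torch
import torch.distributed as dist

MAX_DIMS = 8  # assumed upper bound on tensor rank for the metadata exchange

def send_with_shape(t: torch.Tensor, dst: int):
    # Pack ndim + shape into a fixed-size int64 tensor so the receiver can
    # post a recv of known size for the metadata itself.
    # (With the NCCL backend this metadata tensor would need to live on GPU.)
    meta = torch.zeros(1 + MAX_DIMS, dtype=torch.int64)
    meta[0] = t.dim()
    meta[1 : 1 + t.dim()] = torch.tensor(t.shape, dtype=torch.int64)
    dist.send(meta, dst=dst)
    dist.send(t, dst=dst)

def recv_with_shape(src: int, dtype: torch.dtype, device: torch.device) -> torch.Tensor:
    meta = torch.zeros(1 + MAX_DIMS, dtype=torch.int64)
    dist.recv(meta, src=src)
    ndim = int(meta[0])
    shape = tuple(int(s) for s in meta[1 : 1 + ndim])
    # Only now can the receive buffer be allocated with the right size.
    buf = torch.empty(shape, dtype=dtype, device=device)
    dist.recv(buf, src=src)
    return buf
```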
hmm. we shouldn't really need DTensor to solve the problem of layer0 being saved and layer1 not being saved. The FQNs should be preserved and not conflict, so we should...
could you share the exact repro command so we can debug?
sorry for not getting back to you sooner. > in the case of rank 1, although rank 1 has the params for layers 16~31, its key names are "model.layer.0.self_attn....", "model.layer.1.self_attn...." ..., "model.layer.15.self_attn....",...
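To illustrate what "FQNs preserved" would look like, here is a toy sketch (not how torch.distributed.pipelining actually splits a model): rebuilding the layer list on each stage renumbers the keys from 0, which is the symptom above, while keeping the original module structure and only dropping the layers a stage doesn't own preserves the original FQNs so checkpoints from different stages never collide.

```python
import torch.nn as nn

# Toy 4-layer model split into two pipeline stages (illustrative only).
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(4, 4) for _ in range(4))

full = Toy()

# Naive split: re-indexing layers on the second stage loses the original FQNs,
# so its keys start at "layers.0..." again and conflict with stage 0's keys.
stage1_bad = nn.ModuleDict({"layers": nn.ModuleList(list(full.layers)[2:])})
print(list(stage1_bad.state_dict().keys())[:1])   # ['layers.0.weight']  <- renumbered

# FQN-preserving split: keep the original indices and blank out the layers
# this stage does not own, so the owned params keep keys "layers.2...." / "layers.3....".
stage1_good = Toy()
stage1_good.load_state_dict(full.state_dict())
for i in (0, 1):
    stage1_good.layers[i] = nn.Identity()
print(list(stage1_good.state_dict().keys()))      # ['layers.2.weight', ..., 'layers.3.bias']
```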
Going to close the issue as 'cannot reproduce', but feel free to reopen if you have additional input or questions! @dmammfl
@pytorchbot merge -i
@c-p-i-o do we have any actual tests for e2e usage in OSS? would be good to have some coverage
I had a question based on the example in the PR description: what does it mean to torch.compile a function but mark it as fullgraph=True? Clearly you are...
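(For context, a minimal sketch of what `fullgraph=True` does: it makes `torch.compile` raise on any graph break instead of silently splitting the function into multiple graphs and falling back to eager in between. Function names below are made up for illustration.)

```python
import torch

@torch.compile(fullgraph=True)
def f(x):
    return torch.relu(x) + 1

f(torch.randn(4))  # traces as a single graph, no breaks

@torch.compile(fullgraph=True)
def g(x):
    # Data-dependent Python control flow forces a graph break, so with
    # fullgraph=True this raises instead of compiling piecewise.
    if x.sum() > 0:
        return x + 1
    return x - 1

# g(torch.randn(4))  # raises a torch._dynamo error rather than graph-breaking
```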