Will Constable
> For send/recv, yes, kind of. There are other, more complicated cases, though; for example, broadcast. OK, these look like the same thing to me. Basically, if we added support...
> @H-Huang @wconstab do you have any idea if the output logits being fp32 is a hard requirement for PP? is there any way we can leave them as bf16? sorry didn't see...
The root issue here is that a recv operation needs to know its size before starting. So every stage besides 0 needs to know the size of the new microbatch...
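To make that dependency concrete, here is a minimal sketch (not the actual pipelining implementation) of the kind of metadata exchange a stage would otherwise need before it can post the real recv: the sender first sends a fixed-size shape tensor, and the receiver uses it to allocate the buffer. `send_with_shape`, `recv_with_shape`, and `MAX_DIMS` are made-up names for illustration.

```python
import torch
import torch.distributed as dist

MAX_DIMS = 8  # assumed upper bound on tensor rank for the metadata exchange

def send_with_shape(t: torch.Tensor, dst: int):
    # Pack ndim + shape into a fixed-size int64 tensor so the receiver can
    # post a recv of known size for the metadata itself.
    # (With the NCCL backend this metadata tensor would need to live on GPU.)
    meta = torch.zeros(1 + MAX_DIMS, dtype=torch.int64)
    meta[0] = t.dim()
    meta[1 : 1 + t.dim()] = torch.tensor(t.shape, dtype=torch.int64)
    dist.send(meta, dst=dst)
    dist.send(t, dst=dst)

def recv_with_shape(src: int, dtype: torch.dtype, device: torch.device) -> torch.Tensor:
    meta = torch.zeros(1 + MAX_DIMS, dtype=torch.int64)
    dist.recv(meta, src=src)
    ndim = int(meta[0])
    shape = tuple(int(s) for s in meta[1 : 1 + ndim])
    # Only now can the receive buffer be allocated with the right size.
    buf = torch.empty(shape, dtype=dtype, device=device)
    dist.recv(buf, src=src)
    return buf
```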
hmm. we shouldn't really need DTensor to solve the problem of layer0 being saved and layer1 not being saved. The FQNs should be preserved and not conflict, so we should...
could you share the exact repro command so we can debug?
sorry for not getting back to you sooner. > in the case of rank 1, although rank 1 has the params for layers 16~31, its key names are "model.layer.0.self_attn....", "model.layer.1.self_attn...." ..., "model.layer.15.self_attn....",...
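To illustrate what "FQNs preserved" would look like, here is a toy sketch (not how torch.distributed.pipelining actually splits a model): rebuilding the layer list on each stage renumbers the keys from 0, which is the symptom above, while keeping the original module structure and only dropping the layers a stage doesn't own preserves the original FQNs so checkpoints from different stages never collide.

```python
import torch.nn as nn

# Toy 4-layer model split into two pipeline stages (illustrative only).
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(4, 4) for _ in range(4))

full = Toy()

# Naive split: re-indexing layers on the second stage loses the original FQNs,
# so its keys start at "layers.0..." again and conflict with stage 0's keys.
stage1_bad = nn.ModuleDict({"layers": nn.ModuleList(list(full.layers)[2:])})
print(list(stage1_bad.state_dict().keys())[:1])   # ['layers.0.weight']  <- renumbered

# FQN-preserving split: keep the original indices and blank out the layers
# this stage does not own, so the owned params keep keys "layers.2...." / "layers.3....".
stage1_good = Toy()
stage1_good.load_state_dict(full.state_dict())
for i in (0, 1):
    stage1_good.layers[i] = nn.Identity()
print(list(stage1_good.state_dict().keys()))      # ['layers.2.weight', ..., 'layers.3.bias']
```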
Going to close the issue as 'cannot reproduce', but feel free to reopen if you have additional input or questions! @dmammfl
@pytorchbot merge -i
@c-p-i-o do we have any actual tests for e2e usage in OSS? would be good to have some coverage
I had a question based on the example in the PR description: what does it mean to torch.compile a function but mark it as fullgraph=True? Clearly you are...
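(For context, a minimal sketch of what `fullgraph=True` does: it makes `torch.compile` raise on any graph break instead of silently splitting the function into multiple graphs and falling back to eager in between. Function names below are made up for illustration.)

```python
import torch

@torch.compile(fullgraph=True)
def f(x):
    return torch.relu(x) + 1

f(torch.randn(4))  # traces as a single graph, no breaks

@torch.compile(fullgraph=True)
def g(x):
    # Data-dependent Python control flow forces a graph break, so with
    # fullgraph=True this raises instead of compiling piecewise.
    if x.sum() > 0:
        return x + 1
    return x - 1

# g(torch.randn(4))  # raises a torch._dynamo error rather than graph-breaking
```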