Will Constable

116 comments by Will Constable

> Regarding points 2 and 3, I'm not sure why ProcessGroup initialization affects p2p comm, could you please explain it further?

The way the NCCL library currently works, it is necessary to...

I think there are a few options. 1. There is a new grad-scale feature inside pipelining: you can enable `scale_grads=True`, assuming you just want to scale by the num_microbatches, it...
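To illustrate why scaling by the number of microbatches matters, here is a minimal pure-Python sketch. It does not use the pipelining API; the toy model, the function names, and the data are all assumptions made up for illustration. It only shows the arithmetic: if each microbatch uses a mean loss and gradients are summed across equal-sized microbatches, dividing by `num_microbatches` recovers the full-batch gradient.

```python
def mean_sq_loss_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulate_microbatch_grads(w, xs, ys, num_microbatches):
    """Sum per-microbatch gradients, as gradient accumulation would (hypothetical helper)."""
    size = len(xs) // num_microbatches  # assumes the batch splits evenly
    total = 0.0
    for i in range(num_microbatches):
        lo, hi = i * size, (i + 1) * size
        total += mean_sq_loss_grad(w, xs[lo:hi], ys[lo:hi])
    return total


# Toy data, invented for this example.
w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
num_microbatches = 2

full_grad = mean_sq_loss_grad(w, xs, ys)
accumulated = accumulate_microbatch_grads(w, xs, ys, num_microbatches)
scaled = accumulated / num_microbatches  # the scaling step

print(full_grad, accumulated, scaled)
```

Without the division, the accumulated gradient here is `num_microbatches` times larger than the full-batch gradient, which acts like an unintended change to the learning rate.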

Nothing really jumps out to me from the stack trace. I think you'll have to debug the crash to find out which variable/tensor/etc. was literally causing the segv; once you...

Thanks for this proposal @evkogs! We would need to get more specific about a design to say for sure, but I think there are largely 2 issues that need to...

You could use the flight recorder to dump traces for each rank's communication operations and then use the fr_trace script to analyze the data and potentially find a root cause....

I'd say 90% chance it's a bug in the code that calls the allreduce causing one or more ranks to not make the call on one particular step. 10% it's...

We need to improve the analysis script to print out more info. The raw data, including stack traces, should be there already. If you can post a zip of...

I did a little more poking in the data files, and I did find that there is a widespread disagreement between ranks. I suggest adding a print statement before your...
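As a rough sketch of how this kind of disagreement can be located once each rank logs its collective calls, the snippet below compares per-rank call sequences and reports the first position where they diverge. This is not the fr_trace script and does not read flight recorder dumps; the log format (`(step, op)` tuples), the helper names, and the sample data are all hypothetical.

```python
from collections import Counter


def find_first_disagreement(calls_by_rank):
    """Return (index, {rank: entry}) for the first position where ranks differ, else None."""
    longest = max(len(calls) for calls in calls_by_rank.values())
    for i in range(longest):
        entries = {
            rank: (calls[i] if i < len(calls) else None)
            for rank, calls in calls_by_rank.items()
        }
        if len(set(entries.values())) > 1:
            return i, entries
    return None


def minority_ranks(entries):
    """Ranks whose entry differs from the most common entry at that position."""
    majority, _ = Counter(entries.values()).most_common(1)[0]
    return sorted(rank for rank, entry in entries.items() if entry != majority)


# Hypothetical logs: each rank records (step, op) just before issuing a collective.
# Rank 2 skips the all_reduce on step 1.
calls_by_rank = {
    0: [(0, "all_reduce"), (1, "all_reduce"), (2, "all_reduce")],
    1: [(0, "all_reduce"), (1, "all_reduce"), (2, "all_reduce")],
    2: [(0, "all_reduce"), (2, "all_reduce")],
    3: [(0, "all_reduce"), (1, "all_reduce"), (2, "all_reduce")],
}

index, entries = find_first_disagreement(calls_by_rank)
print("first mismatch at call index", index)
print("ranks out of step:", minority_ranks(entries))
```

The majority vote is only a heuristic: if half the ranks skip a call, the "minority" is ambiguous and the logged step numbers have to be inspected directly.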

I missed a detail before: you're using the `nn.all_reduce` operator, which is less commonly used and hence I didn't pick up on it before. This operator includes a backwards pass...

I think that residual connection is fine, and I don't see anything obviously wrong with your code. It looks like all_reduce is only called in one place, and I couldn't...