Will Constable
> Regarding points 2 and 3, I'm not sure why ProcessGroup initialization affects p2p comm, could you please explain it further? The way the NCCL library currently works, it is necessary to...
I think there are a few options. 1. There is a new grad-scale feature inside pipelining: you can enable `scale_grads=True`, assuming you just want to scale by the num_microbatches, it...
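To make "scale by the num_microbatches" concrete, here is a minimal pure-Python sketch of the arithmetic, not the pipelining implementation itself. The helper names `grad_mse` and `accumulated_grad` and the toy one-parameter model are assumptions made for illustration; the `scale_grads` argument here only mirrors the name of the flag mentioned above. It assumes equal-sized microbatches and a mean-reduced loss.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, n_microbatches, scale_grads):
    """Accumulate per-microbatch gradients, optionally dividing by n_microbatches.

    Hypothetical helper: it imitates gradient accumulation over equal-sized
    microbatches, which is what a pipeline schedule does across microbatches.
    """
    size = len(xs) // n_microbatches
    total = 0.0
    for i in range(n_microbatches):
        lo, hi = i * size, (i + 1) * size
        total += grad_mse(w, xs[lo:hi], ys[lo:hi])
    return total / n_microbatches if scale_grads else total


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full_batch = grad_mse(w, xs, ys)
scaled = accumulated_grad(w, xs, ys, n_microbatches=2, scale_grads=True)
unscaled = accumulated_grad(w, xs, ys, n_microbatches=2, scale_grads=False)

print(full_batch, scaled, unscaled)
```

With scaling, the accumulated gradient matches the full-batch gradient; without it, the result is larger by a factor of the number of microbatches.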
Nothing really jumps out to me from the stack trace. I think you'll have to debug the crash to find out which variable/tensor/etc. was literally causing the segv; once you...
Thanks for this proposal @evkogs! We would need to get more specific about a design to say for sure, but I think there are largely 2 issues that need to...
You could use the flight recorder to dump traces for each rank's communication operations and then use the fr_trace script to analyze the data and potentially find a root cause....
I'd say 90% chance it's a bug in the code that calls the allreduce causing one or more ranks to not make the call on one particular step. 10% it's...
We need to improve the analysis script to print out more info. The raw data including stack traces should be there already. If you can post a zip of...
I did a little more poking in the data files, and I did find that there is a widespread disagreement between ranks. I suggest adding a print statement before your...
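One way to act on per-rank print statements is to collect what each rank logged before every collective and look for the first point where the ranks diverge. The sketch below is only an illustration under assumed inputs: the function `find_first_mismatch` and the log format (a list of `(step, op_name)` tuples per rank) are hypothetical and are not part of the flight recorder or the fr_trace script.

```python
from collections import defaultdict


def find_first_mismatch(logs):
    """Return (call_index, {entry: [ranks]}) for the first divergent call, or None.

    logs maps rank -> ordered list of (step, op_name) tuples, one tuple per
    collective call that the rank printed before issuing it (assumed format).
    A rank that ran out of entries is recorded under the key None.
    """
    longest = max(len(calls) for calls in logs.values())
    for i in range(longest):
        seen = defaultdict(list)
        for rank, calls in sorted(logs.items()):
            entry = calls[i] if i < len(calls) else None
            seen[entry].append(rank)
        if len(seen) > 1:
            return i, dict(seen)
    return None


# Example: rank 1 skips the all_reduce on step 1, so the ranks disagree there.
example_logs = {
    0: [(0, "all_reduce"), (1, "all_reduce")],
    1: [(0, "all_reduce")],
}
mismatch = find_first_mismatch(example_logs)
print(mismatch)
```

A rank grouped under `None` issued fewer collectives than the others, which is the "one or more ranks did not make the call" situation.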
I missed a detail before: you're using the `nn.all_reduce` operator, which is less commonly used and hence I didn't pick up on it before. This operator includes a backwards pass...
I think that residual connection is fine, and I don't see anything obviously wrong with your code. It looks like all_reduce is only called in one place, and I couldn't...