jeffhataws
Thanks @JackCaoG for the detailed root cause. @bdhirsh do you think this can be fixed in torch 2.4?
Hi @JackCaoG , I tried this:

```diff
diff --git a/torchgen/gen_functionalization_type.py b/torchgen/gen_functionalization_type.py
index 8d2c567c347..2392fe60a1d 100644
--- a/torchgen/gen_functionalization_type.py
+++ b/torchgen/gen_functionalization_type.py
@@ -589,7 +589,9 @@ def wrap_propagate_mutations_and_return(
 at::functionalization::impl::propagate_xla_data({outer_arg}, {inner_ret});
 at::functionalization::impl::replace_({outer_arg}, {inner_ret});
 at::functionalization::impl::commit_update({outer_arg});
-...
```
Let's enable these tests on CPU if possible. We will also enable this particular one for Neuron.
Thanks @bfolie . Do you know if [this line](https://github.com/pytorch/xla/blob/master/torch_xla/csrc/cross_replica_reduces.cpp#L312) is also affected?
Thanks @bfolie for the fix! I have confirmed that it works for Neuron.
For some reason multi-node all-gather is now crashing. Let me debug and isolate a testcase. The crash trace is below in case you know where to look:

```
F0616 05:45:23.533522...
```
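For reference, a minimal sketch of the kind of multi-node all-gather call that exercises this path; the tensor shape and workload here are placeholders I made up for illustration, not the actual failing job:

```python
# Minimal all-gather sketch (placeholder workload, not the real training script).
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    device = xm.xla_device()
    # Each replica contributes a small tensor whose values identify its rank.
    t = torch.full((4,), float(xm.get_ordinal()), device=device)
    # all_gather concatenates the per-replica tensors along dim 0.
    gathered = xm.all_gather(t, dim=0)
    xm.mark_step()
    print(index, gathered.cpu())


if __name__ == "__main__":
    xmp.spawn(_mp_fn)
```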
@bfolie https://github.com/pytorch/xla/pull/9403 is the fix for the allgather issue above. I will try to narrow it down to a smaller unit test.
Narrowed it down to a single-node test, but it still has NeuronX Distributed and real-dataset dependencies. Will narrow it down some more.
Seems to happen with TP + ZeRO1.
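The ZeRO-1 side of the setup looks roughly like the sketch below; this assumes torch_xla's `ZeroRedundancyOptimizer` wrapper, with a toy model and hyperparameters standing in for the real NeuronX Distributed TP configuration. The relevant detail is that the ZeRO-1 optimizer step all-gathers the updated parameter shards, which is where this may tie back to the all-gather crash above.

```python
# Sketch of the ZeRO-1 optimizer setup (toy model; the real test uses NeuronX Distributed TP).
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
from torch_xla.distributed.zero_redundancy_optimizer import ZeroRedundancyOptimizer

device = xm.xla_device()
model = nn.Linear(16, 16).to(device)

# ZeRO-1 shards optimizer state across data-parallel ranks and
# all-gathers the updated parameters after the local optimizer step.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    torch.optim.SGD,
    lr=1e-3,
)

loss = model(torch.randn(4, 16, device=device)).sum()
loss.backward()
optimizer.step()
xm.mark_step()
```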
We should also support Python 3.13.