jeffhataws

63 comments by jeffhataws

Thanks @JackCaoG for the detailed root cause. @bdhirsh do you think this can be fixed in torch 2.4?

Hi @JackCaoG, I tried this:

```diff
diff --git a/torchgen/gen_functionalization_type.py b/torchgen/gen_functionalization_type.py
index 8d2c567c347..2392fe60a1d 100644
--- a/torchgen/gen_functionalization_type.py
+++ b/torchgen/gen_functionalization_type.py
@@ -589,7 +589,9 @@ def wrap_propagate_mutations_and_return(
 at::functionalization::impl::propagate_xla_data({outer_arg}, {inner_ret});
 at::functionalization::impl::replace_({outer_arg}, {inner_ret});
 at::functionalization::impl::commit_update({outer_arg});
-...
```

Let's enable these tests on CPU if possible. We will also enable this particular one for Neuron.
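For the device gating, a minimal sketch of the pattern I have in mind, assuming the `PJRT_DEVICE` environment variable convention used by torch_xla; the test class and body here are hypothetical placeholders, not the actual tests in question:

```python
# Hypothetical sketch: run a test on CPU and Neuron only, skipping elsewhere.
# PJRT_DEVICE is torch_xla's device selector; the test itself is a placeholder.
import os
import unittest

import torch
import torch_xla.core.xla_model as xm

DEVICE_TYPE = os.environ.get("PJRT_DEVICE", "CPU")


class ExampleCpuNeuronTest(unittest.TestCase):
    @unittest.skipUnless(DEVICE_TYPE in ("CPU", "NEURON"), "enabled on CPU and Neuron only")
    def test_basic_compile(self):
        device = xm.xla_device()
        t = torch.ones(4, device=device) * 2
        xm.mark_step()  # force XLA compilation/execution
        self.assertEqual(t.cpu().sum().item(), 8.0)


if __name__ == "__main__":
    unittest.main()
```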

Thanks @bfolie . Do you know if [this line](https://github.com/pytorch/xla/blob/master/torch_xla/csrc/cross_replica_reduces.cpp#L312) is also affected?

For some reason multi-node all-gather is now crashing. Let me debug and isolate a test case. The crash trace is below in case you know where to look:

```
F0616 05:45:23.533522...
```

@bfolie https://github.com/pytorch/xla/pull/9403 is the fix for the all-gather issue above. I will try to narrow it down to a smaller unit test.

Narrowed it down to a single-node test, but it still has NeuronX Distributed and real dataset dependencies. Will narrow it down some more.
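Roughly, the standalone repro I'm aiming for would look like the sketch below: a hypothetical single-file all-gather check with no NeuronX Distributed or dataset dependencies, assuming `xm.all_gather` and `xmp.spawn` (shapes and assertions are placeholders):

```python
# Hypothetical minimal all-gather repro: no NeuronX Distributed, no real dataset.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.runtime as xr


def _mp_fn(index):
    device = xm.xla_device()
    world_size = xr.world_size()
    rank = xr.global_ordinal()
    # Each rank contributes a small tensor tagged with its ordinal.
    value = torch.ones(4, device=device) * rank
    gathered = xm.all_gather(value, dim=0)
    xm.mark_step()
    expected = torch.cat([torch.ones(4) * r for r in range(world_size)])
    assert torch.allclose(gathered.cpu(), expected), f"rank {rank}: unexpected all_gather result"


if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=())
```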

We should also support 3.13.