jeffhataws

63 comments by jeffhataws

PR: https://github.com/pytorch/xla/pull/8094 Reason: Fix for autocast to enable cross entropy loss with FP32 precision [Done] Cherry-pick: https://github.com/pytorch/xla/pull/8201

PR: https://github.com/pytorch/xla/pull/8204 Reason: Multi-node SPMD support for Neuron Cherry-pick: https://github.com/pytorch/xla/pull/8224

It seems the problem occurs when DEBUG=1.

Just want to document some findings from the original MLP test on CPU only, printing ``met.metric_data("InputOutputAliasCount")``:

```
pt2.1 with functionalization on CPU: (4, 42.0, ((1719204665.609187, 16.0), (1719204667.7811365, 2.0), (1719204668.486205, 12.0), ...
```
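As a side note, the printed tuple appears to follow the usual metric layout of `(num_samples, total, ((timestamp, value), ...))`. Here is a minimal, pure-Python sketch of summarizing such a tuple, assuming that layout; `summarize_alias_metric` is a hypothetical helper, not a torch_xla API, and the sample data is the truncated CPU run shown above.

```python
# Sketch: summarize a metric tuple of the assumed form
# (num_samples, total, ((timestamp, value), ...)).
# Pure Python; no torch_xla required.

def summarize_alias_metric(metric):
    count, total, samples = metric
    values = [v for _, v in samples]
    return {
        "samples": count,        # sample count reported by the metric
        "total": total,          # accumulated alias count
        "last": values[-1] if values else None,  # most recent sample value
    }

# First three samples from the CPU run above (the dump is truncated,
# so only three of the four samples are visible):
data = (4, 42.0, ((1719204665.609187, 16.0),
                  (1719204667.7811365, 2.0),
                  (1719204668.486205, 12.0)))
print(summarize_alias_metric(data))
```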

Minimal reproduction with only one linear layer and only gradient accumulation:

```python
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics...
```

Using TOT, I modified torch_xla/csrc/xla_graph_executor.cpp to dump some info:

```diff
diff --git a/torch_xla/csrc/xla_graph_executor.cpp b/torch_xla/csrc/xla_graph_executor.cpp
index 74c3270a9..d31924fb4 100644
--- a/torch_xla/csrc/xla_graph_executor.cpp
+++ b/torch_xla/csrc/xla_graph_executor.cpp
@@ -1264,6 +1264,7 @@ std::vector XLAGraphExecutor::SetBufferDonors(
   size_t tensor_index =...
```

In the functionalization graph, there is one additional output of shape (768, 10) that is not present in the no-functionalization case.

One additional data point: when I increase the gradient accumulation count, I see that the input tensor ID keeps changing at each gradient accumulation step when functionalization is on....
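A toy model of why that ID churn matters: if the buffer-donor bookkeeping is keyed by tensor ID and functionalization mints a fresh ID every accumulation step, lookups against the originally registered ID miss. This is a minimal sketch in plain Python; `find_donor` and the IDs are illustrative, not torch_xla code.

```python
def find_donor(donor_map, tensor_id):
    # Return the donated buffer registered for this tensor ID,
    # or None if no donor entry matches (an aliasing miss).
    return donor_map.get(tensor_id)

# Donor registered once for the stable input, tensor_id 3:
donor_map = {3: "grad_buffer"}

# Without functionalization the graph keeps reusing tensor_id 3,
# so the donor is found every step:
assert find_donor(donor_map, 3) == "grad_buffer"

# With functionalization each accumulation step produces a new
# input ID (e.g. 14), so the same lookup misses:
assert find_donor(donor_map, 14) is None
```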

Makes sense. I think for the second graph, the input tensor_id 14 should be aliased to tensor_id 3, instead of being aliased to itself (as indicated by the map in SetBufferDonors:...
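The suggestion above can be sketched as follows: before registering a donor, walk a functionalization alias map back to the base tensor (tensor_id 3) rather than aliasing the rewritten ID (14) to itself. `resolve_base_id` and `alias_map` are hypothetical names for illustration, not torch_xla internals.

```python
def resolve_base_id(alias_map, tensor_id):
    # Follow the alias chain until we reach a tensor that is not an
    # alias of anything else, i.e. the true graph input. The `seen`
    # set guards against accidental cycles in the map.
    seen = set()
    while tensor_id in alias_map and tensor_id not in seen:
        seen.add(tensor_id)
        tensor_id = alias_map[tensor_id]
    return tensor_id

# Functionalization recorded that tensor 14 is an updated view of tensor 3:
alias_map = {14: 3}

# Buffer donation should target the base ID, not the alias itself:
assert resolve_base_id(alias_map, 14) == 3
# A tensor with no alias entry resolves to itself:
assert resolve_base_id(alias_map, 7) == 7
```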