jeffhataws
Hi @wonjoolee95, I just want to follow up on this issue to see if there's a fix.
PR: https://github.com/pytorch/xla/pull/8094
Reason: Fix for autocast to enable cross entropy loss with FP32 precision
[Done] Cherry-pick: https://github.com/pytorch/xla/pull/8201

PR: https://github.com/pytorch/xla/pull/8204
Reason: Multi-node SPMD support for Neuron
Cherry-pick: https://github.com/pytorch/xla/pull/8224
It seems the problem occurs when DEBUG=1.
Just want to document some findings from the original MLP test on CPU only, printing ``met.metric_data("InputOutputAliasCount")``:

```
pt2.1 with functionalization on CPU: (4, 42.0, ((1719204665.609187, 16.0), (1719204667.7811365, 2.0), (1719204668.486205, 12.0),...
```
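For reference, a minimal sketch of how that metric can be queried (the tiny throwaway graph below is a stand-in, not the original MLP test):

```python
# Minimal sketch: query the InputOutputAliasCount metric after cutting a graph.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
w = torch.randn(4, 4, device=device)
w.add_(1.0)        # an in-place update is a typical candidate for input/output aliasing
xm.mark_step()     # graph is compiled/executed; aliasing decisions are made here

# Tuple layout appears to be (count, accumulated value, [(timestamp, sample), ...]),
# matching the dump above.
print(met.metric_data("InputOutputAliasCount"))
```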
Minimal reproduction with only 1 linear layer and only gradient accumulation:

```
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics...
```
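Since the snippet above is truncated, here is a hedged reconstruction of what such a reproduction could look like; the layer shape follows the (768, 10) output mentioned below, but the batch size, learning rate, and accumulation count are placeholders rather than the author's exact values:

```python
# Sketch of a 1-linear-layer gradient accumulation loop on an XLA device.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
model = nn.Linear(768, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
grad_accum_steps = 4

for step in range(2):
    optimizer.zero_grad()
    for _ in range(grad_accum_steps):
        x = torch.randn(8, 768, device=device)
        loss = model(x).sum()
        loss.backward()          # gradients accumulate into param.grad
        xm.mark_step()           # cut one graph per accumulation step
    optimizer.step()
    xm.mark_step()

print(met.metric_data("InputOutputAliasCount"))
```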
Using TOT, I modified torch_xla/csrc/xla_graph_executor.cpp to dump some info:

```
diff --git a/torch_xla/csrc/xla_graph_executor.cpp b/torch_xla/csrc/xla_graph_executor.cpp
index 74c3270a9..d31924fb4 100644
--- a/torch_xla/csrc/xla_graph_executor.cpp
+++ b/torch_xla/csrc/xla_graph_executor.cpp
@@ -1264,6 +1264,7 @@ std::vector XLAGraphExecutor::SetBufferDonors(
   size_t tensor_index =...
```
In the functionalization graph, there is one additional output of shape (768, 10) that is not present in the no-functionalization case.
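One way to compare the two graphs (a sketch, assuming the documented XLA_SAVE_TENSORS_FILE / XLA_SAVE_TENSORS_FMT graph-dump variables and the XLA_DISABLE_FUNCTIONALIZATION toggle) is to dump the executed graphs in both modes and diff the dumped files:

```python
# Sketch: dump executed graphs to a file; rerun with functionalization disabled
# and diff the two dumps to spot the extra (768, 10) output.
import os
os.environ["XLA_SAVE_TENSORS_FILE"] = "/tmp/xla_graphs.txt"
os.environ["XLA_SAVE_TENSORS_FMT"] = "hlo"
# os.environ["XLA_DISABLE_FUNCTIONALIZATION"] = "1"  # rerun with this set to compare

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = nn.Linear(768, 10).to(device)
model(torch.randn(8, 768, device=device)).sum().backward()
xm.mark_step()  # the executed graph, including its outputs, is appended to the dump file
```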
One additional data point: when I increase the gradient accumulation count, the input tensor ID keeps changing at each gradient accumulation step when functionalization is on....
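A hedged way to observe this from Python (assuming the internal torch_xla._XLAC._get_xla_tensors_text helper, which dumps the lazy IR for a list of tensors and may change between releases) is to print the accumulated gradient's IR at each step and compare its device-data inputs:

```python
# Sketch: print the lazy IR of the accumulated gradient at each accumulation
# step; the device_data nodes show which device buffers feed the graph.
import torch
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = nn.Linear(768, 10).to(device)

for i in range(3):
    x = torch.randn(8, 768, device=device)
    model(x).sum().backward()   # gradients accumulate into model.weight.grad
    print(f"--- accumulation step {i} ---")
    print(torch_xla._XLAC._get_xla_tensors_text([model.weight.grad]))
    xm.mark_step()
```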
Makes sense. I think for the second graph, the input tensor_id 14 should be aliased to tensor_id 3, instead of being aliased to itself (as indicated by the map in SetBufferDonors:...