lightning-thunder
autocast is incorrectly applied even if the requested device is different.
In the example below, autocast is requested only for the cuda device, yet thunder.jit still applies it to the CPU inputs.
import thunder
import torch

def foo(x, w):
    return torch.nn.functional.linear(x, w)

device = torch.device("cpu")
with device:
    x, w = torch.randn(16, 16), torch.randn(16, 16)

print(x.dtype, w.dtype)

jfoo = thunder.jit(foo)

# Autocast is requested for "cuda", but the inputs live on the CPU.
with torch.autocast("cuda", torch.bfloat16):
    jit_out = jfoo(x, w)

print(thunder.last_traces(jfoo)[-1])
Output
# Constructed by Delete Last Used (took 0 milliseconds)
from torch import Tensor
import torch
import torch.nn.functional
from thunder.executors.torchex import no_autocast

@torch.no_grad()
@no_autocast
def computation(x, w):
  # x: "cpu f32[16, 16]"
  # w: "cpu f32[16, 16]"
  t0 = Tensor.to(x, torch.bfloat16, copy=True) # t0: "cpu bf16[16, 16]"
    # t0 = ltorch.to(x, torch.bfloat16, None, device=None, dtype=None, copy=True, memory_format=None) # t0: "cpu bf16[16, 16]"
      # t0 = prims.convert_element_type(x, dtypes.thunder.dtypes.bfloat16) # t0: "cpu bf16[16, 16]"
  del x
  t1 = Tensor.to(w, torch.bfloat16, copy=True) # t1: "cpu bf16[16, 16]"
    # t1 = ltorch.to(w, torch.bfloat16, None, device=None, dtype=None, copy=True, memory_format=None) # t1: "cpu bf16[16, 16]"
      # t1 = prims.convert_element_type(w, dtypes.thunder.dtypes.bfloat16) # t1: "cpu bf16[16, 16]"
  del w
  t2 = torch.nn.functional.linear(t0, t1, None) # t2: "cpu bf16[16, 16]"
    # t2 = ltorch.linear(t0, t1, None) # t2: "cpu bf16[16, 16]"
    # t2 = prims.linear(t0, t1, None) # t2: "cpu bf16[16, 16]"
  del t0, t1
  return t2
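For comparison, eager PyTorch applies autocast only to ops on the requested device type, so calling foo directly (without thunder.jit) keeps the CPU computation in float32. A quick check reusing the x and w from above:

with torch.autocast("cuda", torch.bfloat16):
    eager_out = foo(x, w)
print(eager_out.dtype)  # torch.float32: the CPU linear is not affected by the cuda-only autocast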
cc @crcrpar
That's right, in the autocast transform we don't consider the device:
https://github.com/Lightning-AI/lightning-thunder/blob/main/thunder/core/transforms.py#L3788
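For illustration, the missing piece boils down to comparing each tensor's device type against the device the autocast context was requested for before downcasting. A minimal standalone sketch of that check (plain Python, not thunder's actual rule; the helper name is hypothetical):

import torch

def maybe_downcast(t, autocast_device, autocast_dtype):
    # Hypothetical helper: only downcast floating-point tensors that live on
    # the device type the autocast context was requested for.
    if t.is_floating_point() and t.device.type == autocast_device:
        return t.to(autocast_dtype)
    return t

x = torch.randn(16, 16)  # CPU tensor
y = maybe_downcast(x, "cuda", torch.bfloat16)
print(y.dtype)  # torch.float32: left untouched because x is not on a cuda device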
does this have practical impacts on target models?
AFAIK, NeMo does use autocast. With our current implementation, we may silently add conversions when the user asked to apply autocast only on a certain device and there are operations happening on both CPU and GPU inside that context. Honestly, I don't think that happens in practice.
@tfogal do you know if NeMo does both CPU and GPU operations (which are affected by autocast ctx manager) within a single autocast context?
I don't know, sorry :-( @athitten might.
But I agree with you that it is unlikely, so we could just not support it for now. Still, I would ask that we 'loudly' not support mixed-device autocast: can we check for this case and error out when it happens?
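Something like the following standalone guard illustrates the 'fail loudly' idea (the function name and error message are hypothetical, not an existing thunder API):

import torch

def check_autocast_device(args, autocast_device):
    # Hypothetical guard: reject tensor inputs whose device type does not
    # match the device the autocast context was requested for.
    for a in args:
        if isinstance(a, torch.Tensor) and a.device.type != autocast_device:
            raise RuntimeError(
                f"autocast was requested for '{autocast_device}', but an input lives on "
                f"'{a.device.type}'; mixed-device autocast is not supported"
            )

x, w = torch.randn(16, 16), torch.randn(16, 16)  # CPU tensors
check_autocast_device((x, w), "cuda")  # raises RuntimeError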
I’m 100% for failing loudly if it’s not a beaten path (and this one looks like it’s not)
triage review:
- doesn't seem like a priority right now
- yes, we should mirror the behavior of the torch interface (see the snippet after this list)
- we could extend autocast beyond what torch's interface offers (sounds like a good approach), but let's not use torch's interface for that; we can build our own
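For reference on what mirroring torch means here: torch tracks autocast state per device type, so a cuda-only autocast context leaves the CPU state untouched. A quick check (assuming a CUDA-capable machine; otherwise torch disables the cuda autocast with a warning):

import torch

with torch.autocast("cuda", torch.bfloat16):
    print(torch.is_autocast_enabled())      # True: cuda autocast state
    print(torch.is_autocast_cpu_enabled())  # False: cpu autocast state is tracked separately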