lightning-thunder
autocast is incorrectly applied even if the requested device is different.
In the example below, autocast is requested only for the cuda device, yet thunder.jit still applies it to the CPU inputs.
import thunder
import torch

def foo(x, w):
    return torch.nn.functional.linear(x, w)

device = torch.device("cpu")
with device:
    x, w = torch.randn(16, 16), torch.randn(16, 16)

print(x.dtype, w.dtype)

jfoo = thunder.jit(foo)

# Autocast is requested for "cuda", but the inputs live on the CPU.
with torch.autocast("cuda", torch.bfloat16):
    jit_out = jfoo(x, w)

print(thunder.last_traces(jfoo)[-1])
Output
# Constructed by Delete Last Used (took 0 milliseconds)
from torch import Tensor
import torch
import torch.nn.functional
from thunder.executors.torchex import no_autocast

@torch.no_grad()
@no_autocast
def computation(x, w):
  # x: "cpu f32[16, 16]"
  # w: "cpu f32[16, 16]"
  t0 = Tensor.to(x, torch.bfloat16, copy=True) # t0: "cpu bf16[16, 16]"
    # t0 = ltorch.to(x, torch.bfloat16, None, device=None, dtype=None, copy=True, memory_format=None) # t0: "cpu bf16[16, 16]"
      # t0 = prims.convert_element_type(x, dtypes.thunder.dtypes.bfloat16) # t0: "cpu bf16[16, 16]"
  del x
  t1 = Tensor.to(w, torch.bfloat16, copy=True) # t1: "cpu bf16[16, 16]"
    # t1 = ltorch.to(w, torch.bfloat16, None, device=None, dtype=None, copy=True, memory_format=None) # t1: "cpu bf16[16, 16]"
      # t1 = prims.convert_element_type(w, dtypes.thunder.dtypes.bfloat16) # t1: "cpu bf16[16, 16]"
  del w
  t2 = torch.nn.functional.linear(t0, t1, None) # t2: "cpu bf16[16, 16]"
    # t2 = ltorch.linear(t0, t1, None) # t2: "cpu bf16[16, 16]"
    # t2 = prims.linear(t0, t1, None) # t2: "cpu bf16[16, 16]"
  del t0, t1
  return t2
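For comparison, eager PyTorch applies autocast only to ops on the requested device type, so calling foo directly (without thunder.jit) keeps the CPU computation in float32. A quick check reusing the x and w from above:

with torch.autocast("cuda", torch.bfloat16):
    eager_out = foo(x, w)
print(eager_out.dtype)  # torch.float32: the CPU linear is not affected by the cuda-only autocast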
cc @crcrpar
That's right, in the autocast transform we don't consider the device:
https://github.com/Lightning-AI/lightning-thunder/blob/main/thunder/core/transforms.py#L3788
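For illustration, the missing piece boils down to comparing each tensor's device type against the device the autocast context was requested for before downcasting. A minimal standalone sketch of that check (plain Python, not thunder's actual rule; the helper name is hypothetical):

import torch

def maybe_downcast(t, autocast_device, autocast_dtype):
    # Hypothetical helper: only downcast floating-point tensors that live on
    # the device type the autocast context was requested for.
    if t.is_floating_point() and t.device.type == autocast_device:
        return t.to(autocast_dtype)
    return t

x = torch.randn(16, 16)  # CPU tensor
y = maybe_downcast(x, "cuda", torch.bfloat16)
print(y.dtype)  # torch.float32: left untouched because x is not on a cuda device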
does this have practical impacts on target models?
AFAIK, NeMo does use autocast. With our current implementation, we may silently add conversions when the user asked to apply autocast only on a certain device and there are operations happening on both CPU and GPU inside that context. Honestly, I don't think that happens in practice.
@tfogal do you know if NeMo does both CPU and GPU operations (which are affected by autocast ctx manager) within a single autocast context?
I don't know, sorry :-( @athitten might.
But I agree with you that it is unlikely, so we could just not support it for now. Still, I would ask that we 'loudly' not support mixed-device autocast: can we check for this case and error out when it happens?
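Something like the following standalone guard illustrates the 'fail loudly' idea (the function name and error message are hypothetical, not an existing thunder API):

import torch

def check_autocast_device(args, autocast_device):
    # Hypothetical guard: reject tensor inputs whose device type does not
    # match the device the autocast context was requested for.
    for a in args:
        if isinstance(a, torch.Tensor) and a.device.type != autocast_device:
            raise RuntimeError(
                f"autocast was requested for '{autocast_device}', but an input lives on "
                f"'{a.device.type}'; mixed-device autocast is not supported"
            )

x, w = torch.randn(16, 16), torch.randn(16, 16)  # CPU tensors
check_autocast_device((x, w), "cuda")  # raises RuntimeError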
I’m 100% for failing loudly if it’s not a beaten path (and this one looks like it’s not)
triage review:
- doesn't seem like a priority right now
- yes, we should mirror the behavior of the torch interface (see the snippet after this list)
- we could extend autocast beyond what torch's interface offers (sounds like a good approach), but let's not use torch's interface for that; we can build our own
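For reference on what mirroring torch means here: torch tracks autocast state per device type, so a cuda-only autocast context leaves the CPU state untouched. A quick check (assuming a CUDA-capable machine; otherwise torch disables the cuda autocast with a warning):

import torch

with torch.autocast("cuda", torch.bfloat16):
    print(torch.is_autocast_enabled())      # True: cuda autocast state
    print(torch.is_autocast_cpu_enabled())  # False: cpu autocast state is tracked separately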