CUDA transform of a NeVA graph
## 🐛 Bug

Inside `CUDAGraphTransform`'s `to_arg_descriptor` code, `zip(*map(extract_descriptor, args))` does not produce the values it expects, which leads to a tuple-unpacking `ValueError`:
```
Traceback (most recent call last):
  File "/home/tfogal/dev/nemo/./examples/multimodal/multimodal_llm/neva/neva_pretrain.py", line 484, in thunder_graph_backend4
    fqn(*args) # run once to force compilation
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1748, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tfogal/dev/thunder/thunder/core/module.py", line 81, in forward
    res = self._forward_fn(*args, **kwargs)
  File "/home/tfogal/dev/thunder/thunder/__init__.py", line 787, in fn_
    result = cache_entry.computation_fn(*inps)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "thunder.computation_261", line 8, in computation
    [scale_t, getitem] = CUDAGraph1()
  File "/home/tfogal/dev/thunder/thunder/transforms/cudagraph.py", line 193, in callable
    return self.call_cuda_graph(fn_name, *args)
  File "/home/tfogal/dev/thunder/thunder/transforms/cudagraph.py", line 119, in call_cuda_graph
    cache_key = self.make_cache_key(fn_name, *args)
  File "/home/tfogal/dev/thunder/thunder/transforms/cudagraph.py", line 114, in make_cache_key
    return (fn_name, self.make_static_inputs_mask(fn_name, *args), to_arg_descriptor(*args))
  File "/home/tfogal/dev/thunder/thunder/transforms/cudagraph.py", line 36, in to_arg_descriptor
    dtypes, sizes, strides, non_tensor_args = zip(*map(extract_descriptor, args))
ValueError: not enough values to unpack (expected 4, got 0)
```
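The failure mode is easy to reproduce in isolation: when `args` is empty, `zip()` receives no iterables and yields nothing, so the four-way unpack has zero values to distribute. A minimal stand-alone sketch (with a dummy `extract_descriptor`; the real helper builds tensor descriptors):

```python
def extract_descriptor(arg):
    # Dummy stand-in for the real helper, which returns a
    # (dtype, size, stride, non_tensor) descriptor per argument.
    return (type(arg).__name__, None, None, arg)

def to_arg_descriptor(*args):
    # Mirrors the failing line in thunder/transforms/cudagraph.py.
    dtypes, sizes, strides, non_tensor_args = zip(*map(extract_descriptor, args))
    return dtypes, sizes, strides, non_tensor_args

print(to_arg_descriptor(1, 2.0)[0])  # ('int', 'float') -- fine with non-empty args

try:
    to_arg_descriptor()  # zip() gets no iterables at all...
except ValueError as e:
    print(e)  # not enough values to unpack (expected 4, got 0)
```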
## To Reproduce

Run the NeVA model (#343) and apply the graph transform:
```python
if use_cuda_graph(thunder_graphs):
    xform = thunder.transforms.cudagraph.CUDAGraphTransform()
    fqn = thunder.jit(gm, transforms=[xform])
else:
    fqn = thunder.jit(gm)
fqn(*args)  # run once to force compilation
```
`use_cuda_graph` just takes a counter of the graphs we're seeing, and it returns `True` if it is graph 37 (at least for this bug).
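A sketch of what such a helper might look like (hypothetical; the real `use_cuda_graph` lives in the reproduction harness and is not shown in this issue):

```python
import itertools

# Hypothetical counter-based helper matching the description above:
# count the graphs handed to the backend and return True only for
# graph number 37.
_graph_counter = itertools.count()

def use_cuda_graph(thunder_graphs):
    # The graph module itself is ignored here; only its ordinal matters.
    del thunder_graphs
    return next(_graph_counter) == 37
```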
## Code sample
TomF to edit this in later
## Expected behavior

Transform the program to use CUDA graph(s), or raise a reasonable error explaining why that's not possible.
## Environment

Using thunder 4550f6bf231885cb6745fd8d705357c32ad0f2d0.

```
$ nvidia-smi | grep "CUDA V"
| NVIDIA-SMI 555.42.06    Driver Version: 555.42.06    CUDA Version: 12.5 |
```
## Additional context
We're not yet sure launch latency is on the critical path for NeVA; I am just experimenting to see. Measuring the impact a graph has on launch latency could help confirm this.
---

Thank you for the report, @tfogal.
If `args` is empty, this line produces an empty sequence: https://github.com/Lightning-AI/lightning-thunder/blob/50f587d6ff6c17a8f8392c57a0e6a73b0fe298fb/thunder/transforms/cudagraph.py#L36, and so the unpacking fails.

We should handle empty `args` by special-casing it and setting the four LHS items to empty tuples.
I'll send a PR.
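A minimal sketch of that special case (assumed shape only; `extract_descriptor` is stubbed out here, and the actual PR may differ):

```python
def extract_descriptor(arg):
    # Stub; the real helper returns (dtype, size, stride, non_tensor).
    return (type(arg).__name__, None, None, arg)

def to_arg_descriptor(*args):
    # Special-case empty args so the four-way unpack below cannot fail.
    if not args:
        return (), (), (), ()
    dtypes, sizes, strides, non_tensor_args = zip(*map(extract_descriptor, args))
    return dtypes, sizes, strides, non_tensor_args

print(to_arg_descriptor())  # ((), (), (), ())
```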
The other trouble we'll run into is that the autograd passes delete the static-input marks. This needs fixing and is on our list, but it is quite a refactor: #1138.
---

@tfogal: with #1324 fixing the immediate issue, I would either close this or assign it back to you for updates. Sorry for taking so long.