Refactor integration with PyTorch's Autograd
One of the complications with the previous PyTorch Autograd integration is that we dealt with two traces: forward and backward. They were stitched together and connected by ThunderFunction.apply somewhere in the thunder.jit code. We want to support the composable application of trace transformations in any order, as transform3(transform2(transform1(computation_trace))). Each transform might potentially change the meaning of the backward pass, so it must also be applied there, and it becomes complicated to handle two distinct objects and keep them in sync.
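For illustration only, here is a minimal sketch of the composition we are after; the Trace and Transform names below are stand-ins, not Thunder's actual API. Transforms that map a trace to a trace compose freely, but with two separate traces every such transform also has to be replayed on the backward trace by hand.

```python
from typing import Callable

# Stand-ins for illustration; not thunder's real types.
Trace = dict
Transform = Callable[[Trace], Trace]

def compose(*transforms: Transform) -> Transform:
    """Apply transforms left to right, each consuming the previous trace."""
    def composed(trace: Trace) -> Trace:
        for transform in transforms:
            trace = transform(trace)
        return trace
    return composed

# transform3(transform2(transform1(computation_trace))) is then
# compose(transform1, transform2, transform3)(computation_trace),
# but while forward and backward live in two separate traces the same
# composition must also be mirrored onto the backward trace.
```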
This is a first step in the refactoring process. I think the simplest way to let different transforms interact transparently with PyTorch Autograd registration is to have a special symbol that represents registration into Autograd; one of the inputs to this symbol is a Trace that should be modified when a transform is applied. The details of exactly what the interface of this symbol should be are not yet clear. Maybe the subsymbols of this symbol could represent the backward trace.
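To make the idea concrete, here is a purely hypothetical sketch of such a registration symbol; none of the names below (RegisterAutograd, apply_transform) exist in Thunder, and the real interface is still undecided.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """Stand-in for thunder's Trace: just an ordered list of symbols."""
    bound_symbols: list

@dataclass
class RegisterAutograd:
    """Hypothetical symbol marking registration into PyTorch Autograd.

    The backward trace is carried as an input of the symbol, so a transform
    that changes the meaning of the computation sees a single object and can
    rewrite the backward pass in the same walk, instead of keeping two
    separate traces in sync.
    """
    saved_for_backward: tuple
    backward_trace: Trace

def apply_transform(transform, trace: Trace) -> Trace:
    """Walk the trace; when the registration symbol is reached, replay the
    transform on the embedded backward trace (rewriting of the ordinary
    forward symbols is elided here)."""
    new_symbols = []
    for symbol in trace.bound_symbols:
        if isinstance(symbol, RegisterAutograd):
            symbol = RegisterAutograd(
                symbol.saved_for_backward,
                transform(symbol.backward_trace),
            )
        new_symbols.append(symbol)
    return Trace(new_symbols)
```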
todo:
- [ ] Fill in more details in the description.
Very interesting and exciting, thank you for working on this!
> The details of exactly what the interface of this symbol should be are not yet clear. Maybe the subsymbols of this symbol could represent the backward trace.
I think it would be quite a departure from what subsymbols currently are, unless we manage to have a "build trace from" mechanism or something similar.
Current failures to fix:
FAILED thunder/tests/test_networks.py::test_nanogpt_complete_cuda_graphs_autograd_nvfuser_cuda_float32 - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [4, 64]] is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
FAILED thunder/tests/test_networks.py::test_nanogpt_complete_cuda_graphs_autograd_torch_cuda_float32 - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [4, 64]] is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
============= 2 failed, 15 passed, 3 warnings in 79.89s (0:01:19) ==============
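For context, this is not a repro of the thunder CUDA graphs failure above, just a minimal standalone PyTorch example of the same class of error: the version-counter mismatch is raised when a tensor saved for backward is mutated in place before backward runs, and torch.autograd.set_detect_anomaly(True) helps locate the forward operation whose gradient computation fails.

```python
import torch

# Optional: makes the error point at the forward op whose gradient
# computation fails, as suggested by the hint in the error message.
torch.autograd.set_detect_anomaly(True)

x = torch.randn(4, 64, requires_grad=True)
y = x * 2
z = y * y          # autograd saves y (version 0) to compute dz/dy = 2 * y
y.add_(1.0)        # in-place update bumps y's version counter to 1
z.sum().backward() # RuntimeError: ... modified by an inplace operation ...
```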
Current failure to fix:
=========================== short test summary info ============================
FAILED thunder/tests/test_core.py::test_dataclass_output[True] - AssertionError: argument layout mismatch: [<TensorProxy(name="x", dtype=thunder.dtypes.float32, shape=(3, 3))>, <TensorProxy(name="t2", dtype=thunder.dtypes.float32, shape=(3, 3))>] (<TensorProxy(name="t1", dtype=thunder.dtypes.float32, shape=(3, 3))>, <TensorProxy(name="t4", dtype=thunder.dtypes.float32, shape=(3, 3))>, <TensorProxy(name="t5", dtype=thunder.dtypes.float32, shape=(3, 3))>, <TensorProxy(name="t6", dtype=thunder.dtypes.float32, shape=(3, 3))>, <TensorProxy(name="t7", dtype=thunder.dtypes.float32, shape=(3, 3))>, <TensorProxy(name="t8", dtype=thunder.dtypes.float32, shape=(3, 3))>)
Seems to be superseded. Please reopen if you disagree.