`CUDAGraphExecutor` - limited to static graphs only
🐛 Bug
With https://github.com/Lightning-AI/lightning-thunder/pull/430 merged, we enable the use of `thunder.jit(..., use_cudagraphs=True)`, which wraps the forward callable in `thunder.cudagraphs.CUDAGraphExecutor`. This executor, however, assumes the code has a static structure. We might consider `torch.cuda.make_graphed_callables` as a safer option.
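A minimal sketch of the two paths discussed here, assuming a CUDA device; the toy `Linear` module and tensor shapes are placeholders, and only the `use_cudagraphs=True` keyword comes from the PR above:

```python
import torch
import thunder

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(32, 1024, device="cuda")

# Path enabled by #430: the jitted forward callable is wrapped in CUDAGraphExecutor.
jmodel = thunder.jit(model, use_cudagraphs=True)
y = jmodel(x)

# Alternative mentioned above: let PyTorch graph the callable directly.
# make_graphed_callables handles warmup and static input/output buffers itself.
graphed_model = torch.cuda.make_graphed_callables(model, (x,))
y2 = graphed_model(x)
```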
Being restricted to static code is inherent to CUDA Graphs. Newer CUDA Toolkit versions add dynamic control flow via conditional nodes, but that feature is not yet usable from PyTorch: https://developer.nvidia.com/blog/dynamic-control-flow-in-cuda-graphs-with-conditional-nodes/
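For context, a minimal sketch of manual capture with PyTorch's public `torch.cuda.CUDAGraph` API shows why the captured structure is static; the toy `forward` and buffer names are illustrative only:

```python
import torch

def forward(x, scale):
    if scale is not None:  # Python-level branch; resolved once, at capture time
        x = x * scale
    return torch.relu(x)

static_input = torch.randn(8, 16, device="cuda")

# Warm up on a side stream before capture, as the capture API requires.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        forward(static_input, scale=2.0)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = forward(static_input, scale=2.0)

# Replays re-run exactly the kernels recorded above: shapes, the code path
# taken, and buffer addresses are frozen. New data must be copied into the
# static buffer; a different shape or `scale=None` would require re-capture.
static_input.copy_(torch.randn(8, 16, device="cuda"))
g.replay()
print(static_output)
```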
What additional safety does torch.cuda.make_graphed_callables provide?
No additional safety, I got it wrong. What I mean is that we could turn this into a transform that decides which parts of the program are safe to capture.
triage review:
- Nik's new executor has an allow list / deny list that tags whether an op is CUDA-graph-able (a rough sketch of the idea follows this list)
- @nikitaved what's the current status of the executor?
- priority here: we have issues with prologue latency today, and CUDA graphs could be one solution. We do not have a specific use case that is blocked on latency yet, but it seems likely to come up in the medium term. Graphs are important for scale-out and help mitigate jitter. We have a couple of things ahead of this, priority-wise (getting the NeMo model to work on the training side, matmul quantization on the inference side), but we expect this will be a top priority after that.
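A purely illustrative sketch of the allow-list / deny-list partitioning idea; none of these names (`CUDAGRAPH_SAFE`, `Symbol`, `partition_for_cudagraphs`) are thunder's actual API:

```python
from dataclasses import dataclass

CUDAGRAPH_SAFE = {"add", "mul", "linear", "relu"}        # allow list
CUDAGRAPH_UNSAFE = {"item", "nonzero", "masked_select"}  # deny list: data-dependent

@dataclass
class Symbol:
    op: str

def partition_for_cudagraphs(symbols):
    """Group consecutive symbols into (capturable, [symbols]) regions."""
    regions = []
    for sym in symbols:
        safe = sym.op in CUDAGRAPH_SAFE and sym.op not in CUDAGRAPH_UNSAFE
        if regions and regions[-1][0] == safe:
            regions[-1][1].append(sym)
        else:
            regions.append((safe, [sym]))
    return regions

trace = [Symbol("linear"), Symbol("relu"), Symbol("nonzero"), Symbol("add")]
for capturable, ops in partition_for_cudagraphs(trace):
    print("capture" if capturable else "run eagerly", [s.op for s in ops])
```

Regions alternate between CUDA-graph-capturable spans and ops that must run eagerly (e.g., ones with data-dependent outputs), which is where the graph breaks mentioned below come from.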
@tfogal, the executor is in good shape, just not complete. This means there is no "advanced" logic for handling data-dependent operations and fusion regions between graph breaks (with dynamic shapes, sometimes we can put a tensor into a fusion, sometimes we have to opt out). This bit was intentionally left in this state to collect issues and shape our understanding of how to handle them in our use cases... To summarize, the original issue is still present, but we can fix it, at least partially, by modifying the fusion logic in the current executor.