
`CUDAGraphExecutor` - limited to static graphs only

nikitaved opened this issue May 17 '24 · 4 comments

🐛 Bug

With https://github.com/Lightning-AI/lightning-thunder/pull/430 merged, we enable `thunder.jit(..., use_cudagraphs=True)`, which wraps the forward callable in `thunder.cudagraphs.CUDAGraphExecutor`. This executor, however, assumes the code has a static structure. We might consider `torch.cuda.make_graphed_callables` as a safer option.
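
For reference, a minimal usage sketch of `torch.cuda.make_graphed_callables` (the module and shapes are illustrative); note it still requires static shapes and static control flow in the captured region:

```python
import torch

# Illustrative module; make_graphed_callables captures the forward
# (and backward, since parameters require grad) into CUDA graphs.
model = torch.nn.Linear(128, 128).cuda()
sample_input = torch.randn(8, 128, device="cuda", requires_grad=True)

graphed_model = torch.cuda.make_graphed_callables(model, (sample_input,))

out = graphed_model(sample_input)  # replays the captured forward graph
out.sum().backward()               # replays the captured backward graph
```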

nikitaved avatar May 17 '24 15:05 nikitaved

Being restricted to static code is inherent to CUDA Graphs. Newer CUDA Toolkit versions add dynamic control flow via conditional nodes, but that feature is not yet usable from PyTorch: https://developer.nvidia.com/blog/dynamic-control-flow-in-cuda-graphs-with-conditional-nodes/

What additional safety does `torch.cuda.make_graphed_callables` provide?
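
For illustration, here is the kind of callable that cannot be captured as a single static graph; a minimal sketch, not thunder code:

```python
import torch

def f(x):
    # The branch depends on tensor *values*: `if tensor.sum() > 0`
    # implicitly calls bool()/.item(), which synchronizes with the GPU
    # and fails inside CUDA graph capture. Without conditional nodes,
    # a single captured graph cannot represent both paths.
    if x.sum() > 0:
        return x * 2
    return x - 1
```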

IvanYashchuk avatar Jun 14 '24 06:06 IvanYashchuk

No additional safety, I got it wrong. What I mean is that we could make it a transform that decides which parts of the program are safe to capture.

nikitaved avatar Jun 14 '24 10:06 nikitaved

triage review:

  • Nik's new executor has an allow list / deny list that tags whether an op is CUDA-graph-capturable (see the sketch after this list)
  • @nikitaved what's the current status of the executor?
  • priority: we have issues with prologue latency today, and CUDA graphs could be one solution. No specific use case is blocked on latency yet, but one seems likely to arrive in the medium term. Graphs are important for scale-out and help mitigate jitter problems. A couple of things are ahead of this, priority-wise (getting the NeMo model to work on the training side, matmul quantization on the inference side), but we expect this to be a top priority after that
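
A hypothetical sketch of the allow/deny-list partitioning mentioned above; the names and op tags are illustrative, not thunder's actual API:

```python
GRAPH_SAFE = {"add", "mul", "linear", "relu"}        # capturable ops
GRAPH_UNSAFE = {"item", "nonzero", "masked_select"}  # data-dependent ops

def partition(ops):
    """Split a flat op sequence into runs tagged capturable or not."""
    regions, current, current_safe = [], [], None
    for op in ops:
        safe = op in GRAPH_SAFE and op not in GRAPH_UNSAFE
        if current and safe != current_safe:
            regions.append((current_safe, current))
            current = []
        current.append(op)
        current_safe = safe
    if current:
        regions.append((current_safe, current))
    return regions

# partition(["linear", "relu", "nonzero", "add"]) ->
# [(True, ["linear", "relu"]), (False, ["nonzero"]), (True, ["add"])]
```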

tfogal avatar Jul 15 '24 15:07 tfogal

@tfogal, the executor is in good shape, just not complete: there is no "advanced" logic yet for handling data-dependent operations and for forming fusion regions between graph breaks (with dynamic shapes, a tensor can sometimes be put into a fusion, and sometimes we have to opt out). This was left in this state intentionally, to collect issues and to shape our understanding of how to handle them in our use cases. To summarize, the original issue is still present, but we can fix it, at least partially, by modifying the fusion logic in the current executor.
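
For context, the shape constraint behind those opt-outs: a captured CUDA graph bakes in buffer sizes and replays only for the capture-time shapes. A minimal sketch with the stock PyTorch capture API (not thunder's executor):

```python
import torch

g = torch.cuda.CUDAGraph()
static_x = torch.randn(8, 128, device="cuda")
static_y = torch.empty_like(static_x)

# Warm up on a side stream before capture, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_y.copy_(static_x.relu())
torch.cuda.current_stream().wait_stream(s)

with torch.cuda.graph(g):
    static_y.copy_(static_x.relu())

# Replay is valid only for the capture-time shape (8, 128):
static_x.copy_(torch.randn(8, 128, device="cuda"))
g.replay()  # recomputes static_y in place
# A (16, 128) input could not reuse this graph; it would need
# re-capture (or a graph break around the dynamic region).
```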

nikitaved avatar Jul 15 '24 16:07 nikitaved