Have a method to compare the speed of different parts of training between compilation backends
🚀 Feature
Have a method to annotate pieces of training code (e.g. benchmark_litgpt) so we can easily and automatically compare the effectiveness of different compilation methods / versions of Thunder on those pieces instead of on the whole training loop.
Motivation
Right now we provide regular benchmarking of Thunder on LitGPT models. If we spot that Thunder is slower than Inductor or that there is a regression, the cause must be found manually by investigating logs and mapping kernels between nsys profiling reports, where the kernels have different names across backends. This process is hard to speed up and automate.
If we were able to split the training loop into smaller chunks and annotate them, we could automatically find the chunks that are slower or could be improved and report them together with the benchmarking results. This would let us spot issues much faster.
Pitch
One idea would be to use NVTX markers. I was able to add them to LitGPT models, so we know the duration of the forward pass for each module.
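Roughly, something like the sketch below, using forward hooks to push/pop NVTX ranges per module (the helper name `add_nvtx_forward_markers` and the hook-based approach are just an illustration, not an existing API):

```python
import torch
import torch.nn as nn

def add_nvtx_forward_markers(model: nn.Module):
    """Wrap every submodule's forward pass in an NVTX range so that nsys
    can attribute GPU time to individual modules."""
    handles = []
    for name, module in model.named_modules():
        if not name:  # skip the root module itself
            continue
        # Push a range just before the module's forward runs...
        handles.append(module.register_forward_pre_hook(
            lambda mod, args, _name=name: torch.cuda.nvtx.range_push(_name)))
        # ...and pop it once the forward returns.
        handles.append(module.register_forward_hook(
            lambda mod, args, output: torch.cuda.nvtx.range_pop()))
    # Return the handles so the hooks can be removed after profiling.
    return handles
```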
The resulting timeline looks good for Inductor and eager mode; however, with Thunder some of the markers get removed. Maybe there would be a way to preserve them by applying some rules, e.g. when there is a problem with fusing operations?
Another problem is that this way we miss the backward passes. I found that this can be solved with https://pytorch.org/docs/stable/autograd.html#torch.autograd.profiler.emit_nvtx, and I'm looking into it right now.
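For completeness, this is roughly what I have in mind (a sketch only; `profiled_step`, `loss_fn`, etc. are placeholders, not the actual benchmark_litgpt code):

```python
import torch
from torch.autograd.profiler import emit_nvtx

def profiled_step(model, batch, targets, loss_fn, optimizer):
    # torch.cuda.profiler.profile() starts/stops CUDA profiling around the step,
    # and emit_nvtx() makes every autograd op (forward and backward) emit an
    # NVTX range, so the backward pass also shows up in the nsys timeline.
    with torch.cuda.profiler.profile():
        with emit_nvtx():
            optimizer.zero_grad()
            loss = loss_fn(model(batch), targets)
            loss.backward()
            optimizer.step()
    return loss
```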
Alternatives
If there are any alternatives, please let me know.