Add nvtx around some important components.
What does this PR do?
Starts to add NVTX ranges around (some of) the larger, or at least more variable, components of our compilation and execution. These ranges help with identifying performance issues.
Some quick examples of what this does: at a high level, we quickly see where we're spending compilation time:
Drilling into performance of actual execution:
On the right, one can see how the ranges from the NvtxProfileTransform (and, in this case, nvFuser's own NVTX ranges) nest within the added computation_fn.
The intent with this PR is to capture the largest elements of where our time is spent. You may notice in the screenshots that there is still some time unaccounted for; I am happy to go further based on feedback, but felt this is in a good enough place to start soliciting it.
PR review
Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
How do people feel about an nvtx package dependency?
The nvtx library includes the concept of a 'domain' argument that can isolate different libraries in a trace. Unfortunately, torch's nvtx.range_push() does not support this.
The nvtx Python package does support it. If we used the nvtx package, then instead of naming each NVTX region "thunder foo" we could just use a "thunder" domain. This gives us a dedicated thunder row when looking at traces:
which can be nice. Of course, this means we have a dependency that we wouldn't otherwise have.
The latest version tries to use nvtx domains, if the package is available, and falls back to torch's implementation if not. But reasonable people may prefer seeing all the nvtx together 🤷 . Please comment.
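A minimal sketch of what such a fallback chain could look like (the names here are illustrative, not the actual PR code): prefer the nvtx package and its domain support, fall back to torch's bindings, and degrade to a no-op when neither is usable.

```python
from contextlib import contextmanager, nullcontext

def _pick_range_factory():
    """Return a callable name -> context manager, preferring nvtx's domains."""
    try:
        import nvtx  # standalone package: supports domain=
        return lambda name: nvtx.annotate(name, domain="thunder")
    except ImportError:
        pass
    try:
        import torch
        # CPU-only torch builds lack the NVTX symbols; probe once up front.
        torch.cuda.nvtx.range_push("probe")
        torch.cuda.nvtx.range_pop()

        @contextmanager
        def torch_range(name):
            torch.cuda.nvtx.range_push(name)  # no domain argument available
            try:
                yield
            finally:
                torch.cuda.nvtx.range_pop()
        return torch_range
    except Exception:
        # Neither implementation is usable: annotations become no-ops.
        return lambda name: nullcontext()

thunder_range = _pick_range_factory()

with thunder_range("compile"):
    result = sum(range(10))  # stand-in for real work
```

One design note: resolving the fallback once, at import time, keeps the per-range cost to a single call regardless of which backend was picked.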
Before using NVTX domains, the execution-time overhead should be measured. From https://nvtx.readthedocs.io/en/latest/annotate.html#domains:
> Domains should be used sparingly as they are expensive to create.
From this statement, it's not clear whether 'expensive' refers to file size or CPU overhead.
They are expensive to create but cheap otherwise. The nvtx library initializes on first use, so the first range created in a domain is the expensive one.
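One way to keep that one-time cost off the hot path (a sketch based on the statement above, not the PR's code) is to emit a throwaway range at startup, so the "thunder" domain is already initialized before any timing-sensitive region uses it:

```python
warmed_up = False
try:
    import nvtx

    # The first use of a domain pays the lazy-initialization cost, so we
    # pay it here, at startup, rather than inside a measured region.
    with nvtx.annotate("warmup", domain="thunder"):
        pass
    warmed_up = True
except ImportError:
    pass  # without the nvtx package there is nothing to warm up
```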
I don't think Nsight Systems (nsys) is the right tool for everyone to do Python CPU profiling. Yes, these annotations are useful for those who use nsys, but they are completely useless for others and add unnecessary complexity to the code.
I haven't found any other CPU profilers to be useful for our situation. Which ones do you see success with?
While it's okay to annotate the hot path for easier mapping of regions to GPU kernels, it should be done using Pythonic constructs like decorators and context managers. It would be awesome to refactor the code of the main function _fn to use clear stages calling into other functions, so that any other profiling instrumentation could use this information without a practitioner resorting to line profilers or range-based annotations like here.
Refactors are always nice, but I cannot sign up for that here just for profiling, sorry.
I agree with Masaki that there should be a cohesive style for these annotations.
Okay, could you (or Masaki) please comment either way? I am happy to do whatever here but the guidance at present is "away from" and not "towards" something.
I must admit I'm not enthused by how we just add whatever.
- Can we please have an abstraction where we just have a single line of profile.start_segment("foo") or so in various places (plus one start and end in some decorator) and then that figures out what to do (nvtx or no-op, I don't think layers of automated fallback is good). Preferably, this abstraction would live outside thunder/__init__.py. Then at the end have something.
"have something" ... -- did you mean to end the sentence here?
Anyway yes, I am happy to move all this logic to a separate file and create a decorator there. I'm not sure I understand your start_segment comment though: the "plus" is confusing here, as I imagine with start/end in a decorator we wouldn't need the start_segment, right? Did I misunderstand something?
- If we decide that this is how we profile, let's get rid of all the timings taken with perf_counter_ns. We can't just pile on without removing the approaches that didn't work out.
"If we decide" -- this is really a you decision, Tom ;-). What do you want to see?
I would argue that nvtx is the only method that composes. If Thunder has some custom API for pulling out times, nvFuser has its own way, NeMo has another way, and xyz library has yet a different way, then one has to write a bunch of code to get a coherent view of what's going on. That's pretty daunting.
That said, again, it's your choice, TomV: where would you like the project to go?
"If we decide" -- this is really a you decision, Tom
I'm glad we get a new way, but let's remove the old, too.
TomV, Ivan, and I talked offline.
I am going to repurpose this PR to add a decorator that uses either the nvtx library or is a no-op. It won't actually be used at this point. Ivan will restructure thunder.jit to be more amenable to having a decorator attached. Then we'll come back and apply the decorator at key points.
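The decorator described here might be sketched roughly as follows (the names `annotate_for_profile` and the decorated function are illustrative placeholders, not necessarily what the PR will use):

```python
import functools
from contextlib import nullcontext

try:
    import nvtx

    def _annotate(name):
        # Group all of our ranges under a "thunder" domain row in the trace.
        return nvtx.annotate(name, domain="thunder")
except ImportError:
    def _annotate(name):
        return nullcontext()  # nvtx unavailable: the decorator is a no-op

def annotate_for_profile(name):
    """Wrap a function in an NVTX range, or do nothing if nvtx is absent."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with _annotate(name):
                return fn(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical usage at one of the "key points" mentioned above:
@annotate_for_profile("acquire initial trace")
def acquire_initial_trace():
    return "trace"
```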
We have looked at nvtx overhead in C, and it's on the order of 300-800 nanoseconds (incl. both the 'start' and 'end' markers) when the profiler is attached, and within the noise of timer resolution when it is not. This would imply that we probably don't want to add NVTX ranges around anything below ~2 µs. However, the overhead may be worse in Python than it is in C; Ivan is going to measure.
Below are the timings for the call overhead in Python when the profiler is not attached. We can see that the decorator from nvtx is cheap even with a domain specified. Interestingly, the NVTX decorator from PyTorch is roughly 10x slower than the one from the nvtx Python package.
```
In [1]: import nvtx

In [2]: def f(): pass

In [3]: nvtxed_f = nvtx.annotate("test")(f)

In [4]: %%timeit
   ...: nvtxed_f()
   ...:
129 ns ± 0.147 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: import torch

In [6]: torch_nvtxed_f = torch.cuda.nvtx.range("test")(f)

In [7]: %%timeit
   ...: torch_nvtxed_f()
   ...:
1.26 µs ± 2.71 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [8]: nvtxed_f_with_domain = nvtx.annotate("test", domain="thunder")(f)

In [9]: %%timeit
   ...: nvtxed_f_with_domain()
   ...:
129 ns ± 0.273 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
```
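For reference, the same measurement can be reproduced outside IPython with the stdlib timeit module (a sketch; absolute numbers will vary by machine, and the nvtx-decorated variant only runs if the package is installed):

```python
import timeit

def f():
    pass

def per_call_ns(fn, n=1_000_000):
    """Average wall-clock cost of one call to fn, in nanoseconds."""
    return timeit.timeit(fn, number=n) / n * 1e9

print(f"bare call:      {per_call_ns(f):7.1f} ns")

try:
    import nvtx

    nvtxed_f = nvtx.annotate("test", domain="thunder")(f)
    print(f"nvtx-decorated: {per_call_ns(nvtxed_f):7.1f} ns")
except ImportError:
    print("nvtx package not installed; skipping the decorated variant")
```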
A note for one of your requests, @t-vi: mixology is actually using some of the manual timer code that is in there. As such, I'd appreciate it if we could have a period where both mechanisms exist in the code (as ugly as that is), to give the mixology team time to update their code accordingly. I am happy to take on removing the legacy mechanism after mixology has upgraded, though.
Sounds like a plan, who is driving the upgrade to the new method at mixology?
Nobody yet; I've given a heads-up this is coming but it's not actionable for them until the new mechanism is in place.
I'll remind them at a Tuesday meeting and maybe we can come up with a loose schedule/timeline. I don't believe it's reasonable for me to push hard on them to schedule this until we have our own schedule for when they could start, sadly.
ping @t-vi: this is waiting on your review and/or a comment on whether we should extend this to include torch.profiler.profile too.
Sounds like a plan, who is driving the upgrade to the new method at mixology?
Nobody yet; I've given a heads-up this is coming but it's not actionable for them until the new mechanism is in place.
Martyna and I talked and they actually use a different mechanism for identifying compilation time. So, we can rip out the old timers whenever.
I'd like to avoid ripping out the unused timers in this PR as it will likely produce conflicts for Ivan's #1333, but I am happy to do it in general.