
Flush Exact L2 Cache Size in Benchmarking

Open nmacchioni opened this issue 1 year ago • 14 comments

I'm currently working on reducing Inductor's compile time overhead in max-autotune-gemm mode. As part of this effort, I profiled some individual matmul autotunings and noticed that do_bench was particularly expensive. At first I thought, "well that makes sense, that is where the benchmarking takes place after all", but something still seemed off. Benchmarking ~30 template choices, for a relatively small kernel whose runtime should be about ~0.1ms, was still taking upwards of 30s. Numbers didn't seem to add up; we use the default warmup/rep settings (25ms and 100ms, respectively), and I was assuming even with overhead we shouldn't be taking more than 0.5s to benchmark each choice. So, I dug into the source code of do_bench. Some lines immediately jumped out at me...

```python
# We maintain a buffer of 256 MB that we clear
# before each kernel call to make sure that the L2
# doesn't contain any input data before the run
if fast_flush:
    cache = torch.empty(int(256e6 // 4), dtype=torch.int, device='cuda')
else:
    cache = torch.empty(int(256e6), dtype=torch.int8, device='cuda')
```

Wow, 256MB seems like a lot to flush especially if we're doing so for each timing iteration. For small kernels we might do >1000 timing iterations, that's a total of 256GB flushed to benchmark a single template choice!

Just to be sure, I went ahead and removed the cache flushing. Obviously, the performance results were totally bogus as we were now hitting the cache left and right! But, more importantly, the benchmarking overhead dropped like a rock! We were now observing an overhead much closer to warmup + rep.

So, is there a way to reduce the size of the flush without impacting performance? Yes: we can flush only the actual cache size. We tried exactly that; by querying the exact L2 cache size from the third-party drivers and flushing only that much, we were able to reduce the runtime of the matrix multiplication tutorial from 48s to 30s on our local H100 machine. An important note here is that we decided to hardcode n_warmup/n_repetition to 100/400, since the current calculation of these values actually depends on the cache flushing overhead (the cache flushing is included in the runtime estimation timing).
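
(Roughly, the idea looks like the sketch below. It uses the cuda-python runtime bindings to ask the device for its L2 size; the actual change goes through Triton's own driver layer, so treat this as an illustration rather than the code from the PR.)

```python
import torch
from cuda import cudart  # assumption: cuda-python is available; the PR queries Triton's driver instead

def get_l2_cache_size(device: int = 0) -> int:
    # cudaDevAttrL2CacheSize reports the L2 size in bytes (~50 MB on H100).
    err, l2_size = cudart.cudaDeviceGetAttribute(
        cudart.cudaDeviceAttr.cudaDevAttrL2CacheSize, device
    )
    assert err == cudart.cudaError_t.cudaSuccess
    return l2_size

# Allocate a flush buffer matching the L2 size instead of a fixed 256 MB.
cache = torch.empty(get_l2_cache_size() // 4, dtype=torch.int, device="cuda")
cache.zero_()  # one "flush": overwrite the buffer so no benchmark inputs survive in L2
```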

But do these numbers make sense? Let's check it out. An H100 has a 50MB L2 cache, so we've effectively reduced the size of each cache flush by 206MB. Cache flushing doesn't occur in the warmup stage, so for each benchmark we only do 5 (estimation loop) + 400 (timing loop) = 405 individual flushes. Saving 206MB per flush, that's a total of 83.43GB of flushing saved per benchmark. The matrix multiplication tutorial covers 31 different matmuls, and each is given 16 different configs during the autotuning stage. So, in total we need to do 31 * 16 = 496 benchmarks. That gives us an overall savings of ~41.4TB of flushing! Assuming all cache flushes occur at or close to the 2TB/s memory bandwidth limit on H100, this equates to ~20s saved. This seems to check out, since in practice we saved 18s!
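
(The same arithmetic as a quick script, with the H100 L2 size and 2TB/s bandwidth taken as assumptions from above:)

```python
# Back-of-envelope check of the savings claimed above (H100 numbers assumed).
flush_old_mb = 256                # default do_bench flush buffer
flush_new_mb = 50                 # approximate H100 L2 size
flushes_per_benchmark = 5 + 400   # estimation loop + timing loop (warmup does not flush)
benchmarks = 31 * 16              # matmuls in the tutorial * configs per matmul

saved_mb = (flush_old_mb - flush_new_mb) * flushes_per_benchmark * benchmarks
saved_tb = saved_mb / 1e6
seconds_saved = saved_tb / 2.0    # assuming flushes run near the ~2 TB/s bandwidth limit

print(f"{saved_tb:.1f} TB less flushing, ~{seconds_saved:.0f} s saved")
# -> about 41.4 TB and ~21 s, in line with the ~20 s estimate above
```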

nmacchioni avatar May 19 '24 08:05 nmacchioni

Local testing shows that this is about 70ms faster per-flush on H100.

H100 has 2-3TB/s of memory bandwidth, depending on exactly which version you have. That's equivalent to 2-3GB/ms. I'm very suspicious that touching a mere 256MB would take 70ms, that's ridiculously slow. (Even though I see that we touch it 5 times, that's still ridiculous.)

jlebar avatar May 19 '24 12:05 jlebar

(A problem with flushing exactly the L2 cache size is that you don't actually know that this clears the cache. For example H100 has two separate L2s, and I don't know how they work. But again, touching 4MB vs 256MB should not be a substantial difference, certainly not >1ms on H100, so if you're observing that, it seems like something is wrong.)

jlebar avatar May 19 '24 12:05 jlebar

Local testing shows that this is about 70ms faster per-flush on H100.

H100 has 2-3TB/s of memory bandwidth, depending on exactly which version you have. That's equivalent to 2-3GB/ms. I'm very suspicious that touching a mere 256MB would take 70ms, that's ridiculously slow. (Even though I see that we touch it 5 times, that's still ridiculous.)

I got this measurement by calculating the average difference between the estimated runtime and the benchmarked runtime, which makes me realize now that it might be 0.7ms.

nmacchioni avatar May 19 '24 15:05 nmacchioni

(A problem with flushing exactly the L2 cache size is that you don't actually know that this clears the cache. For example H100 has two separate L2s, and I don't know how they work. But again, touching 4MB vs 256MB should not be a substantial difference, certainly not >1ms on H100, so if you're observing that, it seems like something is wrong.)

I'm also no expert on how the H100 L2 cache works, so maybe something is wrong on my end like you suggest? I'd be interested to know if you see the same speedups on the matrix multiplication tutorial as I do, although for it to be a fair comparison we'd likely have to hardcode warmup/rep iterations instead of computing them based on the estimated runtime.

nmacchioni avatar May 19 '24 15:05 nmacchioni

I'd be interested to know if you see the same speedups on the matrix multiplication tutorial as I do

I mean, yeah, if you change the number of iterations, that's going to change the runtime... It's also potentially going to change the performance.

I don't have a suggestion on how to clear the L2 better or how to "tell" that we're hitting the entire L2. I do know that B100 has two dies, so probably has an even more NUMA L2 architecture...

Autotuning is really a black art. There are thermal considerations ("warmup" can make the GPU hotter and slow it down!), and the exact numeric values you push through the GPU can affect performance... I'm not saying the existing code gets it right, by any means. But I'd rather not mess around with it, since it's so hard to say whether X change is an improvement, especially since the premise of this bug is that 256MB takes 70ms, which we no longer think is correct.

If you really need this at Meta, we can consider making the parameters configurable?

jlebar avatar May 19 '24 18:05 jlebar

Autotuning is really a black art. There are thermal considerations ("warmup" can make the GPU hotter and slow it down!), and the exact numeric values you push through the GPU can affect performance... I'm not saying the existing code gets it right, by any means. But I'd rather not mess around with it, since it's so hard to say whether X change is an improvement, especially since the premise of this bug is that 256MB takes 70ms, which we no longer think is correct.

I totally agree with this. I think autotuning is really, really hard to get right and there is never really a perfect solution for the exact reasons you mentioned. Which, in these cases, is why I like to do thorough testing to make sure that the changes are "good". For context, we're already planning to do this sort of "smart flushing" in Inductor, in combination with heavily reducing the warmup/rep count, since this has been shown to significantly reduce autotuning overhead without affecting E2E performance of the various Torchbench/Huggingface/TIMM models covered by the Inductor performance benchmarking.

I do still think that this smart flushing is significantly faster than the default 256MB flush; to test this I again used the matrix multiplication tutorial, but instead of having variable n_warmup and n_repeat I pinned them to 100 and 400, respectively, so that they would not be influenced by any change in estimation_ms. This reduced the runtime of the tutorial from 48s to 30s.

I think the real crux of the problem here is that n_warmup and n_repeat depend on the cache flush duration. If there is no agreeable way to reduce the default warmup/rep, to account for the decreased cache flush duration, then how about landing the smart flush without altering these values? Then we would still be able to take advantage of the faster flushing, and by default we would now do more repetitions during benchmarking, which should give more stable numbers. This would also give us the option to reduce warmup/rep internally by simply adjusting the function variables.

nmacchioni avatar May 19 '24 18:05 nmacchioni

I do still think that this smart flushing is significantly faster than the default 256MB flush; to test this I again used the matrix multiplication tutorial, but instead of having variable n_warmup and n_repeat I pinned them to 100 and 400, respectively, so that they would not be influenced by any change in estimation_ms. This reduced the runtime of the tutorial from 48s to 30s.

Does this mean we're zero'ing 256mb 100+400 times? In other words we're touching 256mb * 500 = 128GB. At 2TB/s, this should still take well under 1 second. How the heck does it add 18 seconds to the total runtime? That works out to a bandwidth of 128GB/18s = 7GB/s, which is hundreds of times slower than it should be.

Before we start optimizing 256mb to something smaller, this is the thing I think we should understand.

I think the real crux of the problem here is that n_warmup and n_repeat depend on the cache flush duration.

I agree in principle that n_warmup and n_repeat should not be "variable". But the cache flush duration should just be a measure of the GPU's memory bandwidth, which to a first approximation is a constant.

how about landing the smart flush without altering these values?

By the "smart flush" dym flushing a smaller amount of memory? If so then see above, this really should not be so slow based on my current understanding (maybe we're missing something?), and I'm concerned that flushing a much smaller amount of memory is not actually guaranteed to flush the weird split L2 cache. If not then lmk wym.

jlebar avatar May 19 '24 20:05 jlebar

Does this mean we're zero'ing 256mb 100+400 times? In other words we're touching 256mb * 500 = 128GB. At 2TB/s, this should still take well under 1 second. How the heck does it add 18 seconds to the total runtime? That works out to a bandwidth of 128GB/18s = 7GB/s, which is hundreds of times slower than it should be.

Before we start optimizing 256mb to something smaller, this is the thing I think we should understand.

The tutorial does ~30 matmuls, each is tuned against 16 different configs. Also worth noting that the warmup stage does not flush the cache. So we have 30 * 16 benchmarking runs or ~500 total. That's 400 cache flushes (during actual timing loop) per benchmark or 20,000 flushes. At 256MB per flush that's ~5TB. Compared to flushing the actual L2 cache size of 50MB on H100 we'd do only ~1TB of flushing. At the 2TB/s mark that would save 2s, but in reality we're most definitely not hitting the maximum bandwidth for these flushes.

Sorry, I can give a more detailed response in a bit; hopping on a plane in a moment.

nmacchioni avatar May 19 '24 20:05 nmacchioni

18s for 5TB is still only 277GB/s. AFAICT we're calling torch.zero to do the flush. If that runs at 1/10 the memory bandwidth on H100, that seems like a bug in pytorch?

Fix that and you'll go from 18s to 2.5s of overhead, which seems like it would be negligible when the rest of the work takes 30s? And then we wouldn't have to worry about whether touching 50mb actually clears the whole L2, which, again, we do not know given the split L2s.

jlebar avatar May 19 '24 20:05 jlebar

18s for 5TB is still only 277GB/s. AFAICT we're calling torch.zero to do the flush. If that runs at 1/10 the memory bandwidth on H100, that seems like a bug in pytorch?

I benchmarked the cache.zero_() step and it definitely should be running at ~2.4TB/s for a 256MB flush, and just a tad slower for 50MB flushes. So you're right, the 4TB reduction definitely doesn't make total sense. Let me double back and check that 5TB number for 256MB flushes.
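
(A minimal sketch of this kind of measurement, timing cache.zero_() with CUDA events and converting to effective bandwidth; illustrative only, not the exact script used, and the helper name is made up:)

```python
import torch

def flush_bandwidth_tb_per_s(size_bytes: int, iters: int = 100) -> float:
    # Time cache.zero_() with CUDA events and report effective write bandwidth.
    cache = torch.empty(size_bytes // 4, dtype=torch.int, device="cuda")
    cache.zero_()  # warm up and materialize the allocation
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        cache.zero_()
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    return size_bytes / (ms * 1e-3) / 1e12

print(flush_bandwidth_tb_per_s(int(256e6)))  # ~2.4 TB/s reported above for the 256MB buffer
print(flush_bandwidth_tb_per_s(int(50e6)))   # a bit lower for the 50MB buffer
```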

Fix that and you'll go from 18s to 2.5s of overhead, which seems like it would be negligible when the rest of the work takes 30s? And then we wouldn't have to worry about whether touching 50mb actually clears the whole L2, which, again, we do not know given the split L2s.

I don't have access to test B100 at the moment; I can definitely reach out to Nvidia contacts and ask them about this behavior.

nmacchioni avatar May 19 '24 23:05 nmacchioni

Wondering if this makes sense...

Flushing 256MB at 2TB/s should take roughly 256MB / (2 * 1000 * 1000 MB/s) = 0.000128s. By the same logic, flushing 50MB (H100 L2 cache size) would take 0.000025s. So, trading a 256MB flush for a 50MB flush would save (0.000128s - 0.000025s) = 0.000103s per flush. From the logic before (400 flushes per benchmark, 30 matmuls, 16 choices per matmul) this should accumulate to (400 * 30 * 16 * 0.000103s) = 19.776s saved in total. This is very close to our actual observed savings, although a slight overestimate which we might attribute to 50MB flushes being slightly less efficient (not hitting the full 2TB/s bandwidth) due to their relatively small size.

nmacchioni avatar May 19 '24 23:05 nmacchioni

I now see the issue in my logic before, 400 * 500 is not 20,000 it's 200,000, oops...

So that line of reasoning still holds up, 256MB flushes will transfer ~50TB of data.

nmacchioni avatar May 19 '24 23:05 nmacchioni

Aha, yay, we understand now. :)

Running each kernel 500 times seems really excessive to me. For example in XLA we'd run each kernel once. :) I wonder if we should tweak this or make it configurable. It'll make everything faster.

jlebar avatar May 20 '24 17:05 jlebar

Running each kernel 500 times seems really excessive to me. For example in XLA we'd run each kernel once. :) I wonder if we should tweak this or make it configurable. It'll make everything faster.

Yes! The 500 iterations was just a rough estimate which would correspond to an estimate_ms of 0.2ms; if we factor in the overhead of flushing 256MB, (256MB / 2000MB/ms) = ~0.13ms then really any kernel with a runtime around ~0.07ms would have 500 iterations. In reality, a lot of kernels are even smaller than that leading to a massive number of iterations.

Aha, yay, we understand now. :)

Now that we fully understand why we're seeing a speedup, I could use your opinion on how to make this landable. As we discussed before, the runtime estimation is dependent on the overhead of flushing the cache and as such landing this improvement to the flushing mechanism counterintuitively causes slowdown in practice.

As an example let's use a kernel with an actual runtime of 0.05ms as a baseline. Currently, with the 256MB flush this would have an estimated runtime of ~0.18ms. This would result in (25ms/0.18ms) = ~140 warmups and (100ms/0.18ms) = ~556 benchmarking iterations. If we want to include the 5 estimation iterations, then we have ~560 total iterations that do cache flushing (estimation + benchmarking) and ~140 iterations that don't do cache flushing (warmups). So, that's 700 iterations of the kernel in total, and 560 cache flushes. The total compute time should then be around (700 * 0.05ms) = 35ms for the actual kernel and (560 * 0.13ms) = 72.8ms for the cache flushing, or in total ~108ms.

Now, let's say we swap the 256MB flush for the 50MB flush. The flush now takes ~0.03ms so our estimation would be much lower, ~0.08ms. We'd now have about 312 warmup iterations and 1250 benchmarking iterations. Again with the estimations included, that comes to 1567 iterations of the kernel and 1255 iterations of the flushing. So, our total compute time becomes (1567 * 0.05ms) = 78.35ms for the kernel and 37.65ms for the cache flushing, or a combined ~116ms.
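
(To make the dependence explicit, here is a simplified model of how do_bench-style iteration counts react to the flush cost; the exact upstream logic may differ slightly, and kernel_ms/flush_ms are just the assumed numbers from the example above.)

```python
def iteration_counts(kernel_ms, flush_ms, warmup=25.0, rep=100.0):
    # The estimation loop times kernel + flush together, so the flush cost
    # leaks into estimate_ms and therefore into the derived iteration counts.
    estimate_ms = kernel_ms + flush_ms
    n_warmup = max(1, int(warmup / estimate_ms))
    n_repeat = max(1, int(rep / estimate_ms))
    return n_warmup, n_repeat

print(iteration_counts(0.05, 0.13))  # 256MB flush -> roughly (140, 555)
print(iteration_counts(0.05, 0.03))  # 50MB flush  -> roughly (312, 1250)
```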

On paper, this doesn't seem like a huge difference. In reality, the slowdown is a bit larger just from the overhead alone of doing >2x the number of iterations. I have a couple solutions in mind, some I hate and some I dislike less.

  1. We could take the slowdown in favor of more iterations. I hate this because it goes against the entire purpose of the PR, and >500 iterations is already way, way too many in my opinion.

  2. We could detach the cache flushing from the estimation timing and scale back up by the theoretical overhead of flushing 256MB. Basically, we would time each iteration of the estimation separately and move the cache flushing outside the timing events; then, we could shift the estimation upwards by (256MB / 2000MB/ms) ~0.13ms (on H100 at least). This would bring us close to the current number of warmup/benchmark iterations. I also hate this idea, because we'd have to calculate the theoretical cache flush overhead based on the current device.

  3. We could reduce the default warmup/benchmark values from the current 25ms/100ms settings. I don't like this idea, since it doesn't scale linearly based on the runtime of the benchmarked kernels. We might be able to get the small kernels more in line with the current warmup/benchmark iterations, but we'd effectively be reducing those numbers for larger kernels whose estimation is not affected as heavily by the cache flush overhead.

  4. We could add variable caps to the number of warmup/benchmark iterations that users can fine tune to their own needs. This idea is not so horrible, but I still don't like the idea of adding additional parameters.

  5. We could detach the cache flushing from the estimation timing and do (4). I think I like this idea the best; we still have the problem of adding parameters, but I think we'd be doing a better job of providing what the user expects. In this case, if the user wants to do 25ms/100ms of warmup/benchmark they can be certain that it is their kernel running for 25ms/100ms and not a combination of their kernel + cache flushing. (A rough sketch of what this could look like follows this list.)
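
(A rough, purely illustrative sketch of (5): the estimation loop times only the kernel via per-iteration CUDA events, with the flush outside the timed region, and the derived counts can optionally be capped as in (4). Function and parameter names here are made up, not from any PR.)

```python
import torch

def estimate_kernel_ms(fn, cache, n=5):
    # Time only fn(); cache.zero_() happens outside the event pair, so the
    # estimate no longer depends on how large the flush buffer is.
    times = []
    for _ in range(n):
        cache.zero_()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    return sum(times) / n

def iteration_counts(estimate_ms, warmup=25.0, rep=100.0,
                     max_warmup=None, max_repeat=None):
    # Usual warmup/rep derivation, plus optional user-tunable caps.
    n_warmup = max(1, int(warmup / estimate_ms))
    n_repeat = max(1, int(rep / estimate_ms))
    if max_warmup is not None:
        n_warmup = min(n_warmup, max_warmup)
    if max_repeat is not None:
        n_repeat = min(n_repeat, max_repeat)
    return n_warmup, n_repeat
```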

Interested to hear your thoughts on this :)

nmacchioni avatar May 21 '24 01:05 nmacchioni

@jlebar pinging in case this got lost in your inbox

nmacchioni avatar May 23 '24 19:05 nmacchioni

Sorry, I don't have cycles to dig very deep into this.

I tried to give the following feedback to Meta during the last public Triton meeting. I don't think I articulated it very well, though.

Someone from Meta said, we want to do X (might have been, slow-but-functional CPU backend) because it will improve Triton. To which I tried to say, there are an infinite number of things one can improve about Triton, while Triton maintainers only have a finite amount of attention. We therefore need to pick our battles.

What I suggested was, let's pick our battles based on real business needs, rather than "it would be nice if X".

For this PR, I would ask, what is the real business need here? Surely it is not "run the Triton matmul tutorial in less than X seconds." Would the business need be met by making parameters here configurable?

jlebar avatar May 23 '24 19:05 jlebar

The business need here is that Inductor's max-autotune-gemm introduces too much compile time overhead to be usable for us. As part of the efforts to reduce this overhead, we implemented fixes to speed up the actual compiling of Triton templates. Now the blocker is mainly this benchmarking time.

Theoretically we could just turn down warmup/rep, which is already an option in do_bench, but that's not really a fix for us since we'd then be sacrificing too many iterations just to cover the overhead of the cache flush.

In reality, the actual business impact of this could be significantly larger than just the max-autotune-gemm case we are investigating. Anything that autotunes with do_bench would see a speedup; that includes the pointwise/reduction autotuning and any other special-case autotuning.

nmacchioni avatar May 23 '24 19:05 nmacchioni

we'd then be sacrificing too many iterations just to cover the overhead of the cache flush.

How many iterations do you need in order to get sufficient accuracy?

In my experience if one uses CUDA events to do the timing (i.e. as opposed to wall time), and if nothing else is running on the GPU, and if thermal throttling doesn't get in the way, the results should be pretty consistent even after 2 runs. 500 iterations per configuration seems really excessive, no?

jlebar avatar May 23 '24 20:05 jlebar

How many iterations do you need in order to get sufficient accuracy?

I was able to get sufficient accuracy by setting warmup/rep to 5ms/15ms.

if nothing else is running on the GPU, and if thermal throttling doesn't get in the way, the results should be pretty consistent even after 2 runs.

Unfortunately I think those ideal scenarios are not realistic for actual use cases; both are usually a problem. In my opinion 2 runs is very aggressive; somewhere around 25/50 runs, then taking the minimum of those, should be stable enough. But to be clear, I think the only way to know for sure is to try it.

500 iterations per configuration seems really excessive, no?

Definitely, and for a lot of kernels that's what the default warmup/rep would do. I don't think the default settings should be that huge, but I guess that's another issue.

nmacchioni avatar May 23 '24 20:05 nmacchioni

if nothing else is running on the GPU, and if thermal throttling doesn't get in the way

Unfortunately I think those ideal scenarios are not realistic for actual use cases

The problem is, running more times and taking the mean does not necessarily address either of these.

somewhere are 25/50 runs then taking the minimum of those should be stable enough.

I agree that min is much more stable than mean, and that's probably what I'd do if I were starting from scratch. But this biases you towards algorithms which perform better when the GPU is cold, and I believe we have seen cases where that isn't globally ideal. Just adding to the complexity...

I was able to get sufficient accuracy by setting warmup/rep to 5ms/15ms. [...] I don't think the default settings should be that huge

Is autotuning still unacceptably slow for your business needs if you set warmup/rep to 5ms/15ms?

jlebar avatar May 23 '24 21:05 jlebar

Is autotuning still unacceptably slow for your business needs if you set warmup/rep to 5ms/15ms?

We still have some optimization work to make the autotuning viable for our use case. Also, the 5ms/15ms was with the exact flush; I'd expect those numbers would need to be higher if we used the current flush.

I think I'll just put this PR on the back burner for now. On the PyTorch side we can move away from using do_bench and implement our own benchmark, something we're already working on which allows us to batch together autotunings.

nmacchioni avatar May 23 '24 21:05 nmacchioni