Nicolas Macchioni
Looks like the change is here: https://github.com/triton-lang/triton/commit/d830823914664303234cce00924c98d72ce1b72c New package needed: "llnl-hatchet" https://github.com/LLNL/hatchet
> would you mind posting what the new output is? Updated the summary.
> It's for debugging, maybe we could add these as log entries? Ideally we centralize the debugging APIs under TORCH_LOGS Generally I would agree, but I think parsing these...
@eellison force merge should be alright here, we know perf is good and I don't see how this could break things.
> > Local testing shows that this is about 70ms faster per-flush on H100. > > > > H100 has 2-3TB/s of memory bandwidth, depending on exactly which version you...
> (A problem with flushing exactly the L2 cache size is that you don't actually know that this clears the cache. For example H100 has two separate L2s, and I...
> Autotuning is really a black art. There are thermal considerations ("warmup" can make the GPU hotter and slow it down!), and the exact numeric values you push through the...
> Does this mean we're zeroing 256 MB 100+400 times? In other words we're touching 256 MB * 500 = 128 GB. At 2 TB/s, this should still take well under 1 second. How...
> 18s for 5TB is still only 277GB/s. AFAICT we're calling `torch.zero` to do the flush. If that runs at 1/10 the memory bandwidth on H100, that seems like a...
Wondering if this makes sense... Flushing 256 MB at 2 TB/s should take roughly 256/(2*1000*1000) s = 0.000128 s. By the same logic, flushing 50 MB (the H100 L2 cache size) would take 0.000025 s. So, trading...
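A quick pure-Python sanity check of the arithmetic above. The bandwidth and buffer sizes are the figures quoted in this thread (2 TB/s H100 bandwidth, 256 MB flush buffer, 50 MB L2), not measurements:

```python
# Figures quoted in the thread (decimal units, matching the comments):
BANDWIDTH_BPS = 2e12   # assumed H100 memory bandwidth, ~2 TB/s
FLUSH_BYTES = 256e6    # 256 MB buffer zeroed per cache flush
L2_BYTES = 50e6        # 50 MB H100 L2 cache size

def flush_time(nbytes, bandwidth=BANDWIDTH_BPS):
    """Ideal time to stream `nbytes` once at full memory bandwidth."""
    return nbytes / bandwidth

print(flush_time(FLUSH_BYTES))  # 0.000128 s per 256 MB flush
print(flush_time(L2_BYTES))     # 2.5e-05 s per 50 MB flush

# Effective bandwidth implied by the "18 s for 5 TB" observation:
print(5e12 / 18 / 1e9)          # ~277.8 GB/s, well below 2 TB/s
```

The gap between the ideal 0.000128 s per flush and the observed effective bandwidth is what the thread is debating: if the flush runs at a fraction of peak memory bandwidth, the total autotuning overhead grows accordingly.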