Nicolas Macchioni

27 comments by Nicolas Macchioni

Looks like the change is here: https://github.com/triton-lang/triton/commit/d830823914664303234cce00924c98d72ce1b72c. A new package is needed: `llnl-hatchet` (https://github.com/LLNL/hatchet).

> would you mind posting what the new output is?

Updated the summary.

> It's for debugging, maybe we could add these as log entries? Ideally we centralize the debugging APIs under TORCH_LOGS.

Generally I would agree, but I think parsing these...

@eellison a force merge should be alright here; we know perf is good and I don't see how this could break things.

> > Local testing shows that this is about 70ms faster per-flush on H100.
>
> H100 has 2-3TB/s of memory bandwidth, depending on exactly which version you...

> (A problem with flushing exactly the L2 cache size is that you don't actually know that this clears the cache. For example H100 has two separate L2s, and I...

> Autotuning is really a black art. There are thermal considerations ("warmup" can make the GPU hotter and slow it down!), and the exact numeric values you push through the...

> Does this mean we're zeroing 256MB 100+400 times? In other words we're touching 256MB * 500 = 128GB. At 2TB/s, this should still take well under 1 second. How...
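The estimate in the quote above can be checked in a few lines (a sketch; the 500-run count and the 2TB/s peak-bandwidth figure are the thread's assumptions, not measured values):

```python
# Check the thread's estimate: zeroing a 256 MB flush buffer ~500 times
# (100 warmup + 400 measured runs, per the quoted comment).
flush_bytes = 256e6               # 256 MB flush buffer
runs = 500                        # 100 + 400 invocations

total = flush_bytes * runs        # total bytes touched
print(f"total touched: {total / 1e9:.0f} GB")   # 128 GB

bw = 2e12                         # assumed 2 TB/s H100 peak bandwidth
print(f"time at peak: {total / bw:.3f} s")      # 0.064 s, well under 1 s
```

At peak bandwidth this is indeed well under a second, which is what makes the reported multi-second flush overhead surprising.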

> 18s for 5TB is still only 277GB/s. AFAICT we're calling `torch.zero` to do the flush. If that runs at 1/10 the memory bandwidth on H100, that seems like a...
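A quick sanity check of the effective-bandwidth figure in the quote above (a sketch; the 18 s and 5 TB numbers come from the thread, and 2 TB/s is the assumed H100 peak):

```python
# Effective bandwidth implied by the thread's numbers:
# ~5 TB of memory touched in ~18 s of flushing.
total_bytes = 5e12                               # ~5 TB (from the thread)
elapsed_s = 18                                   # reported wall time

effective_bw = total_bytes / elapsed_s           # bytes per second
print(f"effective: {effective_bw / 1e9:.0f} GB/s")          # 278 GB/s

# Against an assumed 2 TB/s peak, the flush runs at a small
# fraction of the hardware's memory bandwidth.
fraction_of_peak = effective_bw / 2e12
print(f"fraction of peak: {fraction_of_peak:.2%}")          # 13.89%
```

Running roughly an order of magnitude below peak is consistent with the comment's suspicion that something about the `torch.zero` flush path is slow.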

Wondering if this makes sense... Flushing 256MB at 2TB/s should take roughly 256/(2*1000*1000) s = 0.000128 s. By the same logic, flushing 50MB (the H100 L2 cache size) would take 0.000025 s. So, trading...
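The back-of-the-envelope comparison above can be reproduced directly (a sketch; the 2 TB/s bandwidth and ~50 MB L2 size are the figures assumed in the thread):

```python
BW = 2e12            # assumed 2 TB/s memory bandwidth
full = 256e6         # current 256 MB flush buffer
l2 = 50e6            # ~50 MB, roughly the H100 L2 cache size

t_full = full / BW   # seconds per 256 MB flush
t_l2 = l2 / BW       # seconds per 50 MB flush

print(f"256 MB flush: {t_full * 1e6:.0f} us")              # 128 us
print(f" 50 MB flush: {t_l2 * 1e6:.0f} us")                # 25 us
print(f"savings per flush: {(t_full - t_l2) * 1e6:.0f} us")  # 103 us
```

On paper the smaller buffer saves about 100 microseconds per flush, which is the tradeoff the comment is weighing against the risk (raised earlier in the thread) that a buffer sized exactly to one L2 may not reliably clear both of H100's L2 partitions.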