Nicolas Macchioni
Looks like the change is here: https://github.com/triton-lang/triton/commit/d830823914664303234cce00924c98d72ce1b72c New package needed: "llnl-hatchet" https://github.com/LLNL/hatchet
> would you mind posting what the new output is? Updated the summary.
> It's for debugging, maybe we could add these as log entries? Ideally we centralize the debugging APIs under TORCH_LOGS Generally I would agree, but I think parsing these...
@eellison force merge should be alright here, we know perf is good and I don't see how this could break things.
> > Local testing shows that this is about 70ms faster per-flush on H100. > > > > H100 has 2-3TB/s of memory bandwidth, depending on exactly which version you...
> (A problem with flushing exactly the L2 cache size is that you don't actually know that this clears the cache. For example H100 has two separate L2s, and I...
> Autotuning is really a black art. There are thermal considerations ("warmup" can make the GPU hotter and slow it down!), and the exact numeric values you push through the...
> Does this mean we're zeroing 256 MB 100+400 times? In other words we're touching 256 MB * 500 = 128 GB. At 2 TB/s, this should still take well under 1 second. How...
> 18s for 5TB is still only 277GB/s. AFAICT we're calling `torch.zero` to do the flush. If that runs at 1/10 the memory bandwidth on H100, that seems like a...
Wondering if this makes sense... Flushing 256 MB at 2 TB/s should take roughly 256/(2*1000*1000) s = 0.000128 s. By the same logic, flushing 50 MB (the H100 L2 cache size) would take 0.000025 s. So, trading...
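A quick pure-Python sanity check of the arithmetic above. The bandwidth and buffer sizes are the figures quoted in this thread (2 TB/s H100 bandwidth, 256 MB flush buffer, 50 MB L2), not measurements:

```python
# Figures quoted in the thread (decimal units, matching the comments):
BANDWIDTH_BPS = 2e12   # assumed H100 memory bandwidth, ~2 TB/s
FLUSH_BYTES = 256e6    # 256 MB buffer zeroed per cache flush
L2_BYTES = 50e6        # 50 MB H100 L2 cache size

def flush_time(nbytes, bandwidth=BANDWIDTH_BPS):
    """Ideal time to stream `nbytes` once at full memory bandwidth."""
    return nbytes / bandwidth

print(flush_time(FLUSH_BYTES))  # 0.000128 s per 256 MB flush
print(flush_time(L2_BYTES))     # 2.5e-05 s per 50 MB flush

# Effective bandwidth implied by the "18 s for 5 TB" observation:
print(5e12 / 18 / 1e9)          # ~277.8 GB/s, well below 2 TB/s
```

The gap between the ideal 0.000128 s per flush and the observed effective bandwidth is what the thread is debating: if the flush runs at a fraction of peak memory bandwidth, the total autotuning overhead grows accordingly.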