Saaketh Narayan

Results: 51 comments by Saaketh Narayan

With the latest triton nightly, I'm also running into this issue when casting bf16 inputs to fp8 right before `tl.dot`. I'm setting `AB_DTYPE` to `tl.float8e4nv` before calling the matmul kernel...
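For context, here's a minimal sketch of the pattern I mean, with the cast to fp8 immediately before `tl.dot`. The kernel name and block sizes are illustrative, and bounds masking is omitted for brevity (so it assumes M, N, K are multiples of the block sizes):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_fp8_cast_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
    AB_DTYPE: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)  # bf16 tile of A
        b = tl.load(b_ptrs)  # bf16 tile of B
        # Cast bf16 -> fp8 right before the dot, mirroring the AB_DTYPE path.
        a = a.to(AB_DTYPE)
        b = b.to(AB_DTYPE)
        acc += tl.dot(a, b)  # fp8 x fp8 inputs, fp32 accumulator
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.bfloat16))
```

At launch, the dtype is passed as a constexpr argument, e.g. `AB_DTYPE=tl.float8e4nv`, which is what I was doing when I hit this.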

Knew I was forgetting something :) Updated the description above!

Do you think this may be due to using RoCE?

Here's the output of `ucx_info -v`:

```
# Library version: 1.15.0
# Library path: /opt/hpcx/ucx/lib/libucs.so.0
# API headers version: 1.15.0
# Git branch '', revision bf8f1b6
# Configured with: --disable-logging...
```

Hey @ThomasRaoux, I was profiling the performance of triton fp8 gemm as well and came across this issue. I'm still observing the same performance degradation as above when using `fp8_fast_accum=False`...
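For anyone else hitting this, here's a rough sketch of what that flag typically toggles inside the K loop of a matmul kernel like the one sketched above (the `FP8_FAST_ACCUM` constexpr name is illustrative, not the exact one in the library):

```python
# Inside the K loop, with `acc` a float32 accumulator tile:
if FP8_FAST_ACCUM:
    # Fused path: the MMA accumulates directly into `acc`, so the
    # accumulation stays in the tensor-core pipeline.
    acc = tl.dot(a, b, acc)
else:
    # Unfused path: each tile product is materialized and then added
    # to `acc` with separate fp32 adds, which can cost throughput.
    acc += tl.dot(a, b)
```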

Great, thanks. I'll try to take a look. As for the comparisons, at least with `te_gemm`, the accumulation types were the same. I'm a bit new to triton -- it...

Sorry, I meant that the accumulation precision was the same for both `te_gemm` and the triton matmul kernel in my benchmarking. Will take a look at that example, thanks!

Hey @siddk, there currently isn't a per-stream processing function, but it's something we can add in the future!

Hey @jasonkrone, thanks for submitting this PR! So, two things: 1. We're trying to reduce our dependency on torch.distributed because it can, at times, cause some messy issues with more...

Ah, I see @knighton already discussed much of this with you on the community Slack.