Saaketh Narayan

Results: 51 comments by Saaketh Narayan

With the latest triton nightly, I'm also running into this issue when casting bf16 inputs to fp8 right before `tl.dot`. I'm setting `AB_DTYPE` to `tl.float8e4nv` before calling the matmul kernel...
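For context, here's a minimal sketch of the pattern I mean, with the cast to fp8 immediately before `tl.dot`. The kernel name and block sizes are illustrative, and bounds masking is omitted for brevity (so it assumes M, N, K are multiples of the block sizes):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_fp8_cast_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
    AB_DTYPE: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)  # bf16 tile of A
        b = tl.load(b_ptrs)  # bf16 tile of B
        # Cast bf16 -> fp8 right before the dot, mirroring the AB_DTYPE path.
        a = a.to(AB_DTYPE)
        b = b.to(AB_DTYPE)
        acc += tl.dot(a, b)  # fp8 x fp8 inputs, fp32 accumulator
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.bfloat16))
```

At launch, the dtype is passed as a constexpr argument, e.g. `AB_DTYPE=tl.float8e4nv`, which is what I was doing when I hit this.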

Knew I was forgetting something :) Updated the description above!

Do you think this may be due to using RoCE?

Here's the output of `ucx_info -v`:

```
# Library version: 1.15.0
# Library path: /opt/hpcx/ucx/lib/libucs.so.0
# API headers version: 1.15.0
# Git branch '', revision bf8f1b6
# Configured with: --disable-logging...
```

Hey @ThomasRaoux, I was profiling the performance of triton fp8 gemm as well and came across this issue. I'm still observing the same performance degradation as above when using `fp8_fast_accum=False`...
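For anyone else hitting this, here's a rough sketch of what that flag typically toggles inside the K loop of a matmul kernel like the one sketched above (the `FP8_FAST_ACCUM` constexpr name is illustrative, not the exact one in the library):

```python
# Inside the K loop, with `acc` a float32 accumulator tile:
if FP8_FAST_ACCUM:
    # Fused path: the MMA accumulates directly into `acc`, so the
    # accumulation stays in the tensor-core pipeline.
    acc = tl.dot(a, b, acc)
else:
    # Unfused path: each tile product is materialized and then added
    # to `acc` with separate fp32 adds, which can cost throughput.
    acc += tl.dot(a, b)
```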

Great, thanks. I'll try to take a look. As for the comparisons, at least with `te_gemm`, the accumulation types were the same. I'm a bit new to triton -- it...

Sorry, I meant that the accumulation precision was the same for both `te_gemm` and the triton matmul kernel in my benchmarking. Will take a look at that example, thanks!

Hey @siddk, there currently isn't a per-stream processing function, but it's something we can add in the future!

Hey @jasonkrone, thanks for submitting this PR! So, two things: 1. We're trying to reduce our dependency on torch.distributed because it can, at times, cause some messy issues with more...

Ah, I see @knighton already discussed much of this with you on the community Slack.