Syed Tousif Ahmed comments

Results: 26 comments of Syed Tousif Ahmed

[MXFP8] unable to run titan llama3 debug model with mxfp8. Assertion: n_rows % max_row_tile_size == 0

Verified that @vkuzo's comment is correct. I'm able to run the debug model if I filter the output layer: ``` NGPU=4 ./run_train.sh --model.print_after_conversion --training.compile --training.steps 50 --model.converters mx --mx.recipe_name "mxfp8"...

Benchmark SymmMem's all_to_all_vdev_2d on NVL72

Thanks @tianyu-l! I'll try to take a look at your PR.

NCCL kernels take longer when composing CUDAGraph with SimpleFSDP

@BoyuanFeng Do you by any chance have some environment information for this regression - machine config, NCCL version etc.?

NVFP4 MoE Training Status

CC: @slayton58 @ngimel @supriyar @Priyadlfw @ptrblck @eqy Please feel free to add anything missing or suggest updates.

NVFP4 MoE Training Status

Thanks @slayton58. My bad, cuBLAS plans to support Grouped NVFP4 GEMM in a future CUDA release. I've updated the text.

NVFP4 MoE Training Status

Added this issue for NVFP4 CuteDSL specific discussion: https://github.com/pytorch/pytorch/issues/166611. The NVFP4 GEMM is linked there. And that is indeed the grouped scaled mm example. Let's follow up on that thread what's...