Syed Tousif Ahmed

Results: 26 comments by Syed Tousif Ahmed

Verified that @vkuzo's comment is correct. I'm able to run the debug model if I filter the output layer: ``` NGPU=4 ./run_train.sh --model.print_after_conversion --training.compile --training.steps 50 --model.converters mx --mx.recipe_name "mxfp8"...

Thanks @tianyu-l! I'll try to take a look at your PR.

@BoyuanFeng Do you by any chance have environment information for this regression (machine config, NCCL version, etc.)?

CC: @slayton58 @ngimel @supriyar @Priyadlfw @ptrblck @eqy Please feel free to add anything missing or suggest updates.

Thanks @slayton58. My bad, cuBLAS plans to support Grouped NVFP4 GEMM in a future CUDA release. I've updated the text.

Added this issue for NVFP4 CuteDSL-specific discussion: https://github.com/pytorch/pytorch/issues/166611. The nvfp4 gemm is linked there. And that is indeed the grouped scaled mm example. Let's follow up on that thread what's...