zhang662817
It seems that bidirectional RNNs and multi-layer RNNs can't be supported? Do you have plans to support these?
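For context on what the question is asking for, both variants can in principle be composed from unidirectional single-layer passes: a bidirectional layer runs one pass forward and one over the reversed sequence and concatenates the per-step outputs, and a multi-layer RNN feeds one layer's outputs into the next. A minimal CPU sketch in C++ (a toy scalar tanh cell, all names hypothetical, not any library's API):

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical minimal tanh-RNN cell with scalar weights: h' = tanh(w*x + u*h).
// Purely illustrative; real cells use weight matrices.
struct RnnCell {
    float w = 0.5f, u = 0.3f;
    float step(float x, float h) const { return std::tanh(w * x + u * h); }
};

// One unidirectional pass over the sequence, returning all hidden states.
std::vector<float> run(const RnnCell& cell, const std::vector<float>& xs) {
    std::vector<float> hs;
    float h = 0.0f;
    for (float x : xs) hs.push_back(h = cell.step(x, h));
    return hs;
}

// Bidirectional layer: a forward pass plus a pass over the reversed input,
// with the backward outputs re-reversed so both align per time step.
std::pair<std::vector<float>, std::vector<float>>
bidirectional(const RnnCell& fwd, const RnnCell& bwd, const std::vector<float>& xs) {
    std::vector<float> rev(xs.rbegin(), xs.rend());
    auto hf = run(fwd, xs);
    auto hb = run(bwd, rev);
    std::reverse(hb.begin(), hb.end());
    return {hf, hb};
}

int main() {
    std::vector<float> xs{0.1f, 0.2f, 0.3f};
    auto [hf, hb] = bidirectional(RnnCell{}, RnnCell{}, xs);
    // Multi-layer stacking: feed the merged outputs into the next layer
    // (toy sum merge here; real implementations concatenate the two halves).
    std::vector<float> next_in(xs.size());
    for (size_t t = 0; t < xs.size(); ++t) next_in[t] = hf[t] + hb[t];
    auto h2 = run(RnnCell{}, next_in);
    (void)h2;
    return 0;
}
```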
Commit ID: 757275f2796bb901575c633e2a32bc76ca84ffec; device arch: Hopper. Changing LayoutA to cutlass::layout::ColumnMajor and LayoutB to cutlass::layout::RowMajor makes the kernel run as an RS kernel; profiling shows register spilling. Changing Tile to Shape; no...
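For reference, the layout change being described corresponds to the `LayoutA`/`LayoutB` arguments of the CUTLASS 3.x `CollectiveBuilder`; a minimal sketch of that configuration point is below. Element types, alignments, and tile/cluster shapes are illustrative assumptions, not the reporter's actual kernel, and whether the builder selects an RS (register-sourced A) mainloop for a non-K-major A depends on the CUTLASS version and schedule:

```cpp
#include "cutlass/cutlass.h"
#include "cutlass/numeric_types.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/gemm/collective/collective_builder.hpp"

// Illustrative element types and shapes (assumptions, not from the report).
using ElementA = cutlass::half_t;
using ElementB = cutlass::half_t;
using LayoutA  = cutlass::layout::ColumnMajor;  // the change the report describes
using LayoutB  = cutlass::layout::RowMajor;     // likewise
using TileShape    = cute::Shape<cute::_128, cute::_128, cute::_64>;
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;

// With A not K-major, the Hopper builder can fall back to an RS mainloop
// where the A operand is fed to WGMMA from registers instead of smem.
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    ElementA, LayoutA, 8,   // alignment 8 elements = 128 bits for half_t
    ElementB, LayoutB, 8,
    float,                  // accumulator
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto
>::CollectiveOp;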
How can general conv fwd/dgrad/wgrad be implemented with CuTe? Could you give examples based on Hopper CuTe?
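As background for the question: CUTLASS expresses convolution as implicit GEMM, and dgrad/wgrad are transposed forms of the same product. A plain CPU sketch of the im2col-style index mapping that fprop reduces to (not CuTe code; no padding, NCHW layout, and all names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Implicit-GEMM view of conv fprop:
//   y[n][k][p][q] = sum over (c, r, s) of x[n][c][p*stride+r][q*stride+s] * w[k][c][r][s]
// As a GEMM: gemm_m = N*P*Q (output pixels), gemm_n = K (filters),
// gemm_k = C*R*S (reduction). dgrad/wgrad permute which tensor plays A/B/C.
struct ConvShape { int N, C, H, W, K, R, S, stride; };

std::vector<float> conv2d_fprop(const ConvShape& cs,
                                const std::vector<float>& x,   // N*C*H*W
                                const std::vector<float>& w) { // K*C*R*S
    int P = (cs.H - cs.R) / cs.stride + 1;
    int Q = (cs.W - cs.S) / cs.stride + 1;
    std::vector<float> y(size_t(cs.N) * cs.K * P * Q, 0.0f);
    for (int m = 0; m < cs.N * P * Q; ++m) {            // gemm_m
        int n = m / (P * Q), p = (m / Q) % P, q = m % Q;
        for (int k = 0; k < cs.K; ++k) {                // gemm_n
            float acc = 0.0f;
            for (int kk = 0; kk < cs.C * cs.R * cs.S; ++kk) {  // gemm_k
                int c = kk / (cs.R * cs.S), r = (kk / cs.S) % cs.R, s = kk % cs.S;
                int h = p * cs.stride + r, wcol = q * cs.stride + s;
                acc += x[((size_t(n) * cs.C + c) * cs.H + h) * cs.W + wcol]
                     * w[((size_t(k) * cs.C + c) * cs.R + r) * cs.S + s];
            }
            y[((size_t(n) * cs.K + k) * P + p) * Q + q] = acc;
        }
    }
    return y;
}
```

A CuTe kernel would realize the same mapping by building the im2col tensor view with layouts rather than materializing it, then running a normal GEMM mainloop over that view.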
### Branch/Tag/Commit
main
### Docker Image Version
pytorch
### GPU name
A100
### CUDA Driver
main
### Reproduced Steps
```shell
In the support matrix, only bert and encoder support Sparsity (after...
```
**Describe the bug**
Crash when enabling --tp-comm-overlap in examples/pretrain_gpt_distributed_with_mp.sh.

**To Reproduce**

**Environment (please complete the following information):**
- Megatron-LM commit ID: 9290c730d04b482be8fae92a4186fe4ff0c95270
- PyTorch Docker: nvcr.io/nvidia/pytorch:23.10-py3
https://github.com/NVIDIA/cutlass/blob/c2ad7c5b20f131c4ba33601860f1da3f9c9df0f3/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp#L834 For the sm_pair case, ScaleFactorB in smem should be multicast to both the leader and peer CTAs' tmem, triggered by the leader CTA; ScaleFactorA in the leader CTA's smem should be copied to the leader's tmem...
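To make the proposed data flow concrete, here is a pseudocode-style C++ sketch of the leader/peer predication being suggested. The copy functions are hypothetical stubs, not real CUTLASS or PTX APIs; only the control flow (which CTA triggers which copy) is the point:

```cpp
// HYPOTHETICAL stubs standing in for the actual smem->tmem copy instructions.
void multicast_smem_to_pair_tmem(const void* smem_sfb) { /* placeholder */ }
void copy_smem_to_local_tmem(const void* smem_sf)      { /* placeholder */ }

// Staging of scale factors for the 2-CTA (sm_pair) case, as described above.
void stage_scale_factors(bool is_leader_cta,
                         const void* smem_sfa, const void* smem_sfb) {
    if (is_leader_cta) {
        // ScaleFactorB: the leader CTA triggers one multicast from its smem
        // into both the leader's and the peer's tmem.
        multicast_smem_to_pair_tmem(smem_sfb);
        // ScaleFactorA in the leader CTA's smem goes only to the leader's tmem.
        copy_smem_to_local_tmem(smem_sfa);
    }
    // The original report is truncated here, so the peer CTA's handling of
    // its own ScaleFactorA is not shown.
}
```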
In the intranode::dispatch kernel, the output token offset is channel_offset + rank_offsets + recv_token_idx. So the tokens for each expert aren't contiguous? If so, do you have plans to optimize this?...
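To illustrate the offset arithmetic in the question, a toy C++ example of where a received token would land under assumed meanings of the three terms (a per-channel base, a per-source-rank base within the channel, and the received token's index; these semantics are inferred from the snippet, not taken from the DeepEP source):

```cpp
#include <cstdio>

// Toy illustration of: output_offset = channel_offset + rank_offset + recv_token_idx.
// The capacities below are ASSUMPTIONS for the sake of the arithmetic.
int main() {
    const int tokens_per_channel = 256;  // assumed capacity per channel
    const int tokens_per_rank    = 64;   // assumed capacity per source rank

    int channel = 1, src_rank = 2, recv_token_idx = 5;
    int channel_offset = channel * tokens_per_channel;
    int rank_offset    = src_rank * tokens_per_rank;
    int out = channel_offset + rank_offset + recv_token_idx;
    std::printf("output token slot = %d\n", out);  // 256 + 128 + 5 = 389

    // Because slots are keyed by (channel, source rank) rather than by expert,
    // tokens bound for the same expert scatter across the buffer, i.e. they
    // are not contiguous per expert, which is what the question observes.
    return 0;
}
```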