zhang662817

Search results: 7 issues by zhang662817

It seems that bidirectional RNNs and multi-layer RNNs aren't supported? Do you have plans to support these?

commit id: 757275f2796bb901575c633e2a32bc76ca84ffec; device arch: Hopper. After changing LayoutA to cutlass::layout::ColumnMajor and LayoutB to cutlass::layout::RowMajor ![image](https://github.com/NVIDIA/cutlass/assets/20987824/6e699c3a-d450-40b8-b405-04e567b60617) the kernel runs the RS kernel; profiling result: ![image](https://github.com/NVIDIA/cutlass/assets/20987824/05a797b2-c332-49c4-8c3c-818aa6140b6e) shows a register spill; change Tile to Shape; no...

question
inactive-30d
inactive-90d

How can general conv fwd/dgrad/wgrad be implemented with CuTe? Could you give examples based on Hopper CuTe?

feature request
help wanted
inactive-30d
inactive-90d
CuTe
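For context on what the three passes in the question compute, here is a hypothetical illustration (deliberately not CuTe- or Hopper-specific, and not the asked-for kernel): a pure-Python 1-D, stride-1, no-padding sketch of fwd, wgrad, and dgrad, showing how each pass is itself a convolution-like contraction.

```python
# Hypothetical sketch (not CuTe): the three convolution passes for a
# 1-D signal x, filter w, stride 1, no padding.

def conv_fwd(x, w):
    # Forward: y[i] = sum_k x[i + k] * w[k]
    K = len(w)
    return [sum(x[i + k] * w[k] for k in range(K))
            for i in range(len(x) - K + 1)]

def conv_wgrad(x, dy, K):
    # Filter gradient: dw[k] = sum_i dy[i] * x[i + k]
    # (a correlation of the input with the upstream gradient)
    return [sum(dy[i] * x[i + k] for i in range(len(dy)))
            for k in range(K)]

def conv_dgrad(dy, w, N):
    # Input gradient: scatter each dy[i] through the filter taps,
    # i.e. dx[i + k] += dy[i] * w[k]  (a transposed convolution)
    K = len(w)
    dx = [0.0] * N
    for i, g in enumerate(dy):
        for k in range(K):
            dx[i + k] += g * w[k]
    return dx

x, w = [1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0]
y = conv_fwd(x, w)               # [-2.0, -2.0]
dy = [1.0, 1.0]                  # upstream gradient
dw = conv_wgrad(x, dy, len(w))   # [3.0, 5.0, 7.0]
dx = conv_dgrad(dy, w, len(x))   # [1.0, 1.0, -1.0, -1.0]
```

All three passes share the same inner contraction structure, which is why an implicit-GEMM formulation (the usual CUTLASS approach) can cover fwd, dgrad, and wgrad with the same machinery.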

### Branch/Tag/Commit main ### Docker Image Version pytorch ### GPU name A100 ### CUDA Driver main ### Reproduced Steps In the support matrix, only bert and encode support Sparsity (after...

bug

**Describe the bug** crash when enabling --tp-comm-overlap in examples/pretrain_gpt_distributed_with_mp.sh ![image](https://github.com/NVIDIA/Megatron-LM/assets/20987824/a7981e59-afbf-4d6a-9d7a-1538d1d94a09) **To Reproduce** ![image](https://github.com/NVIDIA/Megatron-LM/assets/20987824/601fd895-0390-42fb-b5e1-d3e16089235f) **Environment (please complete the following information):** - Megatron-LM commit ID: 9290c730d04b482be8fae92a4186fe4ff0c95270 - PyTorch Docker: nvcr.io/nvidia/pytorch 23.10-py3

https://github.com/NVIDIA/cutlass/blob/c2ad7c5b20f131c4ba33601860f1da3f9c9df0f3/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp#L834 For the sm_pair case, ScaleFactorB in smem should be multicast to the leader and peer CTA tmem, triggered by the leader CTA; ScaleFactorA in the leader CTA's smem should be copied to the leader tmem...

question
? - Needs Triage

The output token offset in the intranode::dispatch kernel is channel_offset + rank_offsets + recv_token_idx. Does that mean the tokens for each expert aren't contiguous? If so, do you have plans to optimize this?...
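To make the layout question concrete, here is a hypothetical sketch: the names channel / rank / recv_token_idx follow the excerpt, but the capacities and strides are invented for illustration and are not the kernel's actual values. It shows why an index of the form channel_offset + rank_offset + recv_token_idx groups received tokens by (channel, rank) slot, so tokens routed to the same expert need not be adjacent in the output buffer.

```python
# Hypothetical sketch: output slots indexed by channel / rank / token,
# with an invented per-(channel, rank) capacity.
num_channels, num_ranks, cap = 2, 2, 4

def output_index(channel, rank, recv_token_idx):
    # Index form from the issue excerpt: channel_offset + rank_offset + idx
    channel_offset = channel * num_ranks * cap
    rank_offset = rank * cap
    return channel_offset + rank_offset + recv_token_idx

# Two tokens destined for the SAME expert, received from different ranks
# on the same channel, land in different (channel, rank) slots:
a = output_index(0, 0, 0)   # first token from rank 0
b = output_index(0, 1, 0)   # first token from rank 1
# a and b are cap slots apart; the slots between them can hold tokens
# for other experts, so per-expert tokens are not contiguous.
```

Under this indexing, gathering one expert's tokens requires a strided/scattered read (or a later compaction pass), which is presumably the cost the question is asking about.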