zhang662817
It seems that bidirectional RNNs and multi-layer RNNs can't be supported? Do you have plans to support these?
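For context on what the question is asking for, both variants can in principle be composed from unidirectional single-layer passes: a bidirectional layer runs one pass forward and one over the reversed sequence and concatenates the per-step outputs, and a multi-layer RNN feeds one layer's outputs into the next. A minimal CPU sketch in C++ (a toy scalar tanh cell, all names hypothetical, not any library's API):

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical minimal tanh-RNN cell with scalar weights: h' = tanh(w*x + u*h).
// Purely illustrative; real cells use weight matrices.
struct RnnCell {
    float w = 0.5f, u = 0.3f;
    float step(float x, float h) const { return std::tanh(w * x + u * h); }
};

// One unidirectional pass over the sequence, returning all hidden states.
std::vector<float> run(const RnnCell& cell, const std::vector<float>& xs) {
    std::vector<float> hs;
    float h = 0.0f;
    for (float x : xs) hs.push_back(h = cell.step(x, h));
    return hs;
}

// Bidirectional layer: a forward pass plus a pass over the reversed input,
// with the backward outputs re-reversed so both align per time step.
std::pair<std::vector<float>, std::vector<float>>
bidirectional(const RnnCell& fwd, const RnnCell& bwd, const std::vector<float>& xs) {
    std::vector<float> rev(xs.rbegin(), xs.rend());
    auto hf = run(fwd, xs);
    auto hb = run(bwd, rev);
    std::reverse(hb.begin(), hb.end());
    return {hf, hb};
}

int main() {
    std::vector<float> xs{0.1f, 0.2f, 0.3f};
    auto [hf, hb] = bidirectional(RnnCell{}, RnnCell{}, xs);
    // Multi-layer stacking: feed the merged outputs into the next layer
    // (toy sum merge here; real implementations concatenate the two halves).
    std::vector<float> next_in(xs.size());
    for (size_t t = 0; t < xs.size(); ++t) next_in[t] = hf[t] + hb[t];
    auto h2 = run(RnnCell{}, next_in);
    (void)h2;
    return 0;
}
```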
Commit ID: 757275f2796bb901575c633e2a32bc76ca84ffec; device arch: Hopper. Changing LayoutA to cutlass::layout::ColumnMajor and LayoutB to cutlass::layout::RowMajor makes the kernel run as an RS kernel; profiling shows register spilling. Changing Tile to Shape; no...
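For reference, the layout change being described corresponds to the `LayoutA`/`LayoutB` arguments of the CUTLASS 3.x `CollectiveBuilder`; a minimal sketch of that configuration point is below. Element types, alignments, and tile/cluster shapes are illustrative assumptions, not the reporter's actual kernel, and whether the builder selects an RS (register-sourced A) mainloop for a non-K-major A depends on the CUTLASS version and schedule:

```cpp
#include "cutlass/cutlass.h"
#include "cutlass/numeric_types.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/gemm/collective/collective_builder.hpp"

// Illustrative element types and shapes (assumptions, not from the report).
using ElementA = cutlass::half_t;
using ElementB = cutlass::half_t;
using LayoutA  = cutlass::layout::ColumnMajor;  // the change the report describes
using LayoutB  = cutlass::layout::RowMajor;     // likewise
using TileShape    = cute::Shape<cute::_128, cute::_128, cute::_64>;
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;

// With A not K-major, the Hopper builder can fall back to an RS mainloop
// where the A operand is fed to WGMMA from registers instead of smem.
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    ElementA, LayoutA, 8,   // alignment 8 elements = 128 bits for half_t
    ElementB, LayoutB, 8,
    float,                  // accumulator
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto
>::CollectiveOp;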
How can general conv fwd/dgrad/wgrad be implemented with CuTe? Could you give examples based on Hopper CuTe?
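As background for the question: CUTLASS expresses convolution as implicit GEMM, and dgrad/wgrad are transposed forms of the same product. A plain CPU sketch of the im2col-style index mapping that fprop reduces to (not CuTe code; no padding, NCHW layout, and all names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Implicit-GEMM view of conv fprop:
//   y[n][k][p][q] = sum over (c, r, s) of x[n][c][p*stride+r][q*stride+s] * w[k][c][r][s]
// As a GEMM: gemm_m = N*P*Q (output pixels), gemm_n = K (filters),
// gemm_k = C*R*S (reduction). dgrad/wgrad permute which tensor plays A/B/C.
struct ConvShape { int N, C, H, W, K, R, S, stride; };

std::vector<float> conv2d_fprop(const ConvShape& cs,
                                const std::vector<float>& x,   // N*C*H*W
                                const std::vector<float>& w) { // K*C*R*S
    int P = (cs.H - cs.R) / cs.stride + 1;
    int Q = (cs.W - cs.S) / cs.stride + 1;
    std::vector<float> y(size_t(cs.N) * cs.K * P * Q, 0.0f);
    for (int m = 0; m < cs.N * P * Q; ++m) {            // gemm_m
        int n = m / (P * Q), p = (m / Q) % P, q = m % Q;
        for (int k = 0; k < cs.K; ++k) {                // gemm_n
            float acc = 0.0f;
            for (int kk = 0; kk < cs.C * cs.R * cs.S; ++kk) {  // gemm_k
                int c = kk / (cs.R * cs.S), r = (kk / cs.S) % cs.R, s = kk % cs.S;
                int h = p * cs.stride + r, wcol = q * cs.stride + s;
                acc += x[((size_t(n) * cs.C + c) * cs.H + h) * cs.W + wcol]
                     * w[((size_t(k) * cs.C + c) * cs.R + r) * cs.S + s];
            }
            y[((size_t(n) * cs.K + k) * P + p) * Q + q] = acc;
        }
    }
    return y;
}
```

A CuTe kernel would realize the same mapping by building the im2col tensor view with layouts rather than materializing it, then running a normal GEMM mainloop over that view.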
### Branch/Tag/Commit
main
### Docker Image Version
pytorch
### GPU name
A100
### CUDA Driver
main
### Reproduced Steps
```shell
In the support matrix, only bert and encoder support Sparsity (after...
```
**Describe the bug**
Crash when enabling --tp-comm-overlap in examples/pretrain_gpt_distributed_with_mp.sh.

**To Reproduce**

**Environment (please complete the following information):**
- Megatron-LM commit ID: 9290c730d04b482be8fae92a4186fe4ff0c95270
- PyTorch Docker: nvcr.io/nvidia/pytorch:23.10-py3
https://github.com/NVIDIA/cutlass/blob/c2ad7c5b20f131c4ba33601860f1da3f9c9df0f3/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp#L834 For the sm_pair case, ScaleFactorB in smem should be multicast to both the leader and peer CTAs' tmem, triggered by the leader CTA; ScaleFactorA in the leader CTA's smem should be copied to the leader's tmem...
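To make the proposed data flow concrete, here is a pseudocode-style C++ sketch of the leader/peer predication being suggested. The copy functions are hypothetical stubs, not real CUTLASS or PTX APIs; only the control flow (which CTA triggers which copy) is the point:

```cpp
// HYPOTHETICAL stubs standing in for the actual smem->tmem copy instructions.
void multicast_smem_to_pair_tmem(const void* smem_sfb) { /* placeholder */ }
void copy_smem_to_local_tmem(const void* smem_sf)      { /* placeholder */ }

// Staging of scale factors for the 2-CTA (sm_pair) case, as described above.
void stage_scale_factors(bool is_leader_cta,
                         const void* smem_sfa, const void* smem_sfb) {
    if (is_leader_cta) {
        // ScaleFactorB: the leader CTA triggers one multicast from its smem
        // into both the leader's and the peer's tmem.
        multicast_smem_to_pair_tmem(smem_sfb);
        // ScaleFactorA in the leader CTA's smem goes only to the leader's tmem.
        copy_smem_to_local_tmem(smem_sfa);
    }
    // The original report is truncated here, so the peer CTA's handling of
    // its own ScaleFactorA is not shown.
}
```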
In the intranode::dispatch kernel, the output token offset is channel_offset + rank_offsets + recv_token_idx. So the tokens for each expert aren't contiguous? If so, do you have plans to optimize this?...
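To illustrate the offset arithmetic in the question, a toy C++ example of where a received token would land under assumed meanings of the three terms (a per-channel base, a per-source-rank base within the channel, and the received token's index; these semantics are inferred from the snippet, not taken from the DeepEP source):

```cpp
#include <cstdio>

// Toy illustration of: output_offset = channel_offset + rank_offset + recv_token_idx.
// The capacities below are ASSUMPTIONS for the sake of the arithmetic.
int main() {
    const int tokens_per_channel = 256;  // assumed capacity per channel
    const int tokens_per_rank    = 64;   // assumed capacity per source rank

    int channel = 1, src_rank = 2, recv_token_idx = 5;
    int channel_offset = channel * tokens_per_channel;
    int rank_offset    = src_rank * tokens_per_rank;
    int out = channel_offset + rank_offset + recv_token_idx;
    std::printf("output token slot = %d\n", out);  // 256 + 128 + 5 = 389

    // Because slots are keyed by (channel, source rank) rather than by expert,
    // tokens bound for the same expert scatter across the buffer, i.e. they
    // are not contiguous per expert, which is what the question observes.
    return 0;
}
```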