Rohan Varma
# [TRTLLM-5273] feat/Use full attention mask if Llama3 is used as encoder and fix EarlyStopDecoder unsqueeze bug

## Description

This PR adds a `bidirectional_attention` flag to `modeling_llama.py`. This is...
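As a rough illustration of the idea (not the actual TensorRT-LLM change), a `bidirectional_attention` flag could select between the usual causal mask and a full attention mask when Llama3 is run as an encoder. Everything in this sketch other than the flag name is hypothetical.

```python
# Minimal sketch: switch between a causal (decoder) mask and a full
# (encoder/bidirectional) mask based on a bidirectional_attention flag.
# Function and variable names here are illustrative, not TRT-LLM APIs.
import torch


def build_attention_mask(seq_len: int, bidirectional_attention: bool) -> torch.Tensor:
    """Return an additive attention mask of shape (seq_len, seq_len)."""
    if bidirectional_attention:
        # Encoder-style: every position may attend to every other position.
        return torch.zeros(seq_len, seq_len)
    # Decoder-style: mask out future positions with -inf above the diagonal.
    mask = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(mask, diagonal=1)


# Example: a 4-token sequence with full (bidirectional) attention is all zeros,
# i.e. no positions are masked out.
print(build_attention_mask(4, bidirectional_attention=True))
```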
Flashinfer seems to be missing the latest MoE comm kernels for multinode-NVLink/GB200. TRTLLM's path is `mnnvl_moe_alltoallv_combine -> torch.ops.trtllm.moe_comm -> moeCommOp -> tensorrt_llm::kernels::moeAllToAll`, see [fusedMoeCommKernels.cu#L1406](https://github.com/NVIDIA/TensorRT-LLM/blob/222bc911cd35405f3539c366da6c03c00e9a7fb7/cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu#L1406). Flashinfer's path is `flashinfer.comm.trtllm_alltoall.mnnvl_moe_alltoallv_combine -> moe_comm`...
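A quick way to compare the two installs is to probe whether each entry point named above is present. This is a hedged diagnostic sketch, not part of either library: the module and op names are taken from the call chains in this note, and it assumes that importing `tensorrt_llm` registers the `torch.ops.trtllm` custom ops.

```python
# Probe whether each library exposes the MNNVL MoE all-to-all combine path.
# Names come from the call chains above; availability depends on the builds installed.
import importlib


def has_trtllm_moe_comm() -> bool:
    """True if torch.ops.trtllm.moe_comm is registered (TRT-LLM path)."""
    try:
        import torch
        import tensorrt_llm  # noqa: F401  (assumed to register trtllm custom ops)
        return hasattr(torch.ops.trtllm, "moe_comm")
    except Exception:
        return False


def has_flashinfer_mnnvl_combine() -> bool:
    """True if flashinfer ships mnnvl_moe_alltoallv_combine."""
    try:
        mod = importlib.import_module("flashinfer.comm.trtllm_alltoall")
        return hasattr(mod, "mnnvl_moe_alltoallv_combine")
    except Exception:
        return False


if __name__ == "__main__":
    print("TRT-LLM moe_comm op:", has_trtllm_moe_comm())
    print("flashinfer mnnvl combine:", has_flashinfer_mnnvl_combine())
```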
### Overview

The current Workload API in KEP-4671 effectively supports intra-PodGroup gang scheduling: all pods within a PodGroup (or PodGroup replica) can be required to schedule together. However, the...