Haitian Jiang


Same issue here with TE 2.1.0, torch 2.5.1+cu124, CUDA 12.4, cuDNN 9.8.0. TE 1.13.0 works fine in my environment.

Setting the environment variable `NVTE_BATCH_MHA_P2P_COMM` to 1 makes this error go away. See the Transformer Engine code here: https://github.com/NVIDIA/TransformerEngine/blob/303c6d16203b3cb01675f7adb7c21956f140e0ee/transformer_engine/pytorch/attention.py#L1869
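A minimal sketch of the workaround: the variable must be exported before the process that imports Transformer Engine starts, so that it is visible when the attention module reads it. (Only the variable name comes from the comment above; the launch command is a placeholder for your own script.)

```shell
# Force the P2P batched-MHA communication path (variable name taken from
# the linked Transformer Engine source).
export NVTE_BATCH_MHA_P2P_COMM=1

# Confirm the value is visible to child processes before launching training:
python -c 'import os; print(os.environ["NVTE_BATCH_MHA_P2P_COMM"])'
```

Alternatively, prefix a single run without polluting the shell session: `NVTE_BATCH_MHA_P2P_COMM=1 python your_script.py` (script name is a placeholder).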