wdykas
wdykas
It enforces that kernels execute on the GPU in the same order in which the host queued them. It's for GEMM and TP communication overlap: it allows for scheduling the communication kernel...
Is there any solution here without using Ray?
Fixed by changing the indexing to int64 in the kernels.
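For context, a minimal sketch of why 32-bit indexing can break on large tensors. The kernels themselves are not shown here, so the shape and the helper names `flat_index_int32` / `flat_index_int64` are hypothetical; the snippet only emulates signed 32-bit and 64-bit index arithmetic on the host.

```python
import ctypes


def flat_index_int32(row: int, num_cols: int, col: int) -> int:
    """Emulate a flat index computed in signed 32-bit arithmetic (wraps on overflow)."""
    return ctypes.c_int32(row * num_cols + col).value


def flat_index_int64(row: int, num_cols: int, col: int) -> int:
    """Same flat index computed in signed 64-bit arithmetic."""
    return ctypes.c_int64(row * num_cols + col).value


# Hypothetical shape: 70,000 x 40,000 = 2.8e9 elements, more than 2**31 - 1.
rows, cols = 70_000, 40_000

last32 = flat_index_int32(rows - 1, cols, cols - 1)
last64 = flat_index_int64(rows - 1, cols, cols - 1)

print("int32 index of last element:", last32)  # wrapped to a negative value
print("int64 index of last element:", last64)  # the intended offset
```

Once the element count exceeds 2**31 - 1, the 32-bit index wraps to a negative number and would address the wrong memory, whereas the 64-bit index stays correct.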
> Could you please follow the instructions [here](https://github.com/NVIDIA/TransformerEngine/pull/2357/checks?check_run_id=55026790643) to fix the DCO? Thanks!

I think this is done?
/ok to test 2190b222535877c9b9be596b30c2dda27a3e6205
/ok to test [9641c38](https://github.com/NVIDIA/Megatron-LM/pull/2379/commits/9641c38a6a2dde146109bf81ba33d03ede95383b)
/ok to test [12dc7ae](https://github.com/NVIDIA/Megatron-LM/pull/2379/commits/12dc7ae19c7aa512a629046e3d3ae88055e5e5d0)
/ok to test [70272da](https://github.com/NVIDIA/Megatron-LM/pull/2379/commits/70272da9719189d4af7dd9bd176b5ae0eca2e9d3)
/ok to test [cc5b44b](https://github.com/NVIDIA/Megatron-LM/pull/2379/commits/cc5b44bebf2a496ffe206a5a3f408396574a707f)
/ok to test 7254f0f7c0e6baea28be93c3f2fd32b7c2b452a5