jingxu9x issues

Results: 3 issues of jingxu9x

conflict 'noinline' attribute between GNUC compiler and HIP

You may get an error like:

```
In file included from /usr/lib64/gcc/x86_64-pc-linux-gnu/12.1.0/../../../../include/c++/12.1.0/bits/shared_ptr.h:53:
/usr/lib64/gcc/x86_64-pc-linux-gnu/12.1.0/../../../../include/c++/12.1.0/bits/shared_ptr_base.h:196:22: error: use of undeclared identifier 'noinline'; did you mean 'inline'?
    __attribute__((__noinline__))
    ^
/opt/rocm/hip/include/hip/amd_detail/host_defines.h:50:37: note: expanded from macro...
```

[BUG] wrong softmax scale in the local transformer implementation

The DotProductAttention implementation multiplies by the wrong scaling factor. This PR provides a simple fix. Affected code: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/dot_product_attention.py#L67-L81

stale
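
The issue summary does not say which factor is wrong, so the following is only a sketch of the invariant a correct scaling has to satisfy, not the fix from the PR. It assumes the common setup where raw scores are divided by `sqrt(d_k) * coeff` for numerical range and the softmax multiplies by `coeff` again, so the result must match plain `1/sqrt(d_k)` scaling. The names `softmax`, `attention_probs`, and `coeff` are hypothetical and are not taken from Megatron-LM.

```python
import math

def softmax(xs, scale=1.0):
    """Numerically stable softmax of scale * xs."""
    scaled = [scale * x for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def attention_probs(q, keys, coeff=1.0):
    """Attention weights of one query over several keys.

    Scores are divided by sqrt(d_k) * coeff, and the softmax multiplies
    by coeff again, so the net scaling stays 1/sqrt(d_k).
    (coeff is an illustrative extra factor, assumed for this sketch.)
    """
    d_k = len(q)
    norm_factor = math.sqrt(d_k) * coeff
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / norm_factor for k in keys]
    return softmax(scores, scale=coeff)

q = [1.0, 0.0, 2.0, -1.0]
keys = [
    [0.5, 1.0, 0.0, 2.0],
    [1.0, -1.0, 1.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
]

reference = attention_probs(q, keys)            # plain 1/sqrt(d_k) scaling
rescaled = attention_probs(q, keys, coeff=8.0)  # extra factor must cancel out
```

If the two factors do not cancel (for example, the divisor includes `coeff` but the softmax scale does not), `rescaled` would be flatter than `reference`, which is the kind of mismatch a "wrong scaling factor" report usually points to.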

OPTIM get_batch traffic when context-parallel is enabled

We can split the batch along the sequence-length dimension before broadcasting it in tp_group, which saves time in get_batch.

stale
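
To illustrate why slicing first reduces traffic, here is a minimal, framework-free sketch. It assumes a simple contiguous split of the sequence across context-parallel ranks and that every rank in a tp_group shares the same context-parallel rank; a real implementation may partition the sequence differently. `slice_for_cp_rank` and `fake_broadcast` are hypothetical helpers, and `fake_broadcast` only stands in for a real collective by counting the elements sent.

```python
def slice_for_cp_rank(tokens, cp_size, cp_rank):
    """Return the contiguous part of the sequence owned by one CP rank."""
    seq_len = len(tokens)
    assert seq_len % cp_size == 0, "sequence length must divide evenly"
    chunk = seq_len // cp_size
    return tokens[cp_rank * chunk:(cp_rank + 1) * chunk]

def fake_broadcast(payload):
    """Stand-in for a tp_group broadcast: returns a copy and the element count sent."""
    return list(payload), len(payload)

tokens = list(range(16))  # one sequence of 16 token ids
cp_size, cp_rank = 4, 1

# Broadcast the full sequence, then slice on the receiving side.
full, sent_full = fake_broadcast(tokens)
late = slice_for_cp_rank(full, cp_size, cp_rank)

# Slice on the source rank, then broadcast only the needed part.
early, sent_early = fake_broadcast(slice_for_cp_rank(tokens, cp_size, cp_rank))
```

Both orders deliver the same tokens to the rank, but the second sends `1/cp_size` of the elements, which is where the time saving in get_batch would come from under these assumptions.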