jingxu9x
You may get an error like:

```
In file included from /usr/lib64/gcc/x86_64-pc-linux-gnu/12.1.0/../../../../include/c++/12.1.0/bits/shared_ptr.h:53:
/usr/lib64/gcc/x86_64-pc-linux-gnu/12.1.0/../../../../include/c++/12.1.0/bits/shared_ptr_base.h:196:22: error: use of undeclared identifier 'noinline'; did you mean 'inline'?
      __attribute__((__noinline__))
                     ^
/opt/rocm/hip/include/hip/amd_detail/host_defines.h:50:37: note: expanded from macro...
```
The DotProductAttention implementation multiplies by the wrong scaling factor. This PR provides a simple fix: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/dot_product_attention.py#L67-L81
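The description above does not say which factor is wrong, so the following is only a reference sketch of the usual convention, assuming attention scores are multiplied by `1/sqrt(d_k)` before the softmax. The helper names `softmax` and `scaled_dot_product_attention` are hypothetical and are not taken from Megatron-LM.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Plain-Python attention for small inputs (illustrative only).

    q: list of query vectors, each of length d_k
    k: list of key vectors, each of length d_k
    v: list of value vectors, one per key
    """
    d_k = len(q[0])
    # Assumed convention: scale the raw scores by 1/sqrt(d_k).
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out
```

With `d_k = 4`, a query/key dot product of 1 becomes a score of 0.5, so a mistake in the factor shows up directly in the softmax weights.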
We can split the batch along the sequence-length dimension before broadcasting it in tp_group, which saves time in get_batch.
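A minimal sketch of why slicing first helps, assuming a simple contiguous split of the sequence dimension. It does no real communication; `split_sequence`, `payload_size`, `num_parts`, and `part_index` are hypothetical names used for illustration, and the actual split in get_batch may be laid out differently.

```python
def split_sequence(batch, num_parts, part_index):
    """Return the part_index-th contiguous slice of every sequence in batch.

    batch is a list of equal-length token lists (shape [batch][seq_len]).
    """
    seq_len = len(batch[0])
    assert seq_len % num_parts == 0, "seq_len must divide evenly"
    chunk = seq_len // num_parts
    start = part_index * chunk
    return [row[start:start + chunk] for row in batch]


def payload_size(batch):
    """Number of elements that would have to be broadcast."""
    return sum(len(row) for row in batch)


# Two sequences of length 8, split into 4 parts (assumed values).
batch = [list(range(8)), list(range(8, 16))]
local = split_sequence(batch, num_parts=4, part_index=1)

full_elems = payload_size(batch)    # elements sent if broadcasting first
local_elems = payload_size(local)   # elements sent if splitting first
```

In this toy case the slice-first path moves a quarter of the elements, which is the saving the proposal is after.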