perf: Use pinned H2D to reduce bubbles
In some cases, a pageable host-to-device (H2D) copy is followed by a cudaStreamSynchronize call, which blocks the CPU from launching subsequent kernels and leaves bubbles on the GPU. Changing these copies from pageable H2D to pinned H2D solves the problem.
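For illustration, here is a minimal sketch of the pageable-to-pinned change, assuming the copies are issued through PyTorch tensors. The helper name `h2d_pinned` is hypothetical and not taken from this PR; it only shows the general pattern of staging the host data in page-locked memory and then requesting a non-blocking copy.

```python
import torch


def h2d_pinned(host_tensor: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Copy a CPU tensor to the GPU through pinned (page-locked) host memory.

    Hypothetical helper for illustration only; not the code from this PR.
    """
    if not torch.cuda.is_available():
        # No GPU present: return the input unchanged so the sketch still runs.
        return host_tensor
    # Stage the data in page-locked host memory.
    pinned = host_tensor.pin_memory()
    # With a pinned source, the copy can be enqueued on the current stream
    # without making the CPU wait for it to finish.
    return pinned.to(device, non_blocking=True)


host = torch.arange(8, dtype=torch.int32)
dev = h2d_pinned(host)
```

Note that `pin_memory()` itself makes an extra host-side copy, so in a hot path it is usually better to allocate the host buffer as pinned once and reuse it. The pinned buffer must also not be overwritten until the copy has completed.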
/bot run
PR_Github #688 [ run ] triggered by Bot
PR_Github #688 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #577 completed with status: 'SUCCESS'
/bot run --add-multi-gpu-test
PR_Github #1096 [ run ] triggered by Bot
PR_Github #1096 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #836 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test
PR_Github #1165 [ run ] triggered by Bot
PR_Github #1165 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #874 completed with status: 'SUCCESS'
/bot reuse-pipeline
PR_Github #1176 [ reuse-pipeline ] triggered by Bot
PR_Github #1176 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #1165 for commit 2835f2b
/bot reuse-pipeline
PR_Github #1178 [ reuse-pipeline ] triggered by Bot
PR_Github #1178 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #1165 for commit 2efb7da