
[Bug Fix] Support threads_per_head < 64 for wavefront size of 64

Open · jagadish-amd opened this pull request 1 year ago

When launching the apply_rotary_pos_half kernel, only threads_per_head == 64 is supported when the wavefront size is 64. This change adds support for threads_per_head < 64, such as 4, 8, and 16.

It does so by removing the ROCm-specific wavefront-size check from the kernel launch path.
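
For context, a minimal sketch of the host-side dispatch pattern this touches (the kernel name, the threads_per_head ladder, and the heads_per_block constant below are illustrative assumptions, not the verbatim DeepSpeed source): threads_per_head is derived from half the rotary dimension and selects a kernel instantiation, and previously a ROCm wavefront-64 build took a separate branch that accepted only threads_per_head == 64.

```cpp
// Minimal sketch of the dispatch pattern (illustrative names, not the
// verbatim DeepSpeed source). Each head is rotated by a tile of
// `threadsPerHead` lanes; the host launcher picks the instantiation.
#include <cassert>
#include <cuda_runtime.h>

template <int threadsPerHead>
__global__ void apply_rotary_pos_half_sketch(float* q, float* k, int half_dim)
{
    // ... rotate half_dim (cos, sin) pairs per head, with threadsPerHead
    // lanes cooperating on each head ...
}

void launch_sketch(float* q, float* k, unsigned rotary_dim, cudaStream_t stream)
{
    const int half_dim = rotary_dim >> 1;
    // Tile width cooperating on one head, rounded up to a power of two.
    const int threads_per_head = (half_dim <= 4)    ? 4
                                 : (half_dim <= 8)  ? 8
                                 : (half_dim <= 16) ? 16
                                                    : 32;
    constexpr int heads_per_block = 8;  // illustrative choice
    const dim3 block(threads_per_head * heads_per_block);
    const dim3 grid(1);  // sized from batch * num_heads in the real launcher

    // Before this fix, a ROCm wavefront-64 build branched into a separate
    // path that accepted only threads_per_head == 64 and hit assert(false)
    // otherwise; removing that guard lets the cases below dispatch
    // identically on CUDA and ROCm.
    switch (threads_per_head) {
        case 4:  apply_rotary_pos_half_sketch<4><<<grid, block, 0, stream>>>(q, k, half_dim); break;
        case 8:  apply_rotary_pos_half_sketch<8><<<grid, block, 0, stream>>>(q, k, half_dim); break;
        case 16: apply_rotary_pos_half_sketch<16><<<grid, block, 0, stream>>>(q, k, half_dim); break;
        case 32: apply_rotary_pos_half_sketch<32><<<grid, block, 0, stream>>>(q, k, half_dim); break;
        default: assert(false);  // the host-side abort seen in the logs below
    }
}
```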

jagadish-amd · Oct 11 '24

ping @jithunnair-amd @jeffdaily @loadams

jagadish-amd · Oct 14 '24

@loadams any comments on this PR?

jagadish-amd · Oct 17 '24

LGTM, but let's add a unit test to ensure this functionality can be tested on ROCm (and CUDA)

@jagadish-amd - thoughts on adding unit tests for this?

loadams · Oct 30 '24

I will add the unit tests. Thanks

jagadish-amd · Oct 31 '24

@loadams I have added a test case for the threads_per_head / warp-size alignment issue. Unfortunately, I lost access to the AI model / node on which the "Assertion `false' failed" error had triggered with warp_size 64, so the values in the test cases are assumed here, but they still exercise the intended fix. I will add the exact values in the future. I noticed that the files in unit/ops/transformer/inference were refactored; I have used the InferenceBuilder().load() approach to test the apply_rotary_pos_emb function. If this is not right, please let me know and we can merge the PR without the newly added test (folks can add test cases later; this change affects only the warp_size 64 case, i.e., AMD Instinct devices).

These are the results.

On warp_size = 32, the test cases pass regardless of the changes in apply_rotary_pos_emb.cu:

==================================== 4 passed, 2 warnings in 25.24s ====================================

On warp_size = 64, with the fix:

==================================== 4 passed, 2 warnings in 5.58s =====================================

On warp_size = 64, without the fix:

python: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.hip:169: void launch_apply_rotary_pos_emb(T *, T *, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, float, hipStream_t, int) [T = float]: Assertion `false' failed.
Fatal Python error: Aborted

The test run is aborted (as expected) due to the assertion in the kernel launch path. I'm not sure whether there is a better way to handle this?

jagadish-amd · Nov 04 '24
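
An editorial aside on the abort question above: DeepSpeed's inference ops are pybind11 extensions, and pybind11 translates a thrown std::runtime_error into a Python RuntimeError. So one possible alternative to a host-side assert(false) is a sketch like the following (illustrative only, not what this PR or the DeepSpeed source does), which would let a test catch the failure with pytest.raises(RuntimeError) instead of aborting the interpreter:

```cpp
// Sketch only (not the DeepSpeed source): replace the host-side
// assert(false) with a thrown exception so Python callers see a
// catchable RuntimeError instead of a process abort.
#include <stdexcept>
#include <string>

void dispatch_or_throw(int threads_per_head)
{
    switch (threads_per_head) {
        case 4: case 8: case 16: case 32:
            // ... launch the matching kernel instantiation ...
            break;
        default:
            // pybind11 surfaces std::runtime_error as a Python RuntimeError,
            // so a test can assert on it with pytest.raises(RuntimeError).
            throw std::runtime_error("apply_rotary_pos_emb: unsupported threads_per_head = " +
                                     std::to_string(threads_per_head));
    }
}
```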

@jagadish-amd - this should be fine for now. I believe the only remaining thing for this PR is the CLA agreement; you should just need to reply to it with accept and company as AMD.

loadams · Nov 04 '24

@jagadish-amd please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

@microsoft-github-policy-service agree company="AMD"

jagadish-amd · Nov 04 '24