feat: Support Moore Threads GPU
Moore Threads is a GPU startup whose foundational technology is MUSA (Moore Threads Unified System Architecture). This pull request adds initial MTGPU support to llama.cpp, using MUSA to accelerate LLM inference.
Similar to https://github.com/ggerganov/llama.cpp/pull/1087, CUDA APIs are replaced by MUSA APIs using macros, and a new build option is added to the Makefile and CMake builds.
```shell
# make
make GGML_MUSA=1

# CMake
cmake -B build -DGGML_MUSA=ON
cmake --build build --config Release
```
I also sent a PR to Ollama to integrate MTGPU support, and all tests were performed through Ollama. Tested models are:
- tinyllama:latest (1b)
- llama3:latest (8b)
- qwen2:72b
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [ ] Low
- [x] Medium
- [ ] High
I am one of the primary llama.cpp CUDA developers. I would in principle be willing to buy a Moore Threads GPU and to test any code changes I do in order to assert that they don't break MUSA. On the Moore Threads website I only see a "Buy Now" button for the MTT S80. Would testing and performance optimization on that GPU be representative of an MTT S4000?
Thank you for checking out this PR! Yes, the current code changes were tested on the MTT S4000 (--cuda-gpu-arch=mp_22) and this model of GPU only ships with our data center solution. I will test the code changes on the MTT S80 (--cuda-gpu-arch=mp_21) and let you know the results.
@JohannesGaessler @slaren I've addressed most of your comments—thank you again for the review. However, two comments related to compilation remain unresolved. I am currently collaborating with our compiler team to address these issues, but it may take longer than anticipated. Are there any other concerns regarding the remaining changes?
Eventually we should move all the HIP and MUSA-specific code to its own headers.
No problem. I can start working on this.
In an earlier post you said:
Thank you for checking out this PR! Yes, the current code changes were tested on the MTT S4000 (--cuda-gpu-arch=mp_22) and this model of GPU only ships with our data center solution. I will test the code changes on the MTT S80 (--cuda-gpu-arch=mp_21) and let you know the results.
Have there been any updates on this?
I've encountered some compilation issues on S80 toolchain and have opened several internal tickets to the compiler team. I'll monitor the progress and keep you updated.
The S80 toolchain (rc2.1.0_Intel_CPU_Ubuntu_quyuan) I used is publicly available but still in the RC stage. Please refer to the link.
make error
Ubuntu 20.04.6 LTS, MUSA driver 2.7.0, SDK: MUSA SDK rc2.1.0_Intel_CPU_Ubuntu_chunxiao
make GGML_MUSA=1
Please help me.
My video card is an MTT S80 and my CPU is an AMD 2600.
Please see the above comments:
I've encountered some compilation issues on S80 toolchain and have opened several internal tickets to the compiler team. I'll monitor the progress and keep you updated.
The S80 toolchain (rc2.1.0_Intel_CPU_Ubuntu_quyuan) I used is publicly available but still in the RC stage. Please refer to link
We are still investigating this issue internally. Please expect a new release of MUSA SDK and llama.cpp PR.
Any progress on the S80?
I guess we'll have to wait for the next version of the SDK.
Yes, please give us more time.
@yeahdongcn please help: I have a problem compiling llama.cpp using MUSA SDK rc2.0.0 on Ubuntu 20.04.6 LTS.
Running make GGML_MUSA=1 produces the following error:
Is there something I am doing wrong with the compilation?
@Ivening We are still working on MTT S80 support, please see: https://github.com/ggerganov/llama.cpp/pull/9526
If you are interested in running llama.cpp on MTT S80, please add me through WeChat: yeahdongcn.
@yeahdongcn thank you for your reply! Will this code work with MTT S3000?
Haha, it seems you're one of our business customers! The MTT S3000 shares the same architecture as the MTT S80, so I can test on the MTT S3000 as well.
@yeahdongcn what speeds can we expect for ~8B models for the MTT S80?
~15 tokens/s (llama3.1:8b)
@yeahdongcn very good, thank you