llama.cpp
ggml-cpu: Support s390x SIMD Instruction Set
This pull request integrates the SIMD instruction set, accessed through vecintrin.h,
into llama.cpp on the s390x platform.
SIMD implementations are currently provided for the following ggml_vec_dot
functions:
Function | Implementation | Remarks |
---|---|---|
ggml_vec_dot_f32 | IMPLEMENTED | A hotspot was observed at the vector load assembly call. Will be fixed in another PR. |
ggml_vec_dot_f16 | IMPLEMENTED | A hotspot was observed at the vector load assembly call. Will be fixed in another PR. |
ggml_vec_dot_q4_0_q8_0 | IMPLEMENTED | |
ggml_vec_dot_q4_1_q8_1 | IMPLEMENTED | |
ggml_vec_dot_q8_0_q8_0 | IMPLEMENTED | |
ggml_vec_dot_q4_K_q8_K | IMPLEMENTED | |
ggml_vec_dot_q5_K_q8_K | IMPLEMENTED | |
ggml_vec_dot_q6_K_q8_K | IMPLEMENTED | |
ggml_vec_dot_iq4_nl_q8_0 | IMPLEMENTED | |
ggml_vec_dot_iq4_xs_q8_K | IMPLEMENTED | |
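
For readers unfamiliar with vecintrin.h, here is a minimal sketch of what a vectorized f32 dot product can look like. It is not the code in this PR: the name `sketch_vec_dot_f32` and the `SKETCH_HAVE_ZVECTOR` guard are hypothetical, and the vector path assumes GCC or Clang with `-mzvector` targeting z14 or newer, where float vectors are available. On any other target it falls back to a plain scalar loop.

```c
#include <stddef.h>

/* Assumption: __VEC__ is defined when the compiler is run with -mzvector. */
#if defined(__s390x__) && defined(__VEC__)
#include <vecintrin.h>
#define SKETCH_HAVE_ZVECTOR 1
#else
#define SKETCH_HAVE_ZVECTOR 0
#endif

/* Hypothetical helper (not the PR's implementation):
 * returns the dot product of x[0..n) and y[0..n). */
float sketch_vec_dot_f32(size_t n, const float *x, const float *y) {
    float sum = 0.0f;
    size_t i = 0;

#if SKETCH_HAVE_ZVECTOR
    /* Four float lanes per 128-bit vector register. */
    __vector float acc = vec_splats(0.0f);
    for (; i + 4 <= n; i += 4) {
        __vector float vx = vec_xl(0, x + i); /* unaligned vector load */
        __vector float vy = vec_xl(0, y + i);
        acc = vec_madd(vx, vy, acc);          /* acc += vx * vy, per lane */
    }
    /* Horizontal reduction of the four lanes. */
    sum = acc[0] + acc[1] + acc[2] + acc[3];
#endif

    /* Scalar tail (or the whole array when no vector path is compiled in). */
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

A single accumulator keeps the sketch short; a tuned kernel would typically unroll over several accumulators to hide the latency of the multiply-add.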
Verification
To ensure that this implementation does not break anything, the SIMD code paths have been tested with the following models:
- IBM Granite 3.0 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
- IBM Granite 3.1 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
- Suggestions for additional models to test in this PR are welcome
Performance Results
I am using IBM Granite 3.1 for the performance results, as it has a better neural network than 3.0.
Before SIMD Instruction Set
model | size | parameters | backend | threads | test | t/s |
---|---|---|---|---|---|---|
Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | pp512 | 16.66 ± 0.01 |
Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | pp512 | 16.30 ± 0.02 |
Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | pp512 | 23.31 ± 0.02 |
Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | pp512 | 26.52 ± 0.03 |
Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | pp512 | 29.73 ± 0.03 |
Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | pp512 | 23.91 ± 0.05 |
Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | pp512 | 16.73 ± 0.02 |
Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | pp512 | 12.62 ± 0.01 |
Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | pp512 | 23.88 ± 0.04 |
Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | pp512 | 21.59 ± 0.03 |
Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | tg128 | 8.20 ± 0.07 |
Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | tg128 | 9.70 ± 0.01 |
Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | tg128 | 14.48 ± 0.03 |
Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | tg128 | 15.95 ± 0.06 |
Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | tg128 | 19.80 ± 0.04 |
Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | tg128 | 14.89 ± 0.06 |
Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | tg128 | 10.94 ± 0.04 |
Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | tg128 | 8.53 ± 0.02 |
Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | tg128 | 14.38 ± 0.07 |
Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | tg128 | 13.22 ± 0.02 |
After SIMD Instruction Set
model | size | parameters | backend | threads | test | t/s |
---|---|---|---|---|---|---|
Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | pp512 | 85.46 ± 0.09 |
Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | pp512 | 35.39 ± 0.13 |
Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | pp512 | 121.46 ± 0.81 |
Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | pp512 | 123.79 ± 0.40 |
Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | pp512 | 137.36 ± 0.52 |
Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | pp512 | 118.88 ± 0.56 |
Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | pp512 | 111.65 ± 0.38 |
Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | pp512 | 101.94 ± 0.59 |
Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | pp512 | 94.28 ± 0.18 |
Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | pp512 | 99.43 ± 0.87 |
Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | tg128 | 14.27 ± 0.29 |
Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | tg128 | 13.97 ± 0.11 |
Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | tg128 | 69.33 ± 1.41 |
Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | tg128 | 65.97 ± 1.71 |
Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | tg128 | 57.82 ± 0.60 |
Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | tg128 | 72.14 ± 0.70 |
Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | tg128 | 70.34 ± 0.69 |
Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | tg128 | 63.45 ± 0.68 |
Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | tg128 | 60.09 ± 1.33 |
Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | tg128 | 66.48 ± 1.29 |
> [!NOTE]
> Tests were conducted on an IBM z15 mainframe with 8 IFLs (cores) and 64 GB of memory, on an LPAR.
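
To make the before/after comparison easier to read, the speedup per quantization type can be computed as the ratio of the two t/s columns. The sketch below is only a convenience for reviewers: the struct, array, and function names are hypothetical, and the numbers are copied from the two tables above (mean values only, ignoring the ± spread).

```c
#include <stddef.h>
#include <stdio.h>

/* One benchmark row: mean tokens per second before and after the SIMD change. */
struct bench_row {
    const char *quant;
    double before_tps;
    double after_tps;
};

/* Prompt processing (pp512), values copied from the tables above. */
static const struct bench_row pp512_rows[] = {
    {"F32",    16.66,  85.46}, {"F16",    16.30,  35.39},
    {"Q4_0",   23.31, 121.46}, {"Q4_1",   26.52, 123.79},
    {"Q8_0",   29.73, 137.36}, {"Q4_K",   23.91, 118.88},
    {"Q5_K",   16.73, 111.65}, {"Q6_K",   12.62, 101.94},
    {"IQ4_NL", 23.88,  94.28}, {"IQ4_XS", 21.59,  99.43},
};

/* Text generation (tg128), values copied from the tables above. */
static const struct bench_row tg128_rows[] = {
    {"F32",     8.20, 14.27}, {"F16",     9.70, 13.97},
    {"Q4_0",   14.48, 69.33}, {"Q4_1",   15.95, 65.97},
    {"Q8_0",   19.80, 57.82}, {"Q4_K",   14.89, 72.14},
    {"Q5_K",   10.94, 70.34}, {"Q6_K",    8.53, 63.45},
    {"IQ4_NL", 14.38, 60.09}, {"IQ4_XS", 13.22, 66.48},
};

/* Speedup factor: how many times faster the "after" run is. */
double speedup(double before_tps, double after_tps) {
    return after_tps / before_tps;
}

/* Print one "<test> <quant>: <factor>x" line per row. */
void print_speedups(const char *test, const struct bench_row *rows, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        printf("%s %-7s %.2fx\n", test, rows[i].quant,
               speedup(rows[i].before_tps, rows[i].after_tps));
    }
}
```

Calling `print_speedups("pp512", pp512_rows, sizeof pp512_rows / sizeof pp512_rows[0])` from a `main` function (and likewise for `tg128_rows`) lists the factors. For example, F32 prompt processing works out to roughly 5.1x.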
Please review this pull request and consider merging into the main repository. Thank you!