ggml-cpu: Support s390x SIMD Instruction Set

Open taronaeo opened this issue 14 hours ago • 1 comments

This pull request integrates the s390x SIMD instruction set into llama.cpp via vecintrin.h. SIMD implementations are currently provided for the following ggml_vec_dot functions:

| Function | Implementation | Remarks |
| --- | --- | --- |
| ggml_vec_dot_f32 | IMPLEMENTED | Noticed a hotspot at the assembly-level vector load call. Will be fixed in a separate PR. |
| ggml_vec_dot_f16 | IMPLEMENTED | Noticed a hotspot at the assembly-level vector load call. Will be fixed in a separate PR. |
| ggml_vec_dot_q4_0_q8_0 | IMPLEMENTED | |
| ggml_vec_dot_q4_1_q8_1 | IMPLEMENTED | |
| ggml_vec_dot_q8_0_q8_0 | IMPLEMENTED | |
| ggml_vec_dot_q4_K_q8_K | IMPLEMENTED | |
| ggml_vec_dot_q5_K_q8_K | IMPLEMENTED | |
| ggml_vec_dot_q6_K_q8_K | IMPLEMENTED | |
| ggml_vec_dot_iq4_nl_q8_0 | IMPLEMENTED | |
| ggml_vec_dot_iq4_xs_q8_K | IMPLEMENTED | |
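For readers unfamiliar with vecintrin.h, here is a minimal sketch of what an F32 dot product with these intrinsics can look like. It is not the code in this PR. The names `dot_f32`, `dot_f32_scalar`, and the opt-in macro `SKETCH_USE_ZVECTOR` are hypothetical and exist only for this illustration. The vector path assumes a compiler invoked with `-mzvector` and a target that supports float vectors (z14 or later, as far as I know). On any other platform the sketch falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>

/* SKETCH_USE_ZVECTOR is a hypothetical opt-in macro for this sketch only.
 * Define it when building on s390x with -mzvector and float vector support. */
#if defined(__s390x__) && defined(SKETCH_USE_ZVECTOR)
#include <vecintrin.h>
#define SKETCH_HAVE_VX 1
#else
#define SKETCH_HAVE_VX 0
#endif

/* Scalar reference: one multiply-accumulate per element. */
static float dot_f32_scalar(size_t n, const float *x, const float *y) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* Dot product that uses 4-lane float vectors when available. */
static float dot_f32(size_t n, const float *x, const float *y) {
    assert(n == 0 || (x != NULL && y != NULL));
#if SKETCH_HAVE_VX
    __vector float acc = vec_splats(0.0f);   /* four partial sums, all zero */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __vector float vx = vec_xl(0, x + i); /* load 4 floats from x */
        __vector float vy = vec_xl(0, y + i); /* load 4 floats from y */
        acc = vec_madd(vx, vy, acc);          /* acc += vx * vy, per lane */
    }
    /* Horizontal reduction of the four lanes. */
    float sum = acc[0] + acc[1] + acc[2] + acc[3];
    /* Scalar tail for the remaining n % 4 elements. */
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
#else
    return dot_f32_scalar(n, x, y);
#endif
}
```

Calling `dot_f32(10, x, y)` with `x = 1..10` and every `y[i] = 0.5f` gives 27.5 on either path. In general the vector path sums in a different order than the scalar loop, so results on arbitrary inputs may differ in the last bits.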

Verification

To check that this implementation does not break anything, the SIMD code paths were tested with the following models:

  • Tested IBM Granite 3.0 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
  • Tested IBM Granite 3.1 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
  • Suggestions for additional models to test in this PR are welcome

Performance Results

The performance results below use IBM Granite 3.1, as it has a better neural network than 3.0.

Before SIMD Instruction Set

| model | size | parameters | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | pp512 | 16.66 ± 0.01 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | pp512 | 16.30 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | pp512 | 23.31 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | pp512 | 26.52 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | pp512 | 29.73 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | pp512 | 23.91 ± 0.05 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | pp512 | 16.73 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | pp512 | 12.62 ± 0.01 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | pp512 | 23.88 ± 0.04 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | pp512 | 21.59 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | tg128 | 8.20 ± 0.07 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | tg128 | 9.70 ± 0.01 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | tg128 | 14.48 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | tg128 | 15.95 ± 0.06 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | tg128 | 19.80 ± 0.04 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | tg128 | 14.89 ± 0.06 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | tg128 | 10.94 ± 0.04 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | tg128 | 8.53 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | tg128 | 14.38 ± 0.07 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | tg128 | 13.22 ± 0.02 |

After SIMD Instruction Set

| model | size | parameters | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | pp512 | 85.46 ± 0.09 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | pp512 | 35.39 ± 0.13 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | pp512 | 121.46 ± 0.81 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | pp512 | 123.79 ± 0.40 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | pp512 | 137.36 ± 0.52 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | pp512 | 118.88 ± 0.56 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | pp512 | 111.65 ± 0.38 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | pp512 | 101.94 ± 0.59 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | pp512 | 94.28 ± 0.18 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | pp512 | 99.43 ± 0.87 |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | tg128 | 14.27 ± 0.29 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | tg128 | 13.97 ± 0.11 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | tg128 | 69.33 ± 1.41 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | tg128 | 65.97 ± 1.71 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | tg128 | 57.82 ± 0.60 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | tg128 | 72.14 ± 0.70 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | tg128 | 70.34 ± 0.69 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | tg128 | 63.45 ± 0.68 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | tg128 | 60.09 ± 1.33 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | tg128 | 66.48 ± 1.29 |
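To make the before/after comparison easier to read, the per-format speedup can be computed as the ratio of the mean t/s values. The sketch below hard-codes the means from the two result sets in this PR description (the ± deviations are ignored). The names `bench_row`, `speedup`, and `print_speedups` are hypothetical helpers for this illustration and are not part of llama.cpp or llama-bench.

```c
#include <stddef.h>
#include <stdio.h>

/* One quantization format with its mean tokens/second before and after. */
struct bench_row {
    const char *type;
    double before;
    double after;
};

/* pp512 (prompt processing) means copied from the results above. */
static const struct bench_row pp512_rows[] = {
    {"F32",    16.66,  85.46}, {"F16",    16.30,  35.39},
    {"Q4_0",   23.31, 121.46}, {"Q4_1",   26.52, 123.79},
    {"Q8_0",   29.73, 137.36}, {"Q4_K",   23.91, 118.88},
    {"Q5_K",   16.73, 111.65}, {"Q6_K",   12.62, 101.94},
    {"IQ4_NL", 23.88,  94.28}, {"IQ4_XS", 21.59,  99.43},
};

/* tg128 (text generation) means copied from the results above. */
static const struct bench_row tg128_rows[] = {
    {"F32",     8.20, 14.27}, {"F16",     9.70, 13.97},
    {"Q4_0",   14.48, 69.33}, {"Q4_1",   15.95, 65.97},
    {"Q8_0",   19.80, 57.82}, {"Q4_K",   14.89, 72.14},
    {"Q5_K",   10.94, 70.34}, {"Q6_K",    8.53, 63.45},
    {"IQ4_NL", 14.38, 60.09}, {"IQ4_XS", 13.22, 66.48},
};

/* Speedup factor: throughput after SIMD divided by throughput before. */
static double speedup(const struct bench_row *row) {
    return row->after / row->before;
}

/* Print one line per format for the given test name. */
static void print_speedups(const char *test,
                           const struct bench_row *rows, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        printf("%-6s %-7s %7.2f -> %7.2f t/s  (%.2fx)\n",
               test, rows[i].type, rows[i].before, rows[i].after,
               speedup(&rows[i]));
    }
}
```

Calling `print_speedups("pp512", pp512_rows, 10)` and `print_speedups("tg128", tg128_rows, 10)` prints the full list. As a spot check, Q6_K prompt processing goes from 12.62 to 101.94 t/s, a speedup of roughly 8.1x, while F16 prompt processing improves by roughly 2.2x.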

[!NOTE] Tests were conducted on an IBM z15 mainframe with 8 IFLs (cores) and 64 GB of memory in an LPAR.

Please review this pull request and consider merging into the main repository. Thank you!

taronaeo, Feb 22 '25 08:02