llama.cpp ggml-cpu: Support s390x SIMD Instruction Set

Open taronaeo opened this issue 14 hours ago • 1 comments

This pull request integrates the s390x SIMD instruction set into llama.cpp via vecintrin.h. SIMD code paths are currently implemented for the following ggml_vec_dot functions:

Function	Status	Remarks
ggml_vec_dot_f32	IMPLEMENTED	A hotspot was noticed at the vector load assembly call. To be fixed in a separate PR.
ggml_vec_dot_f16	IMPLEMENTED	A hotspot was noticed at the vector load assembly call. To be fixed in a separate PR.
ggml_vec_dot_q4_0_q8_0	IMPLEMENTED
ggml_vec_dot_q4_1_q8_1	IMPLEMENTED
ggml_vec_dot_q8_0_q8_0	IMPLEMENTED
ggml_vec_dot_q4_K_q8_K	IMPLEMENTED
ggml_vec_dot_q5_K_q8_K	IMPLEMENTED
ggml_vec_dot_q6_K_q8_K	IMPLEMENTED
ggml_vec_dot_iq4_nl_q8_0	IMPLEMENTED
ggml_vec_dot_iq4_xs_q8_K	IMPLEMENTED
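
For readers unfamiliar with vecintrin.h, the following is a minimal sketch of what an f32 dot product with a vector path and a scalar fallback could look like. It is not the code from this PR. The function name `sketch_vec_dot_f32` and the macro `SKETCH_HAVE_ZVECTOR` are hypothetical, and the sketch assumes GCC's `-mzvector` mode (which, to my understanding, defines `__VEC__`) on z14 or newer, where `__vector float` is available.

```c
#include <stddef.h>

/* Assumption: GCC with -mzvector on s390x defines __VEC__. On any other
 * target the scalar fallback below is compiled instead. */
#if defined(__s390x__) && defined(__VEC__)
#include <vecintrin.h>
#define SKETCH_HAVE_ZVECTOR 1
#else
#define SKETCH_HAVE_ZVECTOR 0
#endif

/* Hypothetical helper, not the ggml implementation: dot product of two
 * float arrays of length n. */
float sketch_vec_dot_f32(const float *x, const float *y, size_t n) {
    float sum = 0.0f;
    size_t i = 0;

#if SKETCH_HAVE_ZVECTOR
    /* Process four floats per iteration with a fused multiply-add. */
    __vector float acc = vec_splats(0.0f);
    for (; i + 4 <= n; i += 4) {
        __vector float vx = vec_xl(0, x + i);
        __vector float vy = vec_xl(0, y + i);
        acc = vec_madd(vx, vy, acc);
    }
    /* Horizontal reduction of the four lanes. */
    sum = acc[0] + acc[1] + acc[2] + acc[3];
#endif

    /* Scalar tail (or the whole loop when no vector path is available). */
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

The scalar tail handles lengths that are not a multiple of four, so the same function works for any `n` on both paths.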

Verification

To check that this implementation does not break anything, the SIMD code paths were tested with the following models:

Tested IBM Granite 3.0 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
Tested IBM Granite 3.1 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
Suggestions for additional models to test in this PR are welcome
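
Model-level testing can be complemented by a kernel-level comparison, since a SIMD kernel accumulates in a different order than a scalar loop and float results may differ in the last bits. The sketch below is an illustration only, not part of this PR: `dot_ref`, `dot_lanes4`, and `close_enough` are hypothetical names, and `dot_lanes4` merely emulates four-lane accumulation in portable C.

```c
#include <stdbool.h>

/* Reference: scalar dot product accumulated in double precision. */
double dot_ref(const float *x, const float *y, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += (double)x[i] * (double)y[i];
    }
    return sum;
}

/* Emulates a 4-lane SIMD kernel: each lane keeps its own partial sum,
 * and the lanes are combined at the end. */
float dot_lanes4(const float *x, const float *y, int n) {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        for (int l = 0; l < 4; ++l) {
            lane[l] += x[i + l] * y[i + l];
        }
    }
    float sum = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; i < n; ++i) {
        sum += x[i] * y[i]; /* scalar tail */
    }
    return sum;
}

/* True when got is within rel_tol of ref, relative to max(|ref|, 1). */
bool close_enough(double ref, double got, double rel_tol) {
    double diff  = ref > got ? ref - got : got - ref;
    double mag   = ref < 0.0 ? -ref : ref;
    double scale = mag > 1.0 ? mag : 1.0;
    return diff <= rel_tol * scale;
}
```

A relative tolerance is used rather than exact equality because reordering float additions is not guaranteed to give bit-identical results.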

Performance Results

I am using IBM Granite 3.1 for the performance results, as it is an improved model over 3.0.

Before SIMD Instruction Set

model	size	parameters	backend	threads	test	t/s
Granite-3.1-1B-A400M-Instruct-BE-F32	4.97 GiB	1.33 B	BLAS	8	pp512	16.66 ± 0.01
Granite-3.1-1B-A400M-Instruct-BE-F16	2.49 GiB	1.33 B	BLAS	8	pp512	16.30 ± 0.02
Granite-3.1-1B-A400M-Instruct-BE-Q4_0	731.07 MiB	1.33 B	BLAS	8	pp512	23.31 ± 0.02
Granite-3.1-1B-A400M-Instruct-BE-Q4_1	807.57 MiB	1.33 B	BLAS	8	pp512	26.52 ± 0.03
Granite-3.1-1B-A400M-Instruct-BE-Q8_0	1.32 GiB	1.33 B	BLAS	8	pp512	29.73 ± 0.03
Granite-3.1-1B-A400M-Instruct-BE-Q4_K	782.12 MiB	1.33 B	BLAS	8	pp512	23.91 ± 0.05
Granite-3.1-1B-A400M-Instruct-BE-Q5_K	910.37 MiB	1.33 B	BLAS	8	pp512	16.73 ± 0.02
Granite-3.1-1B-A400M-Instruct-BE-Q6_K	1.02 GiB	1.33 B	BLAS	8	pp512	12.62 ± 0.01
Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL	737.07 MiB	1.33 B	BLAS	8	pp512	23.88 ± 0.04
Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS	700.32 MiB	1.33 B	BLAS	8	pp512	21.59 ± 0.03
Granite-3.1-1B-A400M-Instruct-BE-F32	4.97 GiB	1.33 B	BLAS	8	tg128	8.20 ± 0.07
Granite-3.1-1B-A400M-Instruct-BE-F16	2.49 GiB	1.33 B	BLAS	8	tg128	9.70 ± 0.01
Granite-3.1-1B-A400M-Instruct-BE-Q4_0	731.07 MiB	1.33 B	BLAS	8	tg128	14.48 ± 0.03
Granite-3.1-1B-A400M-Instruct-BE-Q4_1	807.57 MiB	1.33 B	BLAS	8	tg128	15.95 ± 0.06
Granite-3.1-1B-A400M-Instruct-BE-Q8_0	1.32 GiB	1.33 B	BLAS	8	tg128	19.80 ± 0.04
Granite-3.1-1B-A400M-Instruct-BE-Q4_K	782.12 MiB	1.33 B	BLAS	8	tg128	14.89 ± 0.06
Granite-3.1-1B-A400M-Instruct-BE-Q5_K	910.37 MiB	1.33 B	BLAS	8	tg128	10.94 ± 0.04
Granite-3.1-1B-A400M-Instruct-BE-Q6_K	1.02 GiB	1.33 B	BLAS	8	tg128	8.53 ± 0.02
Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL	737.07 MiB	1.33 B	BLAS	8	tg128	14.38 ± 0.07
Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS	700.32 MiB	1.33 B	BLAS	8	tg128	13.22 ± 0.02

After SIMD Instruction Set

model	size	parameters	backend	threads	test	t/s
Granite-3.1-1B-A400M-Instruct-BE-F32	4.97 GiB	1.33 B	BLAS	8	pp512	85.46 ± 0.09
Granite-3.1-1B-A400M-Instruct-BE-F16	2.49 GiB	1.33 B	BLAS	8	pp512	35.39 ± 0.13
Granite-3.1-1B-A400M-Instruct-BE-Q4_0	731.07 MiB	1.33 B	BLAS	8	pp512	121.46 ± 0.81
Granite-3.1-1B-A400M-Instruct-BE-Q4_1	807.57 MiB	1.33 B	BLAS	8	pp512	123.79 ± 0.40
Granite-3.1-1B-A400M-Instruct-BE-Q8_0	1.32 GiB	1.33 B	BLAS	8	pp512	137.36 ± 0.52
Granite-3.1-1B-A400M-Instruct-BE-Q4_K	782.12 MiB	1.33 B	BLAS	8	pp512	118.88 ± 0.56
Granite-3.1-1B-A400M-Instruct-BE-Q5_K	910.37 MiB	1.33 B	BLAS	8	pp512	111.65 ± 0.38
Granite-3.1-1B-A400M-Instruct-BE-Q6_K	1.02 GiB	1.33 B	BLAS	8	pp512	101.94 ± 0.59
Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL	737.07 MiB	1.33 B	BLAS	8	pp512	94.28 ± 0.18
Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS	700.32 MiB	1.33 B	BLAS	8	pp512	99.43 ± 0.87
Granite-3.1-1B-A400M-Instruct-BE-F32	4.97 GiB	1.33 B	BLAS	8	tg128	14.27 ± 0.29
Granite-3.1-1B-A400M-Instruct-BE-F16	2.49 GiB	1.33 B	BLAS	8	tg128	13.97 ± 0.11
Granite-3.1-1B-A400M-Instruct-BE-Q4_0	731.07 MiB	1.33 B	BLAS	8	tg128	69.33 ± 1.41
Granite-3.1-1B-A400M-Instruct-BE-Q4_1	807.57 MiB	1.33 B	BLAS	8	tg128	65.97 ± 1.71
Granite-3.1-1B-A400M-Instruct-BE-Q8_0	1.32 GiB	1.33 B	BLAS	8	tg128	57.82 ± 0.60
Granite-3.1-1B-A400M-Instruct-BE-Q4_K	782.12 MiB	1.33 B	BLAS	8	tg128	72.14 ± 0.70
Granite-3.1-1B-A400M-Instruct-BE-Q5_K	910.37 MiB	1.33 B	BLAS	8	tg128	70.34 ± 0.69
Granite-3.1-1B-A400M-Instruct-BE-Q6_K	1.02 GiB	1.33 B	BLAS	8	tg128	63.45 ± 0.68
Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL	737.07 MiB	1.33 B	BLAS	8	tg128	60.09 ± 1.33
Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS	700.32 MiB	1.33 B	BLAS	8	tg128	66.48 ± 1.29
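
To make the before/after comparison easier to read, the speedup for a row can be computed as the ratio of the two t/s values. The following is a small sketch with a handful of rows copied from the tables above; the names `bench_row`, `speedup`, `sample_rows`, and `print_speedups` are hypothetical helpers for illustration.

```c
#include <stdio.h>

/* One benchmark row: quantization type, test name, and throughput
 * (tokens per second) before and after the SIMD changes. */
typedef struct {
    const char *quant;
    const char *test;
    double before_tps;
    double after_tps;
} bench_row;

/* Speedup factor: throughput after divided by throughput before. */
double speedup(double before_tps, double after_tps) {
    return after_tps / before_tps;
}

/* A few rows taken from the tables above (mean values only). */
static const bench_row sample_rows[] = {
    {"F32",  "pp512", 16.66,  85.46},
    {"F16",  "pp512", 16.30,  35.39},
    {"Q6_K", "pp512", 12.62, 101.94},
    {"Q4_K", "tg128", 14.89,  72.14},
};

/* Prints one line per row with the computed speedup factor. */
void print_speedups(const bench_row *rows, int n) {
    for (int i = 0; i < n; ++i) {
        printf("%-6s %-6s %.2fx\n", rows[i].quant, rows[i].test,
               speedup(rows[i].before_tps, rows[i].after_tps));
    }
}
```

Calling `print_speedups(sample_rows, 4)` from a `main` function prints the factor for each sample row. The ± deviations from the tables are ignored here.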

[!NOTE] Tests were conducted in an LPAR on an IBM z15 mainframe with 8 IFLs (cores) and 64 GB of memory.

Please review this pull request and consider merging into the main repository. Thank you!

Feb 22 '25 08:02 taronaeo
