XNNPACK
QB4W SSE2/SSE41 GEMM Kernels
This pull request adds blockwise 4-bit (qb4w) GEMM microkernels targeting the x86 SSE2 and SSE4.1 instruction sets.
Note: This PR includes one commit from https://github.com/google/XNNPACK/pull/6557 (Test generation update for qb4w). I'm putting this PR up for review before that PR merges so that we can start the review process.
Tests and benchmarks were run on an Ice Lake Xeon processor. With block_size equal to KC, qb4w is functionally equivalent to QC4W, so QC4W provides a reasonable performance baseline.
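To illustrate why QC4W is a fair comparison point, here is a minimal scalar sketch of the two quantization schemes. It is not XNNPACK's packed weight layout or microkernel API: the function names, the one-nibble-per-byte storage, and the zero point of 8 are assumptions made for illustration, and the qd8 activation scale and zero point are left out. The blockwise version applies one scale per block of `bl` elements along K; when `bl == kc` there is a single block, and the result matches the channelwise version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical reference code, not XNNPACK's packing or kernel interface.
// w4 stores one unsigned 4-bit weight per byte (0..15), assumed zero point 8.
// scales holds kc / bl entries: one per block of bl consecutive K elements.
static float dot_qb4w_ref(const int8_t* a, const uint8_t* w4,
                          const float* scales, size_t kc, size_t bl) {
  assert(bl != 0 && kc % bl == 0);
  float acc = 0.0f;
  for (size_t kb = 0; kb < kc; kb += bl) {
    int32_t block_acc = 0;  // integer accumulation within one block
    for (size_t k = kb; k < kb + bl; k++) {
      block_acc += (int32_t) a[k] * ((int32_t) w4[k] - 8);
    }
    acc += (float) block_acc * scales[kb / bl];  // apply this block's scale
  }
  return acc;
}

// Channelwise (qc4w-style) reference: a single scale for the whole K dimension.
static float dot_qc4w_ref(const int8_t* a, const uint8_t* w4,
                          float scale, size_t kc) {
  int32_t acc = 0;
  for (size_t k = 0; k < kc; k++) {
    acc += (int32_t) a[k] * ((int32_t) w4[k] - 8);
  }
  return (float) acc * scale;
}
```

Calling `dot_qb4w_ref` with `bl == kc` and a one-element `scales` array should return the same value as `dot_qc4w_ref` with that scale. The extra work for smaller blocks is the per-block integer-to-float conversion and scale multiply inside the K loop, which is consistent with BL=32 trailing BL=256 in the tables below.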
SSE2 Benchmarks
Average of OPS, M = 128. Column headers give N and K.

Tile Size | Kernel | BL | Kernel Type | N=16, K=1024 | N=128, K=1024 | N=4096, K=1024 | N=11008, K=4096 | N=32000, K=4096
---|---|---|---|---|---|---|---|---
1x4c8 | qd8_f32_qb4w | 32 | sse2_ld128 | 16.3626 | 16.5004 | 16.4693 | 16.2591 | 15.1152
1x4c8 | qd8_f32_qb4w | 32 | sse2_ld64 | 13.8485 | 13.8654 | 13.7583 | 13.7609 | 12.9008
1x4c8 | qd8_f32_qb4w | 256 | sse2_ld128 | 19.1497 | 19.1257 | 19.2257 | 19.0563 | 17.9628
1x4c8 | qd8_f32_qb4w | 256 | sse2_ld64 | 15.5274 | 15.5863 | 15.541 | 15.4571 | 14.8878
1x4c8 | qd8_f32_qc4w | N/A | sse2_ld128 | 19.0017 | 19.3572 | 19.4491 | 19.1485 | 18.5355
1x4c8 | qd8_f32_qc4w | N/A | sse2_ld64 | 15.364 | 15.311 | 15.4119 | 15.3349 | 15.4244
2x4c8 | qd8_f32_qb4w | 32 | sse2_ld128 | 19.7986 | 19.7132 | 19.3471 | 19.874 | 19.457
2x4c8 | qd8_f32_qb4w | 32 | sse2_ld64 | 15.0903 | 16.8612 | 13.1044 | 16.8245 | 17.2448
2x4c8 | qd8_f32_qb4w | 256 | sse2_ld128 | 24.3207 | 24.276 | 24.263 | 24.386 | 23.4275
2x4c8 | qd8_f32_qb4w | 256 | sse2_ld64 | 16.5991 | 19.1659 | 17.5122 | 20.0571 | 21.1375
2x4c8 | qd8_f32_qc4w | N/A | sse2_ld128 | 24.7754 | 24.3957 | 24.3669 | 25.0426 | 24.9166
2x4c8 | qd8_f32_qc4w | N/A | sse2_ld64 | 21.6573 | 21.4722 | 21.3531 | 21.6695 | 21.7461
3x4c8 | qd8_f32_qb4w | 32 | sse2_ld128 | 21.0667 | 20.9242 | 20.4278 | 21.1425 | 20.9614
3x4c8 | qd8_f32_qb4w | 32 | sse2_ld64 | 19.7949 | 19.862 | 19.8228 | 19.78 | 19.6134
3x4c8 | qd8_f32_qb4w | 256 | sse2_ld128 | 26.2247 | 25.7603 | 26.5098 | 26.3606 | 26.2232
3x4c8 | qd8_f32_qb4w | 256 | sse2_ld64 | 24.0754 | 24.1035 | 24.1776 | 24.1109 | 23.6717
3x4c8 | qd8_f32_qc4w | N/A | sse2_ld128 | 26.8972 | 27.2868 | 27.1239 | 27.4936 | 27.3707
3x4c8 | qd8_f32_qc4w | N/A | sse2_ld64 | 24.008 | 23.7783 | 24.4216 | 24.9255 | 24.7481
4x4c8 | qd8_f32_qb4w | 32 | sse2_ld128 | 21.546 | 21.2863 | 21.9667 | 22.0366 | 21.8368
4x4c8 | qd8_f32_qb4w | 32 | sse2_ld64 | 20.2396 | 20.4829 | 20.3171 | 20.1945 | 20.327
4x4c8 | qd8_f32_qb4w | 256 | sse2_ld128 | 27.4331 | 27.4372 | 27.6257 | 27.7552 | 27.7811
4x4c8 | qd8_f32_qb4w | 256 | sse2_ld64 | 25.1659 | 25.7404 | 25.4879 | 25.5119 | 25.4176
4x4c8 | qd8_f32_qc4w | N/A | sse2_ld128 | 28.0353 | 28.4844 | 28.1422 | 28.431 | 28.4336
4x4c8 | qd8_f32_qc4w | N/A | sse2_ld64 | 26.1268 | 26.5393 | 26.4284 | 26.8376 | 26.746

SSE4.1 Benchmarks
Average of OPS, M = 128. Column headers give N and K.

Tile Size | Kernel | BL | Kernel Type | N=16, K=1024 | N=128, K=1024 | N=4096, K=1024 | N=11008, K=4096 | N=32000, K=4096
---|---|---|---|---|---|---|---|---
1x4c8 | qd8_f32_qb4w | 32 | sse41_ld128 | 17.3014 | 17.3535 | 17.2064 | 17.1379 | 16.0412
1x4c8 | qd8_f32_qb4w | 32 | sse41_ld64 | 16.9989 | 16.8812 | 16.9124 | 16.8923 | 16.2466
1x4c8 | qd8_f32_qb4w | 256 | sse41_ld128 | 20.1779 | 20.2345 | 20.2459 | 20.2033 | 19.4152
1x4c8 | qd8_f32_qb4w | 256 | sse41_ld64 | 19.4751 | 19.1835 | 19.4524 | 19.2826 | 18.5472
1x4c8 | qd8_f32_qc4w | N/A | sse41_ld128 | 20.4684 | 20.5874 | 20.1799 | 20.5715 | 18.9625
1x4c8 | qd8_f32_qc4w | N/A | sse41_ld64 | 19.6819 | 19.3431 | 19.7433 | 19.5436 | 19.2356
2x4c8 | qd8_f32_qb4w | 32 | sse41_ld128 | 21.2354 | 21.1893 | 21.1614 | 20.9931 | 20.8421
2x4c8 | qd8_f32_qb4w | 32 | sse41_ld64 | 21.3569 | 21.4488 | 21.235 | 20.9837 | 21.0178
2x4c8 | qd8_f32_qb4w | 256 | sse41_ld128 | 26.0017 | 25.9916 | 25.4473 | 25.5135 | 25.57
2x4c8 | qd8_f32_qb4w | 256 | sse41_ld64 | 25.2493 | 25.7025 | 26.085 | 25.9694 | 25.2185
2x4c8 | qd8_f32_qc4w | N/A | sse41_ld128 | 26.8566 | 27.1628 | 27.0692 | 27.3674 | 26.9184
2x4c8 | qd8_f32_qc4w | N/A | sse41_ld64 | 26.183 | 26.2211 | 26.139 | 26.5423 | 26.6486
3x4c8 | qd8_f32_qb4w | 32 | sse41_ld128 | 22.6084 | 22.5638 | 22.2168 | 22.7137 | 22.4851
3x4c8 | qd8_f32_qb4w | 32 | sse41_ld64 | 22.2074 | 22.2576 | 22.1917 | 22.2111 | 21.9517
3x4c8 | qd8_f32_qb4w | 256 | sse41_ld128 | 28.6587 | 28.6288 | 29.0569 | 28.9576 | 28.5065
3x4c8 | qd8_f32_qb4w | 256 | sse41_ld64 | 28.233 | 28.3912 | 28.5254 | 28.3961 | 28.1014
3x4c8 | qd8_f32_qc4w | N/A | sse41_ld128 | 29.2685 | 29.6141 | 29.612 | 29.7231 | 29.4312
3x4c8 | qd8_f32_qc4w | N/A | sse41_ld64 | 29.0769 | 29.047 | 29.7566 | 30.0988 | 29.7775
4x4c8 | qd8_f32_qb4w | 32 | sse41_ld128 | 23.4833 | 22.5307 | 23.0766 | 23.3969 | 23.209
4x4c8 | qd8_f32_qb4w | 32 | sse41_ld64 | 23.0014 | 23.5207 | 23.657 | 23.5848 | 23.4027
4x4c8 | qd8_f32_qb4w | 256 | sse41_ld128 | 30.0163 | 29.9047 | 30.2299 | 30.367 | 29.8178
4x4c8 | qd8_f32_qb4w | 256 | sse41_ld64 | 30.0503 | 29.1273 | 29.8003 | 30.3124 | 30.0438
4x4c8 | qd8_f32_qc4w | N/A | sse41_ld128 | 30.7884 | 31.3583 | 31.1508 | 31.4678 | 30.5934
4x4c8 | qd8_f32_qc4w | N/A | sse41_ld64 | 31.2101 | 30.8091 | 31.3049 | 31.2445 | 31.8041