llama.cpp
A better `packNibbles` and `mul_sum_i8_pairs_float` implementation using AVX512
Use only three instructions to implement `packNibbles` when AVX512 is available. (`_mm256_cvtepi16_epi8` requires AVX512 support.)
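For readers following along, here is a minimal sketch of what a three-instruction `packNibbles` could look like; it is not necessarily the exact code in this PR. It assumes each 16-bit lane of the input holds two nibbles as `0000_abcd_0000_efgh`. The wrapper `pack_nibbles_32`, the helper name `pack_nibbles_avx512`, and the scalar fallback are hypothetical additions so the snippet also builds on machines without AVX512. As far as I know, the 256-bit form of `_mm256_cvtepi16_epi8` needs both AVX512BW and AVX512VL, so the sketch guards on those macros.

```c
#include <stdint.h>

#if defined(__AVX512BW__) && defined(__AVX512VL__)
#include <immintrin.h>

// Each 16-bit lane holds 0000_abcd_0000_efgh; each result byte is abcd_efgh.
static inline __m128i pack_nibbles_avx512(__m256i bytes) {
    const __m256i shifted = _mm256_srli_epi16(bytes, 4); // 0000_0000_abcd_0000
    bytes = _mm256_or_si256(bytes, shifted);             // 0000_abcd_abcd_efgh
    return _mm256_cvtepi16_epi8(bytes);                  // keep the low byte of each lane
}
#endif

// Hypothetical array wrapper (not part of llama.cpp): packs 32 nibble-valued
// bytes (each 0..15) into 16 bytes, out[i] = in[2*i] | (in[2*i + 1] << 4).
static void pack_nibbles_32(const uint8_t in[32], uint8_t out[16]) {
#if defined(__AVX512BW__) && defined(__AVX512VL__)
    const __m256i v = _mm256_loadu_si256((const __m256i *)in);
    _mm_storeu_si128((__m128i *)out, pack_nibbles_avx512(v));
#else
    // Scalar fallback with the same result, used when AVX512BW/VL is not enabled.
    for (int i = 0; i < 16; i++) {
        out[i] = (uint8_t)(in[2 * i] | (in[2 * i + 1] << 4));
    }
#endif
}
```

The shift and OR place both nibbles in the low byte of each lane, and the narrowing conversion then drops the high byte, so no separate shuffle or pack step is needed.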
With an AVX512 machine, you may want to look into using `_mm256_dpbssd_epi32` in `mul_sum_i8_pairs_float`, that could give another speed boost. (Preprocessor condition: `#if __AVXVNNIINT8__`)
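To make the suggestion concrete, here is a rough sketch of how `_mm256_dpbssd_epi32` might be used; treat it as an illustration, not the code that ended up in the repository. The instruction multiplies signed 8-bit values and accumulates each group of four adjacent products into a 32-bit lane. AVX-VNNI-INT8 is a separate feature flag from AVX512, which is why the dedicated `__AVXVNNIINT8__` macro is the guard. The names `mul_sum_i8_float_vnni` and `mul_sum_i8_32`, and the scalar fallback, are hypothetical and only there so the snippet builds everywhere.

```c
#include <stdint.h>

#if defined(__AVXVNNIINT8__)
#include <immintrin.h>

// Signed int8 x signed int8 products, four adjacent products summed per
// 32-bit lane, then converted to float.
static inline __m256 mul_sum_i8_float_vnni(__m256i x, __m256i y) {
    const __m256i zero = _mm256_setzero_si256();
    const __m256i sums = _mm256_dpbssd_epi32(zero, x, y); // accumulate into a zero vector
    return _mm256_cvtepi32_ps(sums);
}
#endif

// Hypothetical array wrapper (not part of llama.cpp):
// out[lane] = sum over k in 0..3 of x[4*lane + k] * y[4*lane + k].
static void mul_sum_i8_32(const int8_t x[32], const int8_t y[32], float out[8]) {
#if defined(__AVXVNNIINT8__)
    const __m256i vx = _mm256_loadu_si256((const __m256i *)x);
    const __m256i vy = _mm256_loadu_si256((const __m256i *)y);
    _mm256_storeu_ps(out, mul_sum_i8_float_vnni(vx, vy));
#else
    // Scalar fallback with the same result, used when AVX-VNNI-INT8 is not enabled.
    for (int lane = 0; lane < 8; lane++) {
        int32_t acc = 0;
        for (int k = 0; k < 4; k++) {
            acc += (int32_t)x[4 * lane + k] * (int32_t)y[4 * lane + k];
        }
        out[lane] = (float)acc;
    }
#endif
}
```

Because both operands are treated as signed, this path would not need the sign-adjustment step that an unsigned-by-signed multiply such as `_mm256_maddubs_epi16` requires.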
Rebasing/merging latest master should fix the failing checks.
Thank you very much for your suggestion!