A better `packNibbles` and `mul_sum_i8_pairs_float` implementation using AVX512

Open MeouSker77 opened this issue 1 year ago • 2 comments

Use only three instructions to implement `packNibbles` when AVX512 is available. (`_mm256_cvtepi16_epi8` requires AVX512 support, specifically AVX512BW and AVX512VL.)

Apr 22 '23 09:04 MeouSker77
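For readers following along, the three-instruction idea might look roughly like the sketch below. It is not the code from the patch: the helper name `pack_nibbles_sketch`, the array-based interface, and the choice of shift and OR as the two steps before `_mm256_cvtepi16_epi8` are assumptions made for illustration. It also assumes every input byte already holds a value in 0..15, and it falls back to a scalar loop when AVX512BW and AVX512VL are not enabled at compile time.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__AVX512BW__) && defined(__AVX512VL__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not from the patch): packs 32 bytes, each holding one
 * nibble (0..15), into 16 bytes so that out[k] = in[2k] | (in[2k+1] << 4). */
static void pack_nibbles_sketch(const uint8_t in[32], uint8_t out[16]) {
#if defined(__AVX512BW__) && defined(__AVX512VL__)
    __m256i bytes = _mm256_loadu_si256((const __m256i *)in);
    /* Each 16-bit lane starts as 0000_abcd_0000_efgh. */
    const __m256i shifted = _mm256_srli_epi16(bytes, 4); /* 0000_0000_abcd_0000 */
    bytes = _mm256_or_si256(bytes, shifted);             /* 0000_abcd_abcd_efgh */
    /* Truncate each 16-bit lane to its low byte: abcd_efgh. */
    _mm_storeu_si128((__m128i *)out, _mm256_cvtepi16_epi8(bytes));
#else
    /* Scalar fallback with the same result, used when AVX512 is unavailable. */
    for (int k = 0; k < 16; k++) {
        assert(in[2 * k] <= 0x0F && in[2 * k + 1] <= 0x0F);
        out[k] = (uint8_t)(in[2 * k] | (in[2 * k + 1] << 4));
    }
#endif
}
```

The truncating conversion is what removes the separate mask, lane extraction, and pack steps: the stray high bits left behind by the OR sit in the upper byte of each lane and are simply discarded.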

With an AVX512 machine, you may want to look into using `_mm256_dpbssd_epi32` in `mul_sum_i8_pairs_float`, that could give another speed boost. (Preprocessor condition: `#if __AVXVNNIINT8__`)

Rebasing/merging latest master should fix the failing checks.

Apr 22 '23 12:04 sw
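As a rough illustration of the suggestion above, the sketch below shows what `_mm256_dpbssd_epi32` computes: signed 8-bit products summed in groups of four into 32-bit lanes, which are then converted to float. The helper name `dot4_i8_to_float_sketch` and the array-based interface are assumptions for this example and are not taken from llama.cpp. The intrinsic path is compiled only when `__AVXVNNIINT8__` is defined; otherwise a scalar loop produces the same values.

```c
#include <stdint.h>

#if defined(__AVXVNNIINT8__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not from llama.cpp): for each group of four signed
 * bytes, out[j] = x[4j]*y[4j] + ... + x[4j+3]*y[4j+3], returned as float. */
static void dot4_i8_to_float_sketch(const int8_t x[32], const int8_t y[32],
                                    float out[8]) {
#if defined(__AVXVNNIINT8__)
    const __m256i vx = _mm256_loadu_si256((const __m256i *)x);
    const __m256i vy = _mm256_loadu_si256((const __m256i *)y);
    /* Signed x signed byte products, four per 32-bit lane, added to zero. */
    const __m256i sums = _mm256_dpbssd_epi32(_mm256_setzero_si256(), vx, vy);
    _mm256_storeu_ps(out, _mm256_cvtepi32_ps(sums));
#else
    /* Scalar fallback with the same result. */
    for (int j = 0; j < 8; j++) {
        int32_t acc = 0;
        for (int k = 0; k < 4; k++) {
            acc += (int32_t)x[4 * j + k] * (int32_t)y[4 * j + k];
        }
        out[j] = (float)acc;
    }
#endif
}
```

Because the instruction accepts signed bytes on both sides, no sign-adjustment step is needed before the multiply, which is presumably where the hoped-for speed boost would come from.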

> With an AVX512 machine, you may want to look into using `_mm256_dpbssd_epi32` in `mul_sum_i8_pairs_float`, that could give another speed boost. (Preprocessor condition: `#if __AVXVNNIINT8__`)

Thank you very much for your suggestion!

Apr 22 '23 13:04 MeouSker77
