
A better `packNibbles` and `mul_sum_i8_pairs_float` implementation using AVX512

Open MeouSker77 opened this issue 1 year ago • 2 comments

Use only three instructions to implement `packNibbles` when AVX512 is available. (`_mm256_cvtepi16_epi8` requires AVX512 support, specifically AVX512BW and AVX512VL.)

MeouSker77, Apr 22 '23 09:04
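For readers following along, here is a minimal sketch of what a three-instruction `packNibbles` could look like: shift each 16-bit lane right by 4, OR it with the original, then narrow with `_mm256_cvtepi16_epi8`. It assumes the usual input layout in which every byte holds one 4-bit value. The function names `pack_nibbles_avx512` and `pack_nibbles_scalar` and the `__AVX512BW__`/`__AVX512VL__` guard are illustrative assumptions, not necessarily what the patch uses. The scalar function models the same three steps so the logic can be checked on any machine.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__AVX512BW__) && defined(__AVX512VL__)
#include <immintrin.h>

// Hypothetical three-instruction variant: shift, OR, narrowing convert.
// Each 16-bit lane holds 0000_abcd_0000_efgh (one nibble per byte).
static inline __m128i pack_nibbles_avx512(__m256i bytes) {
    const __m256i shifted = _mm256_srli_epi16(bytes, 4); // 0000_0000_abcd_0000
    bytes = _mm256_or_si256(bytes, shifted);             // 0000_abcd_abcd_efgh
    return _mm256_cvtepi16_epi8(bytes);                  // truncate to abcd_efgh
}
#endif

// Portable scalar model of the same three steps, one 16-bit lane at a time.
// in[2*i] is the low byte of lane i and in[2*i+1] the high byte (little-endian).
static void pack_nibbles_scalar(const uint8_t in[32], uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        const uint8_t lo = in[2 * i];
        const uint8_t hi = in[2 * i + 1];
        assert(lo < 16 && hi < 16);             // input bytes must be nibbles
        uint16_t lane = (uint16_t)(lo | (hi << 8));
        lane = (uint16_t)(lane | (lane >> 4));  // shift + OR
        out[i] = (uint8_t)lane;                 // keep only the low byte
    }
}
```

The narrowing convert simply discards the high byte of each lane, so the leftover `0000_abcd` in the upper byte after the OR needs no masking.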

With an AVX512 machine, you may want to look into using `_mm256_dpbssd_epi32` in `mul_sum_i8_pairs_float`, which could give another speed boost. (Preprocessor condition: `#if __AVXVNNIINT8__`)

Rebasing/merging latest master should fix the failing checks.

sw, Apr 22 '23 12:04
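As a rough illustration of the suggestion above, `_mm256_dpbssd_epi32` multiplies groups of four signed 8-bit values from its two inputs and adds the four products into a 32-bit accumulator, which is the operation `mul_sum_i8_pairs_float` needs before converting to float. The sketch below is an assumption about how that could be wired up, not the project's actual code. The `_vnni` and `_scalar` function names are hypothetical, and the scalar version is a reference model for checking the arithmetic on hardware without the extension.

```c
#include <stdint.h>

#if defined(__AVXVNNIINT8__)
#include <immintrin.h>

// Hypothetical VNNI path: signed-by-signed 8-bit dot product into 32-bit lanes.
static inline __m256 mul_sum_i8_pairs_float_vnni(const __m256i x, const __m256i y) {
    const __m256i zero = _mm256_setzero_si256();          // start from a zero accumulator
    const __m256i sums = _mm256_dpbssd_epi32(zero, x, y); // 8 lanes, 4 products each
    return _mm256_cvtepi32_ps(sums);                      // int32 -> float
}
#endif

// Portable scalar reference: each of the 8 output lanes is the sum of four
// signed 8-bit products, converted to float.
static void mul_sum_i8_pairs_float_scalar(const int8_t x[32], const int8_t y[32],
                                          float out[8]) {
    for (int lane = 0; lane < 8; lane++) {
        int32_t acc = 0;
        for (int k = 0; k < 4; k++) {
            acc += (int32_t)x[4 * lane + k] * (int32_t)y[4 * lane + k];
        }
        out[lane] = (float)acc;
    }
}
```

Because both operands are treated as signed, this path would not need the absolute-value and sign-transfer step that an unsigned-by-signed multiply requires.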

> With an AVX512 machine, you may want to look into using `_mm256_dpbssd_epi32` in `mul_sum_i8_pairs_float`, which could give another speed boost. (Preprocessor condition: `#if __AVXVNNIINT8__`)

Thank you very much for your suggestion!

MeouSker77, Apr 22 '23 13:04