Luo Yu

Results 24 comments of Luo Yu

@bil-ash Hi, AVX512F here refers to devices without AVX512_VNNI, and I haven't implemented u8s8 or s8s8 kernels for plain AVX512, so it's better to use fp32 for computation. AVX2 devices without AVX_VNNI have...
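The fallback rule described above can be sketched as follows. This is only an illustration: the flag names and the `pick_compute_dtype` helper are hypothetical, not NeuralSpeed's actual API.

```python
# Illustrative sketch of the dispatch rule described above: int8
# (u8s8/s8s8) kernels need a VNNI extension, so AVX512F-only or
# AVX2-only devices fall back to fp32 computation.
# All names here are hypothetical, not NeuralSpeed's real API.

def pick_compute_dtype(isa_flags: set[str]) -> str:
    """Return the compute dtype to use for the int4 matmul kernels."""
    if "avx512_vnni" in isa_flags or "avx_vnni" in isa_flags:
        return "int8"  # VNNI available: int8 kernels are implemented
    return "fp32"      # plain AVX512F / AVX2 device: no int8 kernels

print(pick_compute_dtype({"avx2", "avx_vnni"}))  # int8
print(pick_compute_dtype({"avx2", "avx512f"}))   # fp32
```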

> I understand this is an Intel repo, but curious: will AMD work as well, or... what kind of architecture / Intel chipset is best used with this repo? For...

Thanks! We're looking into this issue.

It's a warning from GCC 13, and the build's compiler flags treat warnings as errors.

Adding `--compile_no_warning_as_error` to your `build.sh` options should let the build ignore this warning.

Thanks for your report! What's the accuracy level of this model's MatMulNBits?

I will measure the performance with NeuralSpeed and llama.cpp. BTW, are you aware that llama.cpp uses AVX_VNNI for computation, which is equivalent to accuracy_level=COMP_INT8?
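For reference, the `accuracy_level` attribute of ONNX Runtime's MatMulNBits contrib op is a small integer; to the best of my knowledge the mapping is the one sketched below (worth double-checking against the contrib-op docs).

```python
from enum import IntEnum

class AccuracyLevel(IntEnum):
    """Sketch of MatMulNBits' accuracy_level values as I understand
    them; COMP_INT8 is the level that llama.cpp's AVX_VNNI int8 path
    corresponds to in the comparison above."""
    UNSET = 0      # let the backend choose
    COMP_FP32 = 1  # compute in fp32
    COMP_FP16 = 2  # compute in fp16
    COMP_BF16 = 3  # compute in bf16
    COMP_INT8 = 4  # int8 computation, e.g. via AVX_VNNI / AVX512_VNNI

print(AccuracyLevel.COMP_INT8.value)  # 4
```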

I've done some tests on a 12900K. The latency results show that NeuralSpeed (weight_dtype=int4, group_size=32, compute_dtype=int8) beats llama.cpp (phi-2.Q4_0.gguf). > The GenAI token generation throughput was measured at 13.699070483881153 transactions per second (tps),...