Luo Yu

Results 24 comments of Luo Yu

@bil-ash Hi, AVX512F here refers to devices without AVX512_VNNI, and I haven't implemented u8s8 or s8s8 kernels for plain AVX512, so it's better to use fp32 for computation. AVX2 devices without AVX_VNNI have...
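The fallback rule described above can be sketched as follows. This is only an illustration: the flag names and the `pick_compute_dtype` helper are hypothetical, not NeuralSpeed's actual API.

```python
# Illustrative sketch of the dispatch rule described above: int8
# (u8s8/s8s8) kernels need a VNNI extension, so AVX512F-only or
# AVX2-only devices fall back to fp32 computation.
# All names here are hypothetical, not NeuralSpeed's real API.

def pick_compute_dtype(isa_flags: set[str]) -> str:
    """Return the compute dtype to use for the int4 matmul kernels."""
    if "avx512_vnni" in isa_flags or "avx_vnni" in isa_flags:
        return "int8"  # VNNI available: int8 kernels are implemented
    return "fp32"      # plain AVX512F / AVX2 device: no int8 kernels

print(pick_compute_dtype({"avx2", "avx_vnni"}))  # int8
print(pick_compute_dtype({"avx2", "avx512f"}))   # fp32
```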

> I understand this is an Intel repo, but curious: will AMD work as well, or... what kind of architecture / Intel chipset is best used with this repo? For...

Thanks! We're looking into this issue.

It's a warning from GCC 13, and the build's compiler flags treat warnings as errors.

Adding `--compile_no_warning_as_error` to your `build.sh` options should let the build ignore this warning.

Thanks for your report! What's the accuracy level of this model's MatMulNBits?

I will measure the performance with NeuralSpeed and llama.cpp. BTW, are you aware that llama.cpp uses AVX_VNNI for computation, which is equivalent to accuracy_level=COMP_INT8?
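For reference, the `accuracy_level` attribute of ONNX Runtime's MatMulNBits contrib op is a small integer; to the best of my knowledge the mapping is the one sketched below (worth double-checking against the contrib-op docs).

```python
from enum import IntEnum

class AccuracyLevel(IntEnum):
    """Sketch of MatMulNBits' accuracy_level values as I understand
    them; COMP_INT8 is the level that llama.cpp's AVX_VNNI int8 path
    corresponds to in the comparison above."""
    UNSET = 0      # let the backend choose
    COMP_FP32 = 1  # compute in fp32
    COMP_FP16 = 2  # compute in fp16
    COMP_BF16 = 3  # compute in bf16
    COMP_INT8 = 4  # int8 computation, e.g. via AVX_VNNI / AVX512_VNNI

print(AccuracyLevel.COMP_INT8.value)  # 4
```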

I've done some tests on a 12900K. The latency results show that NeuralSpeed (weight_dtype=int4, group_size=32, compute_dtype=int8) beats llama.cpp (phi-2.Q4_0.gguf). > The GenAI token generation throughput was measured at 13.699070483881153 transactions per second (tps),...