Yufeng Li
> I will measure the performance with NeuralSpeed and llama.cpp. BTW, are you aware that llama.cpp uses AVX_VNNI for computation, which is equivalent to accuracy_level=COMP_INT8? The target machine doesn't...
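For context, a rough sketch of what computing at accuracy_level=COMP_INT8 (the level llama.cpp's AVX_VNNI path corresponds to) amounts to: the VNNI instruction multiplies unsigned 8-bit activations by signed 8-bit weights and accumulates into 32-bit integers. This is an illustrative scalar emulation, not the actual kernel:

```python
def qdot_int8(a_u8, b_s8):
    """Scalar emulation of a VNNI-style int8 dot product:
    unsigned 8-bit values times signed 8-bit values, accumulated
    in a 32-bit-wide integer. Illustrative only."""
    assert len(a_u8) == len(b_s8)
    acc = 0
    for a, b in zip(a_u8, b_s8):
        # enforce the u8 x s8 input ranges the instruction expects
        assert 0 <= a <= 255 and -128 <= b <= 127
        acc += a * b
    return acc
```

The accuracy implication is that both operands are rounded to 8 bits before the multiply, unlike an fp32 fallback path.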
[like] Yufeng Li reacted to your message. From: luoyu-intel, Tuesday, April 9, 2024. Subject: Re: [intel/neural-speed] Performance Gap between...
Since this won't be an issue for bit widths below 8, it should be fine. We mainly use blockwise quantization for bits lower than 8.
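To make the blockwise point concrete, here is a minimal symmetric blockwise quantization sketch: each block of weights gets its own scale, which is what keeps accuracy acceptable at low bit widths. Function names and the block size default are illustrative, not the library's API:

```python
def quantize_blockwise(weights, block_size=32, bits=4):
    """Sketch of symmetric blockwise quantization: one scale per
    block of `block_size` weights. Illustrative, not the actual
    neural-speed implementation."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit symmetric
    quantized, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        amax = max(abs(w) for w in block) or 1.0
        scale = amax / qmax
        scales.append(scale)
        quantized.append(
            [max(-qmax, min(qmax, round(w / scale))) for w in block]
        )
    return quantized, scales

def dequantize_blockwise(quantized, scales):
    """Reconstruct approximate weights from blocks and per-block scales."""
    return [q * s for block, s in zip(quantized, scales) for q in block]
```

Because the scale is per block rather than per tensor, one outlier weight only degrades precision within its own block.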
/azp run orttraining-ortmodule-distributed
/azp run Linux Android Emulator QNN CI Pipeline
> Can we merge this?

Thanks @ChipKerchner!
I think the QNN CI pipeline build failure is because it uses MSVC 14.36, which doesn't support the vcvtneeph2ps instruction yet. The other Windows CI pipelines use 14.40. @snnn, any ideas...
A big improvement from GenAI that is not mentioned above is that the past and present KV cache share the same buffer, i.e., it only needs to append the KV for...
> > A big improvement from GenAI that is not mentioned above is that the past and present KV cache share the same buffer, i.e., it only needs to append...
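To illustrate the shared-buffer idea above: instead of concatenating past and present KV into a freshly allocated tensor each step, the cache is preallocated once and each new token's K/V is written in place. This is a minimal sketch with illustrative names, not the actual GenAI API:

```python
class SharedKVCache:
    """Sketch of a KV cache where past and present share one
    preallocated buffer: each decode step appends the new token's
    K/V row in place, so no past+present copy is needed.
    Illustrative only."""

    def __init__(self, max_length, head_dim):
        # allocate the full buffer up front
        self.buffer = [[0.0] * head_dim for _ in range(max_length)]
        self.seq_len = 0

    def append(self, kv_row):
        # write the new token's K/V in place and advance the length
        self.buffer[self.seq_len] = list(kv_row)
        self.seq_len += 1

    def view(self):
        # past + present as a view over the same storage, no copy
        return self.buffer[:self.seq_len]
```

The win is that per-step cost becomes one row write rather than an allocation plus a copy of the entire past cache.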
> I tried it. Unfortunately it gives me an error if I disable it (is this expected?):
>
> ```
> "search": {
>     "diversity_penalty": 0.0,
>     "do_sample": true,
>     ...
> ```