Yufeng Li
> I will measure the performance with NeuralSpeed and llama.cpp. BTW, are you aware that llama.cpp uses AVX_VNNI for computation, which is equivalent to accuracy_level=COMP_INT8? The target machine doesn't...
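For context, a rough sketch of what computing at accuracy_level=COMP_INT8 (the level llama.cpp's AVX_VNNI path corresponds to) amounts to: the VNNI instruction multiplies unsigned 8-bit activations by signed 8-bit weights and accumulates into 32-bit integers. This is an illustrative scalar emulation, not the actual kernel:

```python
def qdot_int8(a_u8, b_s8):
    """Scalar emulation of a VNNI-style int8 dot product:
    unsigned 8-bit values times signed 8-bit values, accumulated
    in a 32-bit-wide integer. Illustrative only."""
    assert len(a_u8) == len(b_s8)
    acc = 0
    for a, b in zip(a_u8, b_s8):
        # enforce the u8 x s8 input ranges the instruction expects
        assert 0 <= a <= 255 and -128 <= b <= 127
        acc += a * b
    return acc
```

The accuracy implication is that both operands are rounded to 8 bits before the multiply, unlike an fp32 fallback path.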
[like] Yufeng Li reacted to your message. From: luoyu-intel, Tuesday, April 9, 2024. Subject: Re: [intel/neural-speed] Performance Gap between...
Since this won't be an issue for bit widths below 8, it should be fine. We mainly use blockwise quantization for bits lower than 8.
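To make the blockwise point concrete, here is a minimal symmetric blockwise quantization sketch: each block of weights gets its own scale, which is what keeps accuracy acceptable at low bit widths. Function names and the block size default are illustrative, not the library's API:

```python
def quantize_blockwise(weights, block_size=32, bits=4):
    """Sketch of symmetric blockwise quantization: one scale per
    block of `block_size` weights. Illustrative, not the actual
    neural-speed implementation."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit symmetric
    quantized, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        amax = max(abs(w) for w in block) or 1.0
        scale = amax / qmax
        scales.append(scale)
        quantized.append(
            [max(-qmax, min(qmax, round(w / scale))) for w in block]
        )
    return quantized, scales

def dequantize_blockwise(quantized, scales):
    """Reconstruct approximate weights from blocks and per-block scales."""
    return [q * s for block, s in zip(quantized, scales) for q in block]
```

Because the scale is per block rather than per tensor, one outlier weight only degrades precision within its own block.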
/azp run orttraining-ortmodule-distributed
/azp run Linux Android Emulator QNN CI Pipeline
> Can we merge this?

Thanks @ChipKerchner!
I think the QNN CI pipeline build failure is because it uses MSVC 14.36, which doesn't support the vcvtneeph2ps instruction yet. The other Windows CI pipelines use 14.40. @snnn, any ideas...
A big improvement from GenAI that is not mentioned above is that the past and present KV cache share the same buffer, i.e., it only needs to append the KV for...
> > A big improvement from GenAI that is not mentioned above is that the past and present KV cache share the same buffer, i.e., it only needs to append...
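To illustrate the shared-buffer idea above: instead of concatenating past and present KV into a freshly allocated tensor each step, the cache is preallocated once and each new token's K/V is written in place. This is a minimal sketch with illustrative names, not the actual GenAI API:

```python
class SharedKVCache:
    """Sketch of a KV cache where past and present share one
    preallocated buffer: each decode step appends the new token's
    K/V row in place, so no past+present copy is needed.
    Illustrative only."""

    def __init__(self, max_length, head_dim):
        # allocate the full buffer up front
        self.buffer = [[0.0] * head_dim for _ in range(max_length)]
        self.seq_len = 0

    def append(self, kv_row):
        # write the new token's K/V in place and advance the length
        self.buffer[self.seq_len] = list(kv_row)
        self.seq_len += 1

    def view(self):
        # past + present as a view over the same storage, no copy
        return self.buffer[:self.seq_len]
```

The win is that per-step cost becomes one row write rather than an allocation plus a copy of the entire past cache.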
> I tried it. Unfortunately it gives me an error if I disable it (is this expected?):
>
> ```
> "search": {
>     "diversity_penalty": 0.0,
>     "do_sample": true,
>     ...
> ```