ppl.llm.kernel.cuda
Hello developers, I saw the news about the OpenPPL update: https://mp.weixin.qq.com/s/L35pj8fYakvYnL4LYu6nuw and would like to test the speed of flash decoding in this project. How can I reproduce the results from the article? Is there a test script? Also, what is the difference between the flash decoding in this project and the one in the flash-attention project?
Is there any plan to support BF16 inference? Our model encountered FP16 overflow after deployment.
Hi! I would like to benchmark decode attention performance during decoding on my llama-13B and Baichuan-13B models. Is there an example of the corresponding Python interface? Looking forward to your reply!
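For reference, below is a minimal sketch of the kind of timing loop I have in mind. It uses PyTorch's scaled_dot_product_attention as a placeholder for the decode attention kernel, and the shapes (40 heads, head dim 128, KV length 2048) only approximate a llama-13B configuration; none of this is the actual ppl.llm.kernel.cuda API.

```python
# Hypothetical timing harness -- NOT the actual ppl.llm.kernel.cuda API.
# PyTorch's scaled_dot_product_attention stands in for the decode
# attention kernel; the real binding would replace decode_attention().
import torch
import torch.nn.functional as F

batch, n_heads, head_dim, kv_len = 8, 40, 128, 2048  # llama-13B-like shapes (assumed)
device, dtype = "cuda", torch.float16

q = torch.randn(batch, n_heads, 1, head_dim, device=device, dtype=dtype)       # one new token per sequence
k = torch.randn(batch, n_heads, kv_len, head_dim, device=device, dtype=dtype)  # cached keys
v = torch.randn(batch, n_heads, kv_len, head_dim, device=device, dtype=dtype)  # cached values

def decode_attention(q, k, v):
    # Placeholder for a decode attention kernel call.
    return F.scaled_dot_product_attention(q, k, v)

# Warm up, then time with CUDA events.
for _ in range(10):
    decode_attention(q, k, v)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    decode_attention(q, k, v)
end.record()
torch.cuda.synchronize()
print(f"decode attention: {start.elapsed_time(end) / iters:.3f} ms/iter")
```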
Hi, it is awesome that the kernels support prefill and generation in the same round, and better performance can be expected from this. However, as most inference/serving frameworks are Python-based, the cpp-only...