How is PowerInfer optimized compared with llama.cpp? Most of the code I looked at overlaps with it.

This may come across as a bit rude, but the question is as stated in the title.

Dec 25 '23 02:12 hariji814

Thank you for your interest in our project. Our work is built on llama.cpp, which is why the code overlaps. We are immensely grateful for the excellent and easily modifiable code structure that llama.cpp provides. Building on it, we changed the model loading method to achieve a fine-grained, neuron-level split, and we adjusted the corresponding operators, including adding the related sparse operators. We also enhanced the parallel processing capabilities of both CPU and GPU operators. Overall, we do not see starting from scratch as a favorable option; llama.cpp has already laid a solid code foundation for us.

Dec 25 '23 02:12 jeremyyx

Which version of llama.cpp is PowerInfer based on? Has the original external interface been modified? Is it compatible with llama-cpp-python?

Dec 25 '23 03:12 1562668477

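Which version of llama.cpp is PowerInfer based on? Has the original external interface been modified? Is it compatible with llama-cpp-python?

It is not compatible at all; it requires ReLU-fied models.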

Dec 25 '23 14:12 sorasoras

It should be compatible at the interface level; a difference in models does not affect the interface.

Dec 26 '23 02:12 1562668477

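I have doubts about the inference speed reported for the 4090. It cannot be below 10 t/s (note that this is the speed of CPU inference). The llama.cpp version you branched from only supported the CPU; the latest llama.cpp has already addressed this, so you could pull it and compare the speeds again.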

Dec 26 '23 03:12 hariji814

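I have doubts about the inference speed reported for the 4090. It cannot be below 10 t/s (note that this is the speed of CPU inference). The llama.cpp version you branched from only supported the CPU; the latest llama.cpp has already addressed this, so you could pull it and compare the speeds again.

We suggest reproducing the relevant experiments as described in our paper and comparing the performance of PowerInfer and llama.cpp on Falcon. If you find any problems, you are welcome to bring your data and discuss them with us. Thank you.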

Dec 26 '23 12:12 ZeyuMi

Which version of llama.cpp is PowerInfer based on? Has the original external interface been modified? Is it compatible with llama-cpp-python?

PowerInfer was forked from llama.cpp at commit 6bb4908. Since then, llama.cpp has been continuously updated, with numerous changes to its external interfaces. Consequently, at the ABI level, PowerInfer is not compatible with the latest version of llama.cpp, nor with the mainline version of llama-cpp-python.

I have attempted to make PowerInfer's ABI compatible with an earlier version of llama-cpp-python and created this fork. It enables normal model loading and inference, and I have used this as a basis to build PowerInfer's Gradio server. You are welcome to try out this library, but it is not recommended for use in any production environment. For more discussion, please see #64.

If you only need application-level interface compatibility, consider using examples/server to encapsulate the differences in internal implementation through an API server.
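As a rough illustration of that application-level route, here is a minimal client sketch. It assumes the server built from examples/server keeps the HTTP interface of the upstream llama.cpp server it was forked from: a `/completion` endpoint that accepts a JSON body with `prompt` and `n_predict` and returns the generated text under `content`. The helper names `build_completion_request` and `complete`, and the default address, are hypothetical; check the server's README before relying on any of this.

```python
import json
import urllib.request


def build_completion_request(prompt, n_predict=64, base_url="http://127.0.0.1:8080"):
    """Build a POST request for the server's /completion endpoint (assumed path)."""
    # Assumption: the server accepts "prompt" and "n_predict" in a JSON body.
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, n_predict=64, base_url="http://127.0.0.1:8080", timeout=60):
    """Send the prompt to a running server and return the generated text."""
    request = build_completion_request(prompt, n_predict, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Assumption: the response JSON carries the generated text under "content".
    return body["content"]
```

With a server already running locally, `print(complete("Once upon a time", n_predict=32))` would print the continuation. Because the client only speaks HTTP, it does not depend on the ABI of the underlying library.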

Dec 26 '23 18:12 hodlen

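Which version of llama.cpp is PowerInfer based on? Has the original external interface been modified? Is it compatible with llama-cpp-python?

It is not compatible at all; it requires ReLU-fied models.

PowerInfer exploits the high locality of parameter activation in the two Linear layers of the MLP. That is quite innovative, and the performance is great!

However, PowerInfer currently requires the activation function in the model's MLP to be ReLU. The original LLaMA model uses SwiGLU, so PowerInfer does not yet support the original LLaMA model, and the SwiGLU in the model has to be replaced with ReLU. Is my understanding correct? Also, after simply swapping the activation function, without retraining or fine-tuning, how accurate is the model's inference? Why does PowerInfer need to restrict the activation to ReLU? For other activation functions, has the same high locality of parameter activation in the MLP been observed?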
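To make the background of the ReLU question concrete, here is a toy sketch of one general property: ReLU outputs exact zeros, so the rows of the second Linear layer that belong to inactive neurons can be skipped without changing the result, whereas SiLU (the gate activation inside SwiGLU) only produces small non-zero values. This is not PowerInfer's code; the sizes, the random weights, the single-activation simplification of SwiGLU, and the helper `down_projection` are all assumptions made for illustration. It says nothing about accuracy after swapping activations.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def silu(x):
    # SiLU (swish) is the gate activation used inside SwiGLU.
    return x / (1.0 + math.exp(-x))


def down_projection(hidden, w_down, skip_zeros):
    """Compute out[j] = sum_i hidden[i] * w_down[i][j], optionally skipping zero neurons."""
    out = [0.0] * len(w_down[0])
    rows_used = 0
    for h, row in zip(hidden, w_down):
        if skip_zeros and h == 0.0:
            continue  # an exactly-zero neuron contributes nothing to the output
        rows_used += 1
        for j, w in enumerate(row):
            out[j] += h * w
    return out, rows_used


rng = random.Random(0)
n_neurons, d_out = 256, 8  # toy sizes, chosen arbitrarily
pre_act = [rng.gauss(0.0, 1.0) for _ in range(n_neurons)]
w_down = [[rng.gauss(0.0, 1.0) for _ in range(d_out)] for _ in range(n_neurons)]

relu_hidden = [relu(x) for x in pre_act]
silu_hidden = [silu(x) for x in pre_act]

dense_out, dense_rows = down_projection(relu_hidden, w_down, skip_zeros=False)
sparse_out, sparse_rows = down_projection(relu_hidden, w_down, skip_zeros=True)
_, silu_rows = down_projection(silu_hidden, w_down, skip_zeros=True)

print("ReLU rows touched:", sparse_rows, "of", dense_rows)
print("SiLU rows touched:", silu_rows, "of", n_neurons)
```

In this toy setup the ReLU path gives the same output while reading fewer weight rows, and the SiLU path finds no neuron that is exactly zero, so skipping there would require a threshold and would change the output.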

Apr 12 '24 04:04 shifang99