llama.cpp
Feature Request: How to support a model with a dynamic inference graph
Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the README.md.
- [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [X] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
How to support a model that has a different inference graph between the prefill and decoding stages.
Hi, thanks for your attention. I want to support a model in llama.cpp that uses a different compute graph during prefill than during decoding. Does llama.cpp support a dynamic inference graph (e.g. skipping some layers during the prefill stage)? If so, how can this be done?
Motivation
Speed up prefill by skipping some layers.
Possible Implementation
Add a flag `is_prefilling` to `llama_context`, analogous to the existing `is_encoding` flag, and branch on it when building the compute graph?
Same question. Any update?
This issue was closed because it has been inactive for 14 days since being marked as stale.