
Feature Request: How to support model with dynamic inference graph

Open RunningLeon opened this issue 1 year ago • 1 comment

Prerequisites

  • [X] I am running the latest code. Mention the version if possible as well.
  • [X] I carefully followed the README.md.
  • [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [X] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

How to support a model whose inference graph differs between the prefilling and decoding stages.

Hi, thanks for your attention. I want to support a model in llama.cpp that uses a different compute graph for the prefilling and decoding stages. Does llama.cpp support a dynamic inference graph (e.g. skipping some layers during the prefill stage)? If so, how can this be done?


Motivation

Speed up prefilling by skipping some layers.

Possible Implementation

Add a flag `is_prefilling` to `llama_context`, analogous to the existing `is_encoding` flag?

RunningLeon avatar Sep 03 '24 11:09 RunningLeon

Same question. Any update?

exhyy avatar Sep 26 '24 05:09 exhyy

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Nov 12 '24 01:11 github-actions[bot]