[Doc] Add `projects` section in README which is developed based on FasterTransformer
We noticed that several issues (#506, #729, #727) request FasterTransformer support for Llama and Llama-2. Our project LMDeploy, developed on top of FasterTransformer, already supports them as well as derived models such as Vicuna, Alpaca, Baichuan, and so on.
Meanwhile, LMDeploy provides a continuous-batching-like feature named persistent batch, which also addresses #696. It models the inference of a conversational LLM as a persistently running batch whose lifetime spans the entire serving process. To put it simply (a rough sketch follows the list below):
- The persistent batch has N pre-configured batch slots.
- Requests join the batch when there are free slots available. A batch slot is released and can be reused once the generation of the requested tokens is finished.
- On cache hits, history tokens don't need to be decoded in every round of a conversation; generation of response tokens starts instantly.
- The batch grows or shrinks automatically to minimize unnecessary computations.
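To make the slot mechanism concrete, here is a minimal Python sketch of the idea described above. The names (`PersistentBatch`, `Slot`, `decode_fn`, the `EOS` sentinel) are hypothetical and only illustrate the behavior, not LMDeploy's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

EOS = -1  # hypothetical end-of-sequence token id


@dataclass
class Slot:
    """One pre-configured slot of the persistent batch."""
    session_id: Optional[int] = None               # None means the slot is free
    cached_tokens: List[int] = field(default_factory=list)  # KV-cached history


class PersistentBatch:
    """Toy model of a persistent batch with N fixed slots."""

    def __init__(self, num_slots: int):
        self.slots = [Slot() for _ in range(num_slots)]

    def join(self, session_id: int, prompt_tokens: List[int]) -> bool:
        """A request joins the batch only when a free slot is available."""
        for slot in self.slots:
            if slot.session_id is None:
                slot.session_id = session_id
                slot.cached_tokens = list(prompt_tokens)
                return True
        return False  # no free slot: the request has to wait

    def step(self, decode_fn: Callable[[List[int]], int]) -> None:
        """Run one generation step over occupied slots only, so the
        effective batch grows and shrinks with the workload."""
        for slot in self.slots:
            if slot.session_id is None:
                continue
            # On a cache hit, the history is already in the KV cache,
            # so only the next token needs to be decoded.
            next_token = decode_fn(slot.cached_tokens)
            slot.cached_tokens.append(next_token)
            if next_token == EOS:
                self.release(slot)

    def release(self, slot: Slot) -> None:
        """Free the slot for reuse once generation has finished."""
        slot.session_id = None
        slot.cached_tokens = []
```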
We really appreciate the FasterTransformer team for developing such an efficient and high-throughput LLM inference engine.
@lvhan028 Cool! I see TurboMind now supports llama-2-70b with GQA. Are there any plans for LMDeploy to support Llama-2-7b and Llama-2-13b with GQA? Thank you!
@AnyangAngus GQA in LMDeploy/TurboMind doesn't distinguish between 7B, 13B, or 70B models.
But as far as I know, llama-2-7b/13b don't use GQA blocks.
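For reference, whether a checkpoint uses GQA is just a matter of its head configuration, not its parameter count. A tiny illustrative sketch (the helper name `kv_head_layout` is hypothetical; the head counts are from the published Llama-2 configs):

```python
def kv_head_layout(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Describe how many query heads share each key/value head."""
    assert num_attention_heads % num_key_value_heads == 0
    group_size = num_attention_heads // num_key_value_heads
    if group_size == 1:
        return "MHA (every query head has its own KV head)"
    return f"GQA (each KV head is shared by {group_size} query heads)"


print(kv_head_layout(32, 32))  # llama-2-7b  -> MHA
print(kv_head_layout(40, 40))  # llama-2-13b -> MHA
print(kv_head_layout(64, 8))   # llama-2-70b -> GQA, groups of 8
```

So the same attention kernel can cover all three sizes; for 7B/13B the "group" simply degenerates to one query head per KV head.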