[Doc] Add `projects` section in README which is developed based on FasterTransformer
We noticed that several issues (#506, #729, #727) request FasterTransformer support for Llama and Llama-2. Our project LMDeploy, developed on top of FasterTransformer, already supports them as well as derived models such as Vicuna, Alpaca, Baichuan, and so on.
Meanwhile, LMDeploy provides a continuous-batching-like feature named persistent batch, which also addresses #696. It models the inference of a conversational LLM as a persistently running batch whose lifetime spans the entire serving process. To put it simply (a rough sketch follows the list below):
- The persistent batch has N pre-configured batch slots.
- Requests join the batch when there are free slots available. A batch slot is released and can be reused once the generation of the requested tokens is finished.
- On cache hits, history tokens don't need to be decoded in every round of a conversation; generation of response tokens starts instantly.
- The batch grows or shrinks automatically to minimize unnecessary computations.
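To make the slot mechanism concrete, here is a minimal Python sketch of the idea described above. The names (`PersistentBatch`, `Slot`, `decode_fn`, the `EOS` sentinel) are hypothetical and only illustrate the behavior, not LMDeploy's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

EOS = -1  # hypothetical end-of-sequence token id


@dataclass
class Slot:
    """One pre-configured slot of the persistent batch."""
    session_id: Optional[int] = None               # None means the slot is free
    cached_tokens: List[int] = field(default_factory=list)  # KV-cached history


class PersistentBatch:
    """Toy model of a persistent batch with N fixed slots."""

    def __init__(self, num_slots: int):
        self.slots = [Slot() for _ in range(num_slots)]

    def join(self, session_id: int, prompt_tokens: List[int]) -> bool:
        """A request joins the batch only when a free slot is available."""
        for slot in self.slots:
            if slot.session_id is None:
                slot.session_id = session_id
                slot.cached_tokens = list(prompt_tokens)
                return True
        return False  # no free slot: the request has to wait

    def step(self, decode_fn: Callable[[List[int]], int]) -> None:
        """Run one generation step over occupied slots only, so the
        effective batch grows and shrinks with the workload."""
        for slot in self.slots:
            if slot.session_id is None:
                continue
            # On a cache hit, the history is already in the KV cache,
            # so only the next token needs to be decoded.
            next_token = decode_fn(slot.cached_tokens)
            slot.cached_tokens.append(next_token)
            if next_token == EOS:
                self.release(slot)

    def release(self, slot: Slot) -> None:
        """Free the slot for reuse once generation has finished."""
        slot.session_id = None
        slot.cached_tokens = []
```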
We really appreciate the FasterTransformer team for developing such an efficient and high-throughput LLM inference engine.
@lvhan028 Cool! I see TurboMind now supports llama-2-70b with GQA. Are there any plans for LMDeploy to support Llama-2-7b and Llama-2-13b with GQA? Thank you!
@AnyangAngus GQA in LMDeploy/TurboMind doesn't distinguish between 7B, 13B, or 70B models.
But as far as I know, llama-2-7b/13b don't use GQA blocks.
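For reference, whether a checkpoint uses GQA is just a matter of its head configuration, not its parameter count. A tiny illustrative sketch (the helper name `kv_head_layout` is hypothetical; the head counts are from the published Llama-2 configs):

```python
def kv_head_layout(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Describe how many query heads share each key/value head."""
    assert num_attention_heads % num_key_value_heads == 0
    group_size = num_attention_heads // num_key_value_heads
    if group_size == 1:
        return "MHA (every query head has its own KV head)"
    return f"GQA (each KV head is shared by {group_size} query heads)"


print(kv_head_layout(32, 32))  # llama-2-7b  -> MHA
print(kv_head_layout(40, 40))  # llama-2-13b -> MHA
print(kv_head_layout(64, 8))   # llama-2-70b -> GQA, groups of 8
```

So the same attention kernel can cover all three sizes; for 7B/13B the "group" simply degenerates to one query head per KV head.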