Adding support for CTranslate2 acceleration
Feature request
The feature would be to support accelerated inference with the CTranslate2 framework https://github.com/OpenNMT/CTranslate2
Motivation
Reasons to use CTranslate2:
Faster float16 generation
In my case it outperforms vLLM #478 by a factor of 180% and Transformers by 190%. This was measured for gpt_bigcode_starcoder with an input length of 10, batch_size of 2, and max_new_tokens=32.
Even faster int8 quantization
- int8 quantization in most cases outperforms float16 inference in speed, while the quality loss compared to float16 is minimal.
Wide range of model support
This could be offered as a new "best effort" option - https://opennmt.net/CTranslate2/guides/transformers.html (a conversion sketch follows this list).
Stability
Compared to other frameworks, CTranslate2 has cross-platform unit tests. It is also the backend of LibreTranslate: https://github.com/LibreTranslate/LibreTranslate
Streaming Support
https://opennmt.net/CTranslate2/generation.html?highlight=stream
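To make the conversion, int8 quantization, and streaming points above concrete, here is a minimal sketch (not part of the original issue; the output directory name and the token-streaming call are assumptions based on the CTranslate2 docs linked above):
# One-off conversion of a Transformers checkpoint to the CTranslate2 format with
# int8/float16 weights (the converter CLI ships with the ctranslate2 package):
#   ct2-transformers-converter --model bigcode/gpt_bigcode-santacoder \
#       --output_dir santacoder-ct2 --quantization int8_float16

import ctranslate2
import transformers

# load the converted, quantized model ("santacoder-ct2" is the assumed output_dir from above)
generator = ctranslate2.Generator("santacoder-ct2", device="cuda", compute_type="int8_float16")
tokenizer = transformers.AutoTokenizer.from_pretrained("bigcode/gpt_bigcode-santacoder")

# CTranslate2 generators consume string tokens rather than token ids
prompt = tokenizer.convert_ids_to_tokens(tokenizer.encode("def hello_world"))

# batched generation, returning only the newly generated tokens
results = generator.generate_batch([prompt], max_length=32, include_prompt_in_result=False)
print(tokenizer.decode(results[0].sequences_ids[0]))

# token-by-token streaming, per the generation guide linked above (assumes a recent CTranslate2 release)
for step in generator.generate_tokens(prompt, max_length=32):
    print(step.token, end="", flush=True)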
Your contribution
Here is an example script comparing the three frameworks, using ray[serve] as a replacement for the dynamic batching in TGI.
With the code below, my timings are:
Input: "def hello_world" -> 32 new tokens
Median time over 10 runs:
- Transformers : 287.6ms
- VLLM : 278.8ms
- CTranslate2 : 193.5ms
from fastapi import FastAPI
from ray import serve, get_gpu_ids
import time
from typing import List, Dict
from pydantic import BaseModel
from enum import Enum
app = FastAPI(docs_url="/docs")  # let's not do any authentication for now
class HFAPIRequest(BaseModel):
    inputs: str
    parameters: dict = {}
class Framework(Enum):
    ctranslate2 = 1
    transformers = 2
    vllm = 3
@serve.deployment
@serve.ingress(app)
class FastAPIWrapper:
    def __init__(self, model_handle) -> None:
        self.model_handle = model_handle
    
    @app.get("/v1/heartbeat")
    async def heartbeat(self) -> float:
        """check that the application is running"""
        return time.time()
        
    @app.post("/v1/complete")
    async def complete(self, data: HFAPIRequest) -> Dict[str,str]:
        """send json-data per http request. returns a completion"""
        try:
            d_dict=data.dict()
            # send the call, get ray_ref as receipt
            ray_ref = await self.model_handle.predict.remote(inputs=d_dict["inputs"])
            # wait until ray_ref finished
            completion = await ray_ref
            return {"generated_text": completion}
        except Exception as ex:
            return  {"error": str(ex)}
@serve.deployment(ray_actor_options={"num_gpus": 1})
class CodeCompletionModel:
    def __init__(self, model_name = "michaelfeil/ct2fast-gpt_bigcode-santacoder", framework: Framework = Framework.ctranslate2 ) -> None:
        self.framework = framework
        gpus = bool(get_gpu_ids())
        
        if self.framework == Framework.ctranslate2:
            # 1784MiB VRAM, 314.3ms for generation
            from hf_hub_ctranslate2 import GeneratorCT2fromHfHub
            
            self.model = GeneratorCT2fromHfHub(
                    # use a pre-quantized model here
                    model_name_or_path=model_name, 
                    device="cuda" if gpus else "cpu",
                    compute_type="int8_float16" if gpus else "int8",
            )
        elif self.framework == Framework.transformers:
            # 2918MiB VRAM, ~1146.8ms for generation
            from transformers import pipeline
            import torch
            self.model = pipeline(
                task="text-generation",
                model="bigcode/gpt_bigcode-santacoder", 
                device="cuda:0" if gpus else "cpu",
                torch_dtype=torch.float16
            )
        elif self.framework == Framework.vllm:
            from vllm import LLM, SamplingParams
            self.sampling_params = SamplingParams(max_tokens=32)
            self.model  = LLM(model="bigcode/gpt_bigcode-santacoder")
        else:
            raise ValueError(
                f"{self.framework} is not a supported framework."
                f"Please use one of the following: {list(Framework)}"
            )
    
    
    @serve.batch(max_batch_size=4)
    async def predict(self, inputs: List[str]) -> List[str]:
        if self.framework == Framework.ctranslate2:
            return self.model.generate(
                text=inputs,
                max_length=32, 
                include_prompt_in_result=False
            )
        elif self.framework == Framework.transformers:
            # return only the completion, to match the CTranslate2 and vLLM branches
            out = self.model(inputs, max_new_tokens=32, return_full_text=False)
            return [o[0]["generated_text"] for o in out]
        elif self.framework == Framework.vllm:
            outputs = self.model.generate(inputs, self.sampling_params)
            return [o.outputs[0].text for o in outputs]
    
deploy_handle = FastAPIWrapper.bind(model_handle=CodeCompletionModel.bind(framework=Framework.ctranslate2))
if __name__ == "__main__":
    serve.run(deploy_handle, port=5000)
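For reference, once this is deployed with serve.run, the endpoint can be exercised as follows (a sketch, not from the original issue; host and port assume the defaults configured above):
import requests

# assumes the Ray Serve HTTP proxy is listening on 127.0.0.1:5000, as set via serve.run above
resp = requests.post(
    "http://127.0.0.1:5000/v1/complete",
    json={"inputs": "def hello_world", "parameters": {}},
)
print(resp.json())  # e.g. {"generated_text": "..."}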
The batch size 2 seems very small.
Feel free to try your own example.
if __name__ == "__main__":
    import timeit
    # note: this benchmark assumes a plain (non-@serve.deployment) CodeCompletionModel,
    # where predict_batch is the synchronous equivalent of predict without @serve.batch
    for framework in Framework:
        start = timeit.default_timer()
        model = CodeCompletionModel(framework=framework)
        end = timeit.default_timer()
        model.predict_batch(["warmup"])  # warm up the model and CUDA kernels
        inp = [("def " * i) for i in range(1, 16)]  # batch of 15 prompts with increasing token counts
        t = timeit.timeit("model.predict_batch(inp)", globals=locals(), number=10) / 10
        print("framework: ", framework, " mean time per batch: ", t, "s, model load time was ", end - start, "s")
Example results with batch size 64 and increasing prompt sizes:
framework:  Framework.vllm  mean time per batch:  0.9124439172912389 s, model load time was  6.033201194833964 s
framework:  Framework.ctranslate2  mean time per batch:  0.5082830318249763 s, model load time was  7.9577884785830975 s
@michaelfeil Hi, I was working on a CTranslate2 version of WizardCoder and the latency with max_new_tokens=128 is a lot higher (about 4s) compared to yours. May I ask what machine you are using, or did you modify anything to make it faster?
In my experience, the mean time of your example should grow roughly 4x to about 2s with max_new_tokens=128, which is still 2x faster than my results.
@Yung-Kai The above uses santacoder (1.1B params) with 32 tokens. What's the relative speed difference across all frameworks?
@michaelfeil Thanks for your reply. I got around 16s with HF Transformers and 5s with 4-bit quantization via AutoGPTQ. I haven't tried vLLM due to OOM in my case when I enlarge max_num_seq. I guess starcoder should take longer than santacoder, but I'm not sure how much longer it would be.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.