Inference Speed comparison
Hello everyone,
- I am trying to serve the TheBloke/Mistral-7B-Instruct-v0.1-GPTQ model.
- I'm currently serving the model via jina, and internally I make the predictions as follows:

  ```python
  input_ids = text_tokenizer.encode(prompt, return_tensors="pt").cuda()

  # Generate output with custom configuration
  output_ids = text_model.generate(input_ids, **gen_config)
  generated_ids = output_ids[:, num_input_tokens:]

  # Decode only the generated part
  output = text_tokenizer.decode(generated_ids[0], skip_special_tokens=True)
  ```
  Here we are generating the output tokens manually and then decoding them. For my test prompt, the performance is (see the timing sketch after this list):

  time taken: ~2.5 sec, tokens/sec: ~6
- Now, while serving it via OpenLLM using the following command:

  ```
  openllm start TheBloke/Mistral-7B-Instruct-v0.1-GPTQ --quantize gptq --backend pt
  ```

  For the same test prompt, the performance is:

  time taken: ~30 sec, tokens/sec: ~0.5
- Env:
  - OS: Ubuntu 20.04
  - GPU: NVIDIA T4 (16 GiB vRAM)
  - CUDA: 11.8
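Below is a minimal sketch of one way such numbers can be measured. It reuses the names from the snippet above (`text_model`, `text_tokenizer`, `prompt`, `gen_config`) and is only an illustration, not the exact benchmarking code behind the figures reported here:

```python
import time
import torch

# Illustrative timing sketch (an assumption, not the original benchmark):
# measure wall-clock time of a single generate() call and derive tokens/sec
# from the number of newly generated tokens.
input_ids = text_tokenizer.encode(prompt, return_tensors="pt").cuda()
num_input_tokens = input_ids.shape[1]

torch.cuda.synchronize()
start = time.perf_counter()
output_ids = text_model.generate(input_ids, **gen_config)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

num_new_tokens = output_ids.shape[1] - num_input_tokens
print(f"time taken: {elapsed:.2f}s, tokens/sec: {num_new_tokens / elapsed:.2f}")
```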
I want to discuss why the difference is so big. Am I doing something wrong while serving with OpenLLM? Let me know your thoughts.
Thanks
GPTQ is now supported with vLLM and the latest OpenLLM version. You can test it with vLLM, as I haven't updated the PyTorch code path for a while now.

You should see a significant improvement with vLLM.
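A command along these lines should work (assuming the same model ID and flags as above, with only the backend switched to vLLM):

```
openllm start TheBloke/Mistral-7B-Instruct-v0.1-GPTQ --quantize gptq --backend vllm
```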