Inference Speed comparison
Hello everyone,
- I am trying to serve the TheBloke/Mistral-7B-Instruct-v0.1-GPTQ model.
- I'm currently serving the model via jina, and internally I make the predictions as follows:

  ```python
  input_ids = text_tokenizer.encode(prompt, return_tensors="pt").cuda()

  # Generate output with custom configuration
  output_ids = text_model.generate(input_ids, **gen_config)
  generated_ids = output_ids[:, num_input_tokens:]

  # Decode only the generated part
  output = text_tokenizer.decode(generated_ids[0], skip_special_tokens=True)
  ```
  Here we are generating the output tokens manually and then decoding them. For my test prompt, the performance is (see the timing sketch after this list):

  time taken: ~2.5 sec, tokens/sec: ~6
- Now, while serving it via OpenLLM using the following command:

  ```
  openllm start TheBloke/Mistral-7B-Instruct-v0.1-GPTQ --quantize gptq --backend pt
  ```

  For the same test prompt, the performance is:

  time taken: ~30 sec, tokens/sec: ~0.5
- Env:
  - OS: Ubuntu 20.04
  - GPU: NVIDIA T4 (16 GiB vRAM)
  - CUDA: 11.8
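Below is a minimal sketch of one way such numbers can be measured. It reuses the names from the snippet above (`text_model`, `text_tokenizer`, `prompt`, `gen_config`) and is only an illustration, not the exact benchmarking code behind the figures reported here:

```python
import time
import torch

# Illustrative timing sketch (an assumption, not the original benchmark):
# measure wall-clock time of a single generate() call and derive tokens/sec
# from the number of newly generated tokens.
input_ids = text_tokenizer.encode(prompt, return_tensors="pt").cuda()
num_input_tokens = input_ids.shape[1]

torch.cuda.synchronize()
start = time.perf_counter()
output_ids = text_model.generate(input_ids, **gen_config)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

num_new_tokens = output_ids.shape[1] - num_input_tokens
print(f"time taken: {elapsed:.2f}s, tokens/sec: {num_new_tokens / elapsed:.2f}")
```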
I want to discuss why the difference is so big. Am I doing something wrong while serving with OpenLLM? Let me know your thoughts.
Thanks
GPTQ is now supported with vLLM and the latest OpenLLM version. You can test it with vLLM, as I haven't updated the PyTorch code path for a while now.

You should see a significant improvement with vLLM.
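A command along these lines should work (assuming the same model ID and flags as above, with only the backend switched to vLLM):

```
openllm start TheBloke/Mistral-7B-Instruct-v0.1-GPTQ --quantize gptq --backend vllm
```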