OpenLLM

Inference Speed comparison

that-rahul-guy opened this issue · 1 comment

Hello everyone,

  • I am trying to serve TheBloke/Mistral-7B-Instruct-v0.1-GPTQ model.

  • I'm currently serving the model via Jina, and internally I run inference as follows:

    # text_tokenizer / text_model are the Hugging Face tokenizer and model
    # loaded elsewhere; gen_config is a dict of generation parameters
    input_ids = text_tokenizer.encode(prompt, return_tensors="pt").cuda()
    num_input_tokens = input_ids.shape[-1]
    
    # Generate output with custom configuration
    output_ids = text_model.generate(input_ids, **gen_config)
    
    # Decode only the generated part
    generated_ids = output_ids[:, num_input_tokens:]
    output = text_tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    

    Here we generate the output tokens manually and then decode them. For my test prompt, the performance is (see the timing sketch after this list):

    time taken: ~2.5 s, tokens/sec: ~6

  • Now, when serving it via OpenLLM with the following command: openllm start TheBloke/Mistral-7B-Instruct-v0.1-GPTQ --quantize gptq --backend pt, the performance for the same test prompt is:

    time taken: ~30 s, tokens/sec: ~0.5

  • Env:

    • OS: Ubuntu 20.04
    • GPU: Nvidia T4 (16GiB vRAM)
    • Cuda: 11.8
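
As a point of reference, here is a minimal sketch of how figures like the ones above might be measured, assuming text_tokenizer, text_model, gen_config, and prompt are the same objects as in the snippet above:

    import time

    input_ids = text_tokenizer.encode(prompt, return_tensors="pt").cuda()
    num_input_tokens = input_ids.shape[-1]

    start = time.perf_counter()
    output_ids = text_model.generate(input_ids, **gen_config)
    elapsed = time.perf_counter() - start

    # Count only the newly generated tokens when computing throughput
    num_new_tokens = output_ids.shape[-1] - num_input_tokens
    print(f"time taken: {elapsed:.1f}s, tokens/sec: {num_new_tokens / elapsed:.1f}")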

I want to understand why the difference is so big. Am I doing something wrong when serving with OpenLLM? Let me know your thoughts.

Thanks

that-rahul-guy · Dec 18 '23

GPTQ is now supported with vLLM in the latest OpenLLM version. You can test it with vLLM, as I haven't updated the PyTorch code path in a while.

You should see a significant improvement with vLLM.
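
For example, a possible invocation to try the vLLM backend, mirroring the command above (assuming the --backend vllm flag in the OpenLLM version from around this time; check openllm start --help on your install): openllm start TheBloke/Mistral-7B-Instruct-v0.1-GPTQ --quantize gptq --backend vllm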

aarnphm · Dec 18 '23