AutoAWQ
Inference Parallelism Issue
I'm serving a Qwen1.5 AWQ model loaded with AutoAWQForCausalLM.from_quantized(). Performance is great with a single client. However, when more than one client queries concurrently, their generated streams get mixed up and quickly collapse into identical gibberish. This doesn't happen when using transformers' AutoModelForCausalLM, with either the quantized or the non-quantized model.
How can I resolve this?
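A minimal, model-free sketch of the failure mode I suspect: two threads calling generate() on one object whose generation state (e.g. a KV cache) is shared and not isolated per request. This is illustrative only, not AutoAWQ code; the class and its cache are hypothetical stand-ins.

```python
import threading

class SharedCacheGenerator:
    """Toy stand-in for a model whose generation cache is shared
    mutable state with no per-request isolation (hypothetical;
    not the actual AutoAWQ implementation)."""

    def __init__(self):
        self.cache = []  # shared "KV cache" reused across all calls

    def generate(self, prompt, steps=5):
        self.cache.clear()          # every call resets the shared cache
        self.cache.append(prompt)
        out = []
        for _ in range(steps):
            # Each "token" depends on whatever is currently in the
            # shared cache, which a concurrent request may have
            # overwritten between iterations.
            out.append(self.cache[-1])
            self.cache.append(self.cache[-1])
        return out

if __name__ == "__main__":
    gen = SharedCacheGenerator()
    results = {}

    def client(name):
        results[name] = gen.generate(name)

    threads = [threading.Thread(target=client, args=(n,)) for n in ("A", "B")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # With two concurrent clients, results["A"] may contain "B" tokens
    # (and vice versa). Run the same calls sequentially and each stream
    # stays clean.
    print(results)
```

Sequentially, `SharedCacheGenerator().generate("A")` always returns `["A"] * 5`; only the concurrent calls can cross-contaminate, which matches the single-client-works / multi-client-garbles symptom.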
Hi @Zephyr69, I am happy to investigate this issue. Do you have an example that triggers the gibberish?
Otherwise, I would advise you to use vLLM to serve your model, as it provides a production-grade inference service with proper handling of concurrent requests. It also supports AWQ models.
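For reference, a sketch of launching vLLM's OpenAI-compatible server with AWQ quantization (the model ID here is an assumed example; substitute your own checkpoint):

```shell
# Hypothetical invocation: serve an AWQ-quantized Qwen1.5 checkpoint.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen1.5-7B-Chat-AWQ \
    --quantization awq
```

vLLM schedules concurrent requests through continuous batching, so each client's stream is kept separate.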