AutoAWQ
Inference Parallelism Issue
I'm serving a Qwen1.5 AWQ model loaded with AutoAWQForCausalLM.from_quantized(). Performance is great with a single client. However, when more than one client queries concurrently, their generated streams get mixed up and quickly collapse into identical gibberish. This doesn't happen when using transformers' AutoModelForCausalLM, with either the quantized or the non-quantized model.
How can I resolve this?
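A minimal, model-free sketch of the failure mode I suspect: two threads calling generate() on one object whose generation state (e.g. a KV cache) is shared and not isolated per request. This is illustrative only, not AutoAWQ code; the class and its cache are hypothetical stand-ins.

```python
import threading

class SharedCacheGenerator:
    """Toy stand-in for a model whose generation cache is shared
    mutable state with no per-request isolation (hypothetical;
    not the actual AutoAWQ implementation)."""

    def __init__(self):
        self.cache = []  # shared "KV cache" reused across all calls

    def generate(self, prompt, steps=5):
        self.cache.clear()          # every call resets the shared cache
        self.cache.append(prompt)
        out = []
        for _ in range(steps):
            # Each "token" depends on whatever is currently in the
            # shared cache, which a concurrent request may have
            # overwritten between iterations.
            out.append(self.cache[-1])
            self.cache.append(self.cache[-1])
        return out

if __name__ == "__main__":
    gen = SharedCacheGenerator()
    results = {}

    def client(name):
        results[name] = gen.generate(name)

    threads = [threading.Thread(target=client, args=(n,)) for n in ("A", "B")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # With two concurrent clients, results["A"] may contain "B" tokens
    # (and vice versa). Run the same calls sequentially and each stream
    # stays clean.
    print(results)
```

Sequentially, `SharedCacheGenerator().generate("A")` always returns `["A"] * 5`; only the concurrent calls can cross-contaminate, which matches the single-client-works / multi-client-garbles symptom.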
Hi @Zephyr69, I am happy to investigate this issue. Do you have an example that triggers the gibberish?
Otherwise, I would advise you to use vLLM to serve your model, as it provides a production-grade inference service with proper handling of concurrent requests. It also supports AWQ models.
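For reference, a sketch of launching vLLM's OpenAI-compatible server with AWQ quantization (the model ID here is an assumed example; substitute your own checkpoint):

```shell
# Hypothetical invocation: serve an AWQ-quantized Qwen1.5 checkpoint.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen1.5-7B-Chat-AWQ \
    --quantization awq
```

vLLM schedules concurrent requests through continuous batching, so each client's stream is kept separate.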