Aaron Pham
This probably has to do with the base image also including vLLM and all of its dependencies.
Hi there, the vllm backend is not yet supported with adapters.
How many GPUs do you have?
Can you also send the whole stack trace from the server?
So did the model start up correctly? I was able to run llama-2-7b on a single T4.
These are subsequent requests, right?
Yes, this is currently a bug that has also been reported elsewhere; I'm taking a look at the moment.
By the way, you can change `max_new_tokens` per request; I'm going to rework the environment-variable handling soon.
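In case it helps, here is a minimal sketch of a per-request override, assuming an OpenLLM server on `localhost:3000` whose `/v1/generate` endpoint accepts an `llm_config` block (the exact payload shape can differ between versions, so check your server's `/docs` page):

```python
# Minimal sketch: override max_new_tokens for a single request instead of
# relying on an environment variable. Assumes the server accepts an
# `llm_config` override on /v1/generate -- verify against your version's docs.
import requests

resp = requests.post(
    "http://localhost:3000/v1/generate",
    json={
        "prompt": "Explain LoRA adapters in one sentence.",
        "llm_config": {"max_new_tokens": 64},  # per-request override
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```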
You can try `--quantize gptq` for now. I just have a lot of priorities at the moment.
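For example, on a GPTQ-quantized checkpoint, a start command along the lines of `openllm start llama --model-id <gptq-model-id> --quantize gptq` should pick up the quantized weights (the model id is a placeholder and flag names may vary between OpenLLM versions, so treat this as a sketch).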
@gbmarc1 can you try again with 0.3.5?