Aaron Pham

Results 429 comments of Aaron Pham

This is probably because the base image also includes vllm and all of its dependencies.

Can you also send the whole stack trace from the server?

So did the model start up correctly? I was able to run llama-2-7b with 1 T4.

These are subsequent requests, right?

Yes, this is currently a bug that has also been reported elsewhere; I'm taking a look atm.

btw, you can set `max_new_tokens` per request. I'm going to update the environment variable handling soon.
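A minimal sketch of what a per-request `max_new_tokens` override could look like when calling an OpenLLM server over HTTP. The endpoint path and the `llm_config` field name are assumptions for illustration, not confirmed by this thread:

```python
import json

# Hypothetical request body for an OpenLLM generate endpoint.
# The "llm_config" field name is an assumption; check your server's
# API docs for the exact schema.
payload = {
    "prompt": "What is quantization?",
    "llm_config": {"max_new_tokens": 64},  # per-request override
}

# Serialize the body you would POST to the server (e.g. /v1/generate).
body = json.dumps(payload)
print(body)
```

The point is that the limit travels with each request body, so different callers can use different values without restarting the server or touching environment variables.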

You can try `--quantize gptq` for now. I just have a lot of other priorities atm.

@gbmarc1 can you try again with 0.3.5?