Simon Mo
Can we go further and reduce the templating to purely a JSON schema? I believe it is possible by framing it as { "tool_choice": one of the tool names, "tool_params": constrained...
cc @njhill
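A minimal sketch of what that framing could look like, assuming hypothetical tool names (`get_weather`, `search`) and using JSON Schema's `oneOf`/`const` to pin `tool_choice` to a tool name and constrain `tool_params` to that tool's parameter schema:

```python
# Hypothetical sketch, not vLLM's actual implementation: collapse
# tool-calling templates into one JSON schema for guided decoding.
# The tool names and parameter schemas below are made up.
tools = {
    "get_weather": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
    "search": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def tool_call_schema(tools):
    # One branch per tool: tool_choice is fixed to that tool's name,
    # and tool_params must match that tool's parameter schema.
    return {
        "type": "object",
        "oneOf": [
            {
                "properties": {
                    "tool_choice": {"const": name},
                    "tool_params": params,
                },
                "required": ["tool_choice", "tool_params"],
            }
            for name, params in tools.items()
        ],
    }

schema = tool_call_schema(tools)
```

The whole schema can then be handed to a guided-decoding backend, with no per-tool template left.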
I think the original idea is that the OpenAI-style API also has a `stream` flag that changes the behavior of the output
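A toy sketch (not vLLM's actual server code) of how a `stream` flag changes the response shape, from one assembled body to a sequence of per-token chunks:

```python
# Toy illustration only: a server handler might branch on `stream` like this.
def complete(prompt, stream=False):
    tokens = ["Hello", ",", " world"]  # stand-in for model output
    if stream:
        # Streaming: yield one chunk per token, like SSE "data:" events.
        return (t for t in tokens)
    # Non-streaming: return the fully assembled completion in one body.
    return "".join(tokens)

full = complete("hi")                       # single string
chunks = list(complete("hi", stream=True))  # list of chunks
```

The point is that the same endpoint produces two different output contracts, so the client has to know which mode it asked for.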
Which hardware are you using? It looks like after processing the prompt, there's very little free space left for computing the generation tokens (see `# GPU blocks: 37`). Maybe consider...
Yeah, it does look like two T4s give you 32G of GPU memory. The 13B model takes about 26G in parameters, which leaves very little for the KV cache. Maybe use just...
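The back-of-envelope arithmetic behind those numbers (assuming fp16 weights at 2 bytes per parameter and 16 GB per T4; activations and CUDA overhead would eat into the remainder further):

```python
# Rough memory budget, not a vLLM calculation.
params = 13e9          # 13B parameters
bytes_per_param = 2    # fp16/bf16 weights
weight_gb = params * bytes_per_param / 1e9  # parameter memory in GB

total_gb = 2 * 16      # two T4s at 16 GB each
free_gb = total_gb - weight_gb  # what's left for KV cache + activations

print(weight_gb, free_gb)
```

With only a few GB left after weights, vLLM can allocate only a handful of KV cache blocks, which matches the tiny `# GPU blocks: 37` in the log.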
Sorry, I just merged the other PR; can you resolve the conflict?
🤦♂️ Sorry, another conflict
Sounds good. I agree with @casper-hansen that this is very valuable and a good start for #3780
At a high level, I would imagine that running more end-to-end tests like https://github.com/EleutherAI/lm-evaluation-harness, which can directly target vLLM with a simple command, would be better. For actual testing I...
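For reference, a sketch of what such a run could look like; the harness has a vLLM backend, but the model name, task, and flags below are illustrative assumptions, so check the harness docs for the exact arguments your version supports:

```shell
# Illustrative lm-evaluation-harness invocation against vLLM.
# Model, task, and batch settings here are placeholders.
lm_eval \
  --model vllm \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1 \
  --tasks gsm8k \
  --batch_size auto
```

This kind of single-command end-to-end run could complement the unit tests rather than replace them.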
This is a different task. @youkaichao can you create a new issue tracking it?