vllm
How to integrate with HF with minimal modification?
I see these are all wrappers around vLLM. How can I integrate HF and get an out-of-the-box boost for my existing model?
Thanks for your interest and great question! You can install vLLM from source and directly modify the model code.
This is a huge change. Is there any easier way to do this with LLaMA? I don't want to insert this code into my existing transformers-based project.
> Thanks for your interest and great question! You can install vLLM from source and directly modify the model code.
Can you point out in the documentation which modifications are necessary, or give a tutorial on the steps for modifying a model?
The "Rewrite the forward methods" section in the documentation is too brief.
@lucasjinreal Is your model different from the original LLaMA? If not, you can simply pass the path to your model weights in llm = LLM(model=<path to your model>) and use the llm object and its generate method in your code.
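A minimal end-to-end sketch of that, assuming a hypothetical local HF-format checkpoint directory ./my-llama-checkpoint:

from vllm import LLM, SamplingParams

# Point vLLM at the local HF-format checkpoint (hypothetical path).
llm = LLM(model="./my-llama-checkpoint")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
prompts = ["Hello, my name is", "The capital of France is"]

# generate() batches the prompts and returns one RequestOutput per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)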
@liujuncn Thanks for your feedback. We'll add more detail to the docs. To address your issue quickly, could you share with us the specific model you're interested in using with vLLM? Depending on the model architecture, we might be able to add support for it promptly.
@WoosukKwon Can you be more specific? For example, I have an HF-based model:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    base_model_path,            # path to the local HF checkpoint
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    load_in_8bit=load_in_8bit,  # optional 8-bit loading via bitsandbytes
    device_map="auto",
)
How can I specify the equivalent of from_pretrained, and possibly these parameters, in vLLM? Are the weights in the same format vLLM expects? How can I choose the precision, e.g. fp16 or bf16? Also, my generation loop uses streaming; is that supported?
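For reference, a rough vLLM counterpart to the from_pretrained call above might look like the sketch below. This is an assumption based on the LLM constructor arguments (model, dtype, trust_remote_code), not an exact one-to-one mapping; load_in_8bit and device_map have no direct equivalents in this sketch.

from vllm import LLM

# Sketch: vLLM reads the same HF-format weights directly from the checkpoint path.
llm = LLM(
    model=base_model_path,   # same HF checkpoint path as above
    dtype="float16",         # or "bfloat16"
    trust_remote_code=True,
)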
> @liujuncn Thanks for your feedback. We'll add more detail to the docs. To address your issue quickly, could you share with us the specific model you're interested in using with vLLM? Depending on the model architecture, we might be able to add support for it promptly.
For example, x-transformers: https://github.com/lucidrains/x-transformers
With it, we can choose to combine different tricks. So how would a custom model architecture be possible with vLLM?