
How to integrate with HF with minimal modification?

Open lucasjinreal opened this issue 1 year ago • 7 comments

I see that the examples are all wrappers around vLLM. How can I integrate it with HF and get an out-of-the-box boost for my existing model?

lucasjinreal avatar Jun 21 '23 03:06 lucasjinreal

Thanks for your interest and great question! You can install vLLM from source and directly modify the model code.

WoosukKwon avatar Jun 21 '23 03:06 WoosukKwon

This is a huge change. Is there any easier way to do this with LLaMA? I don't want to insert this code into my existing transformers-based project.

lucasjinreal avatar Jun 21 '23 05:06 lucasjinreal

> Thanks for your interest and great question! You can install vLLM from source and directly modify the model code.

Can you point out in the documentation which modifications are necessary, or provide a tutorial on the steps for modifying a model?

The "Rewrite the forward methods" section in the documentation is too brief.

liujuncn avatar Jun 21 '23 07:06 liujuncn

@lucasjinreal Is your model different from the original LLaMA? If not, you can simply pass the path to your model weights in llm = LLM(model=<path to your model>) and use the llm object and its generate method in your code.

WoosukKwon avatar Jun 21 '23 08:06 WoosukKwon
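
For readers following along, here is a minimal sketch of that suggestion, assuming the weights are a standard HF-format LLaMA checkpoint (the path below is a placeholder):

```python
from vllm import LLM, SamplingParams

# Placeholder path: a directory containing HF-format LLaMA weights and config.
llm = LLM(model="/path/to/your/llama-weights")

sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], sampling_params)

for output in outputs:
    # Each RequestOutput holds the prompt and its generated completions.
    print(output.outputs[0].text)
```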

@liujuncn Thanks for your feedback. We'll describe more details in the doc. In order to address your issue quickly, could you share with us the specific model you're interested in using with vLLM? Depending on the model architecture, we might be able to incorporate support for it promptly.

WoosukKwon avatar Jun 21 '23 08:06 WoosukKwon

@WoosukKwon Can you be more specific?

For example, I have an HF-based model:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    load_in_8bit=load_in_8bit,
    device_map="auto",
)
```

How can I specify the equivalent of these from_pretrained parameters here? Are the weights in the same format vLLM expects? And how can I specify the precision, e.g. fp16 or bf16?

Also, my generate loop uses streaming; is that supported?

lucasjinreal avatar Jun 21 '23 11:06 lucasjinreal
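
A rough sketch of how those from_pretrained arguments might map onto vLLM's LLM constructor, assuming the dtype and trust_remote_code parameters described in the vLLM docs; load_in_8bit has no direct counterpart here, and streaming is served through vLLM's async engine / API server rather than the offline generate call:

```python
from vllm import LLM, SamplingParams

base_model_path = "/path/to/your/model"  # placeholder: same HF-format directory

# Assumed mapping of the HF from_pretrained arguments above:
#   torch_dtype=torch.float16  ->  dtype="float16" (or "bfloat16")
#   trust_remote_code=True     ->  trust_remote_code=True
#   load_in_8bit / device_map  ->  no direct equivalents in this call
llm = LLM(
    model=base_model_path,
    dtype="float16",
    trust_remote_code=True,
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```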

> @liujuncn Thanks for your feedback. We'll describe more details in the doc. In order to address your issue quickly, could you share with us the specific model you're interested in using with vLLM? Depending on the model architecture, we might be able to incorporate support for it promptly.

For example, x-transformers: https://github.com/lucidrains/x-transformers

With it we can choose to combine different tricks. So how would a custom model architecture be possible using vLLM?

liujuncn avatar Jun 22 '23 09:06 liujuncn