How to deploy a trained LLaVA model?
Currently, a trained LLaVA model can only be used via the CLI (without the ability to supply new images) or evaluated with benchmark tools. How can we deploy it behind an API or a WebUI for a more user-friendly interface?
@zodiacg lmdeploy v0.4.0 supports deploying llava-llama-3-8b models. You can try it following https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf#chat-by-lmdeploy
In the meantime, we will provide a script ASAP to convert xtuner-trained models (such as the llava-internlm2 models) to the official LLaVA format.
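For reference, a minimal sketch of what such a deployment could look like with lmdeploy's serving commands (assuming lmdeploy >= 0.4.0 and the HF-format model linked above; exact flags and the Gradio UI's image support may vary by version, so check the model card and the lmdeploy docs):

# Install lmdeploy (>= 0.4.0 is assumed for llava-llama-3-8b support)
pip install "lmdeploy>=0.4.0"

# Serve the model behind an OpenAI-compatible REST API
lmdeploy serve api_server xtuner/llava-llama-3-8b-v1_1-hf

# Or launch a Gradio WebUI for interactive chat
lmdeploy serve gradio xtuner/llava-llama-3-8b-v1_1-hf

Either entry point should give an interactive way to send new images to the model, which is what the plain CLI chat currently lacks.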
That would be very helpful, since we have trained some LLaVA models and hope to test them interactively.
From your replies, do I understand correctly that the merge doesn't add the LLaVA features to the model?
Here are the steps I followed (from this README and the root README): https://github.com/InternLM/xtuner/blob/main/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/README.md#model-convert-and-merge
I tried to convert my fine-tuned result to HF using the above guide, then merged it into the existing xtuner LLaVA like this:
xtuner convert merge \
"xtuner/llava-llama-3-8b-v1_1" \
"mytrainedmodel/visual_encoder_adapter" \
${SAVE_PATH} \
--max-shard-size 2GB
However, writing this, I suppose the second parameter should be an LLM QLoRA adapter and is probably unrelated to the LLaVA adapter?
@zodiacg @flotos Please follow the new docs at https://github.com/LZHgrla/xtuner/tree/lzh/llama3_convert/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336, which introduce the commands for model conversion and chat.
We have also released the related LLaVA-Llama-3-8B models, which can be found in the docs above.
Hi, thanks for your reply. I have tried to follow the steps, but my folders do not match the ones from the examples (I used the QLoRA fine-tune config). After converting my .pth to LLaVA in xtuner format, I have two folders, llm_adapter and projector, as well as an xtuner_config.py, but no visual_encoder_adapter folder like the one shown in the README.
Thus, when trying to convert to HF, I ran:
python ./convert_to_hf.py \
    --text_model_id ./output/merged_mymodel/ \
    --vision_model_id ./output/merged_mymodel/ \
    --projector_weight ./output/merged_mymodel/projector/model.safetensors \
    --save_path ./output/merged_mymodel_hf
This did not work, failing with the following error:
OSError: ./output/merged_mymodel/ does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co/./output/merged_mymodel//tree/main' for available files.
I haven't re-run the training since my comment two weeks ago; maybe there has since been an update to the library that now exports that folder?
Also, when trying to replace --vision_model_id with openai/clip-vit-large-patch14-336, I get:
AttributeError: 'CLIPConfig' object has no attribute 'hidden_size'
The scripts introduced are specifically tailored for LLaMA as the LLM. The primary appeal of xtuner, at least from my perspective, is the flexibility it offers to use other LLMs as the base. I hope that the xtuner-llava structure will also be supported.
@zodiacg Yes, we are developing this feature in other PRs: there will no longer be any need for cumbersome model conversion, and xtuner-llava models will connect directly to the inference backend.
Hi! @zodiacg You should first merge your LLM LoRA into the base LLM with

xtuner convert merge $LLM $LORA_ADAPTER $SAVE_PATH

Then, please use the saved LLM as the value of --text_model_id.
For the value of --vision_model_id, since the config you used freezes all parameters of the ViT, you can directly use openai/clip-vit-large-patch14-336, and the AttributeError can be solved by https://github.com/InternLM/xtuner/pull/661
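Putting the two steps together, the full flow might look roughly like this (the base LLM repo and the ./output/... paths are placeholders for your own checkpoints; convert_to_hf.py is the script from the docs linked above):

# 1) Merge the LLM LoRA adapter produced by the QLoRA fine-tune
#    into the base LLM (placeholder paths)
xtuner convert merge \
    meta-llama/Meta-Llama-3-8B-Instruct \
    ./output/mymodel_xtuner/llm_adapter \
    ./output/merged_llm

# 2) Assemble the HF-format LLaVA model from the merged LLM,
#    the stock (frozen) CLIP ViT, and the trained projector
python ./convert_to_hf.py \
    --text_model_id ./output/merged_llm \
    --vision_model_id openai/clip-vit-large-patch14-336 \
    --projector_weight ./output/mymodel_xtuner/projector/model.safetensors \
    --save_path ./output/mymodel_hf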
Thanks, this worked well for me. I have a question, however: the config reads

freeze_llm=True,
freeze_visual_encoder=True,

Why, if the LLM is frozen, do I need to merge a QLoRA into the base LLM? Shouldn't it train only the projection layer here?
Lastly, should the steps above still work if I simply change freeze_visual_encoder to False in the provided gpu1 script (and then follow the README to merge/convert)?
Thanks for the help above and your responsiveness to previous questions 🙏
@flotos The freeze_llm setting only freezes the base LLM; it doesn't freeze the LoRA weights. So, with the default setting, we should merge the LoRA into the base LLM after training.
As for freeze_visual_encoder: if you set it to False, a visual_encoder appears in the exported folder (since it was trained), and you should use this ViT to build the LLaVA model.
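To illustrate that second case, the conversion step could be adjusted roughly as follows (folder names are placeholders based on the discussion above, and this assumes the exported visual encoder has been converted to HF format alongside the projector):

# With freeze_visual_encoder=False, point --vision_model_id at the
# exported (trained) ViT instead of the stock CLIP checkpoint
python ./convert_to_hf.py \
    --text_model_id ./output/merged_llm \
    --vision_model_id ./output/mymodel_xtuner/visual_encoder \
    --projector_weight ./output/mymodel_xtuner/projector/model.safetensors \
    --save_path ./output/mymodel_hf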
@flotos Overall, --text_model_id should be the LLM of the LLaVA model, and --vision_model_id should be the CLIP ViT of the LLaVA model. So, do not forget to merge your LoRA.
Thanks very much for your time, this is very clear.