
How to deploy trained llava model?

Open zodiacg opened this issue 10 months ago • 12 comments

Currently, a trained llava model can only be used via the CLI (with no way to supply new images) or evaluated with benchmark tools. How can we deploy it behind an API or a WebUI for a more user-friendly interface?

zodiacg avatar Apr 19 '24 05:04 zodiacg

@zodiacg lmdeploy v0.4.0 supports deploying llava-llama-3-8b models. You can try it following https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf#chat-by-lmdeploy
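
That page covers interactive chat through the lmdeploy pipeline. If you want an API or a WebUI instead, lmdeploy's serving entry points can be pointed at the same model; a rough sketch (the port is arbitrary, and how fully your lmdeploy version serves vision-language models through these commands is something to verify):

# OpenAI-compatible REST API (port is illustrative)
lmdeploy serve api_server xtuner/llava-llama-3-8b-v1_1-hf --server-port 23333

# Gradio WebUI
lmdeploy serve gradio xtuner/llava-llama-3-8b-v1_1-hf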

At the same time, we will provide a script ASAP to convert xtuner-trained models (such as the llava-internlm2 models) to the official LLaVA format.

LZHgrla avatar Apr 24 '24 05:04 LZHgrla

That would be very helpful, since we have trained some llava models and hope to test them interactively.

zodiacg avatar Apr 24 '24 08:04 zodiacg

From your replies, do I understand correctly that the merge doesn't add the LLaVA features to the model?

Here are the steps I followed (from this link and the root README): https://github.com/InternLM/xtuner/blob/main/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/README.md#model-convert-and-merge

I tried to convert my fine-tuned result to HF using the above guide, then merged it into the existing xtuner llava like this:

xtuner convert merge \
    "xtuner/llava-llama-3-8b-v1_1" \
    "mytrainedmodel/visual_encoder_adapter" \
    ${SAVE_PATH} \
    --max-shard-size 2GB

However, writing this, I suppose the second parameter is an LLM QLoRA adapter and probably unrelated to the LLaVA adapter?

flotos avatar Apr 26 '24 10:04 flotos

@zodiacg @flotos Please follow these new docs: https://github.com/LZHgrla/xtuner/tree/lzh/llama3_convert/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336. They introduce the commands for model conversion and chat.

We have also released the related LLaVA-Llama-3-8B models, which can be found in the docs above.
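
For reference, the chat command in those docs looks roughly like the following. The base LLM, prompt template, and $IMAGE_PATH are placeholders taken from the public LLaVA-Llama-3-8B README, so adjust them to your own model:

xtuner chat meta-llama/Meta-Llama-3-8B-Instruct \
    --visual-encoder openai/clip-vit-large-patch14-336 \
    --llava xtuner/llava-llama-3-8b-v1_1 \
    --prompt-template llama3_chat \
    --image $IMAGE_PATH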

LZHgrla avatar Apr 26 '24 10:04 LZHgrla

Hi, thanks for your reply. I have tried to follow the steps, but my folders do not match the ones in the examples (I used the QLoRA fine-tune config). In my pth-to-LLaVA output in xtuner format, I have two folders, llm_adapter and projector, plus an xtuner_config.py, and no other files, unlike the "visual_encoder_adapter" shown in the README.

Thus, when trying to convert to HF, I ran

python ./convert_to_hf.py --text_model_id ./output/merged_mymodel/ --vision_model_id ./output/merged_mymodel/ --projector_weight ./output/merged_mymodel/projector/model.safetensors --save_path ./output/merged_mymodel_hf

which failed with the following error:

OSError: ./output/merged_mymodel/ does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co/./output/merged_mymodel//tree/main' for available files.

I haven't re-run the training since my comment two weeks ago; maybe there has been an update to the library since then that now produces this folder?

Also, when I try to replace --vision_model_id with openai/clip-vit-large-patch14-336, I get AttributeError: 'CLIPConfig' object has no attribute 'hidden_size'.

flotos avatar May 07 '24 18:05 flotos

The scripts introduced in those docs are specifically tailored to LLaMA as the LLM. The primary appeal of xtuner, at least from my perspective, is the flexibility it offers to use other LLMs as the base. I hope the xtuner-llava structure will also be supported.

zodiacg avatar May 08 '24 02:05 zodiacg

@zodiacg Yes, we are developing this feature in other PRs: there will no longer be any need for cumbersome model conversion, and xtuner-llava will connect directly to the inference backend.

pppppM avatar May 08 '24 04:05 pppppM

Hi @flotos! Regarding the conversion error above: you should first merge your LLM LoRA into the base LLM with

xtuner convert merge $LLM $LORA_ADAPTER $SAVE_PATH

Then use that saved LLM as the value of --text_model_id.

For the value of --vision_model_id: since the config you used freezes all parameters of the ViT, you can directly use openai/clip-vit-large-patch14-336, and the AttributeError can be solved by https://github.com/InternLM/xtuner/pull/661
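
Putting the two steps together, the sequence looks roughly like this (every path and the base-LLM name are placeholders for your own folders):

# 1. Merge the (Q)LoRA adapter exported by xtuner into the base LLM
xtuner convert merge \
    meta-llama/Meta-Llama-3-8B-Instruct \
    ./output/xtuner_llava/llm_adapter \
    ./output/merged_llm

# 2. Build the HF-format LLaVA model from the merged LLM, the frozen CLIP ViT,
#    and the trained projector (needs the fix from the PR above for the CLIPConfig error)
python ./convert_to_hf.py \
    --text_model_id ./output/merged_llm \
    --vision_model_id openai/clip-vit-large-patch14-336 \
    --projector_weight ./output/xtuner_llava/projector/model.safetensors \
    --save_path ./output/llava_hf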

LZHgrla avatar May 08 '24 06:05 LZHgrla

Thanks, this worked well for me. I have a question, however: the script reads

    freeze_llm=True,
    freeze_visual_encoder=True,

Why, if the LLM is frozen, do I need to merge a QLoRA into the base LLM? Shouldn't only the projection layer be trained here? Lastly, should the steps above still work if I simply set freeze_visual_encoder to False in the provided gpu1 script (and then follow the README to merge/convert)?

Thanks for the help above and for your quick responses to previous questions 🙏

flotos avatar May 08 '24 09:05 flotos

@flotos The freeze_llm setting only freezes the base LLM; it doesn't freeze the LoRA weights. So, with the default setting, we should merge the LoRA into the base LLM after training.

As for freeze_visual_encoder: if you set it to False, the exported folder will contain a visual_encoder (since the ViT is trained), and you should use that ViT to build the LLaVA model.
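
In that case, the only change to the conversion sketch above is to point --vision_model_id at the exported encoder instead of the original CLIP checkpoint (paths are illustrative):

python ./convert_to_hf.py \
    --text_model_id ./output/merged_llm \
    --vision_model_id ./output/xtuner_llava/visual_encoder \
    --projector_weight ./output/xtuner_llava/projector/model.safetensors \
    --save_path ./output/llava_hf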

LZHgrla avatar May 08 '24 09:05 LZHgrla

@flotos Overall, --text_model_id should be the LLM of the LLaVA model and --vision_model_id should be its CLIP ViT.

So, do not forget to merge your lora.

LZHgrla avatar May 08 '24 09:05 LZHgrla

Thanks very much for your time, this is very clear.

flotos avatar May 08 '24 09:05 flotos