[Feature Request] Support multimodal LLM, e.g., llava

Open StarCycle opened this issue 1 year ago • 2 comments

Hello,

Would you consider supporting multimodal LLMs (MLLMs) such as LLaVA?

Apr 19 '24 13:04 StarCycle

Hi @StarCycle, thanks for the feature request. Multimodal support is something we are still exploring. Would love to learn more about what you would like to use it for. And of course we welcome any initial prototype, if you're interested in contributing this :)

Apr 20 '24 04:04 RdoubleA

Hi @RdoubleA,

Currently I am training LLaVA with XTuner, which is similar to torchtune. It supports finetuning, evaluation, and deployment of LLaVA models (and it is easy to add custom modifications to the models). Integration of LLaVA 1.6 and video input is on the way. You can take their implementation as a reference :)

But they rely on HuggingFace transformers... I guess torchtune has fewer dependencies, which would be quite good!

Apr 22 '24 01:04 StarCycle