
[Feature Request] Support multimodal LLM, e.g., llava

Open StarCycle opened this issue 1 year ago • 2 comments

Hello,

Would you consider supporting multimodal LLMs (MLLMs) such as LLaVA?

StarCycle avatar Apr 19 '24 13:04 StarCycle

Hi @StarCycle, thanks for the feature request. Multimodal support is something we are still exploring. Would love to learn more about what you would like to use it for. And of course we welcome any initial prototype, if you're interested in contributing this :)

RdoubleA avatar Apr 20 '24 04:04 RdoubleA

Hi @RdoubleA,

I am currently training LLaVA with XTuner, which is similar to torchtune. It supports finetuning, evaluation, and deployment of LLaVA models, and custom modifications to the models are easy to add. Integration of LLaVA 1.6 and video input is on the way. You can take their implementation as a reference :)

But they rely on HuggingFace transformers... I guess torchtune has fewer dependencies, which would be quite good!

StarCycle avatar Apr 22 '24 01:04 StarCycle