LLaVA
[Question] Finetuning LLaVA for Robots (making a multi turn conversation dataset with images)
Question
Hello there LLaVA, beautiful work.
I'm working on integrating LLaVA's vision capabilities with robotics through my project, MachinaScript for Robots, which interprets JSON-syntax commands for Arduino-based robots. While GPT-4-Vision has worked well, its cost and latency are prohibitive for broader use. Smaller models like BakLLaVA and Obsidian are faster but struggle to produce consistent JSON output.
I aim to enhance robot decision-making with LLaVA by achieving two main goals:
- **Fine-Tuning for JSON Responses:** Ensure LLaVA models generate precise JSON-formatted commands from image inputs.
- **Dataset for Multi-Turn Image Conversations:** Develop a dataset for training on sequential image analysis, resembling video frames, similar to multi-turn chatbot conversation datasets.
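To make the two goals concrete, here is a minimal sketch of what one training record could look like in LLaVA's conversation-style JSON format, with the assistant turns serialized as JSON strings so the model learns to emit machine-parseable commands. The robot command schema (`action`, `direction`, etc.) is hypothetical, not actual MachinaScript syntax, and note that LLaVA's released finetuning scripts pair one image per record, so true multi-image episodes may need interleaved `<image>` tokens and data-loader changes:

```python
import json

# One training sample in a LLaVA-style conversation format.
# The robot command fields below are illustrative placeholders --
# substitute your own MachinaScript JSON schema.
sample = {
    "id": "robot-episode-0001",
    "image": "frames/episode0001_frame00.jpg",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nWhat should the robot do next?",
        },
        {
            "from": "gpt",
            # The assistant turn is itself a JSON string, so training
            # pushes the model toward machine-parseable output.
            "value": json.dumps(
                {"action": "move", "direction": "forward", "distance_cm": 30}
            ),
        },
        {
            "from": "human",
            # Later turns describe subsequent frames of the same episode,
            # approximating sequential video-frame reasoning.
            "value": "The robot moved. Here is the next frame. Now what?",
        },
        {
            "from": "gpt",
            "value": json.dumps({"action": "grip", "state": "close"}),
        },
    ],
}

# A dataset is a JSON list of such records, written to one file
# that the finetuning script can consume.
with open("robot_dataset.json", "w") as f:
    json.dump([sample], f, indent=2)
```

During evaluation you can then `json.loads` each generated assistant turn and reject samples that fail to parse, which gives a direct measure of how consistently the fine-tuned model emits valid JSON.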
Could you provide guidance on the best format and approach for creating such a dataset? Quick, actionable advice on fine-tuning LLaVA models for these specific needs would be greatly appreciated.