LLaVA
[Question] Finetuning LLaVA for Robots (making a multi turn conversation dataset with images)
Question
Hello there LLaVA, beautiful work.
I'm working on integrating LLaVA's vision capabilities with robotics through my project, MachinaScript for Robots, which interprets JSON-syntax commands for Arduino-based robots. While GPT-4-Vision has worked well, its cost and latency are prohibitive for broader use. Smaller models like BakLLaVA and Obsidian are faster but struggle to produce consistent JSON output.
I aim to enhance robot decision-making with LLaVA by achieving two main goals:
- **Fine-Tuning for JSON Responses:** Ensure LLaVA models generate precise JSON-formatted commands from image inputs.
- **Dataset for Multi-Turn Image Conversations:** Develop a dataset for training on sequential image analysis, resembling video frames, similar to multi-turn chatbot conversation datasets.
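To make the two goals concrete, here is a minimal sketch of what one training record could look like in LLaVA's conversation-style JSON format, with the assistant turns serialized as JSON strings so the model learns to emit machine-parseable commands. The robot command schema (`action`, `direction`, etc.) is hypothetical, not actual MachinaScript syntax, and note that LLaVA's released finetuning scripts pair one image per record, so true multi-image episodes may need interleaved `<image>` tokens and data-loader changes:

```python
import json

# One training sample in a LLaVA-style conversation format.
# The robot command fields below are illustrative placeholders --
# substitute your own MachinaScript JSON schema.
sample = {
    "id": "robot-episode-0001",
    "image": "frames/episode0001_frame00.jpg",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nWhat should the robot do next?",
        },
        {
            "from": "gpt",
            # The assistant turn is itself a JSON string, so training
            # pushes the model toward machine-parseable output.
            "value": json.dumps(
                {"action": "move", "direction": "forward", "distance_cm": 30}
            ),
        },
        {
            "from": "human",
            # Later turns describe subsequent frames of the same episode,
            # approximating sequential video-frame reasoning.
            "value": "The robot moved. Here is the next frame. Now what?",
        },
        {
            "from": "gpt",
            "value": json.dumps({"action": "grip", "state": "close"}),
        },
    ],
}

# A dataset is a JSON list of such records, written to one file
# that the finetuning script can consume.
with open("robot_dataset.json", "w") as f:
    json.dump([sample], f, indent=2)
```

During evaluation you can then `json.loads` each generated assistant turn and reject samples that fail to parse, which gives a direct measure of how consistently the fine-tuned model emits valid JSON.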
Could you provide guidance on the best format and approach for creating such a dataset? Quick, actionable advice on fine-tuning LLaVA models for these specific needs would be greatly appreciated.