OpenThinkIMG
Can segmentation and multimodal understanding (e.g., VQA) tasks be trained simultaneously?