DeepSeek-VL
dataset format of pretraining stage
How did you unify the format of the pretraining datasets? In the supervised fine-tuning (SFT) stage, the training data are curated as question-answer pairs. For caption or detection datasets, I would like to know whether they follow the same format as the SFT data and, if so, how you collected questions for them, since they originally contain only ground truth such as captions or bounding boxes.