SEED
Train data
Hello, and thank you for open-sourcing this outstanding work! I have a question about the `data_dir` entry in SEED/MultiModalLLM/configs/data/caption_torchdata_preprocess.yaml:
data_dir:
- ${oc.env:PROJECT_ROOT}/data/unsplash_resize/webdataset
- CC3M/webdataset/gcc3m_shards
Where can these datasets be downloaded? I noticed that the paper says: "We filtered the samples in these datasets based on image resolution, aspect ratio, and visual-textual similarity. We randomly place images or text at the forefront, in order to achieve the generation of captions based on images and vice versa." If possible, could you open-source the training data? Thank you very much!
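
To make the question concrete, here is a minimal Python sketch of how I read the preprocessing quoted from the paper. It is only my interpretation, not the authors' code: the `Sample` type, the `keep` and `order_modalities` helpers, and all three thresholds are hypothetical, since the paper does not state the actual values, and the similarity score is assumed to be precomputed (for example by a CLIP-style model).

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    width: int          # image width in pixels
    height: int         # image height in pixels
    caption: str
    similarity: float   # assumed precomputed image-text similarity score


# Hypothetical thresholds: the paper does not give the real values.
MIN_SIDE = 256
MAX_ASPECT_RATIO = 2.0
MIN_SIMILARITY = 0.25


def keep(sample: Sample) -> bool:
    """Filter on resolution, aspect ratio, and image-text similarity."""
    short, long = sorted((sample.width, sample.height))
    if short < MIN_SIDE:
        return False
    if long / short > MAX_ASPECT_RATIO:
        return False
    return sample.similarity >= MIN_SIMILARITY


def order_modalities(rng: random.Random) -> list[str]:
    """Randomly put the image or the text first in the training sequence."""
    if rng.random() < 0.5:
        return ["image", "text"]   # image first: caption generation
    return ["text", "image"]       # text first: image generation


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [
        Sample(512, 512, "a dog on a beach", 0.31),
        Sample(1024, 300, "a wide banner", 0.40),
    ]
    for s in filter(keep, samples):
        print(order_modalities(rng), s.caption)
```

If the real pipeline differs from this, for example in the thresholds or in how the similarity is computed, it would be very helpful to know.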