Train data

Open • APiaoG opened this issue 1 year ago • 1 comment

Hello, and thank you for open-sourcing this outstanding work! I have a question about the data_dir entries in SEED/MultiModalLLM/configs/data/caption_torchdata_preprocess.yaml:

  • ${oc.env:PROJECT_ROOT}/data/unsplash_resize/webdataset
  • CC3M/webdataset/gcc3m_shards

Where can these datasets be downloaded? I noticed that the paper says: “We filtered the samples in these datasets based on image resolution, aspect ratio, and visual-textual similarity. We randomly place images or text at the forefront, in order to achieve the generation of captions based on images and vice versa.” If possible, could you release the training data as well? Thank you very much!

APiaoG • Feb 27 '24 13:02