SEED Train data

Open · APiaoG opened this issue 1 year ago · 1 comment

Hello, and thank you for open-sourcing this outstanding work! I have a question about SEED/MultiModalLLM/configs/data/caption_torchdata_preprocess.yaml, where data_dir is set to:

${oc.env:PROJECT_ROOT}/data/unsplash_resize/webdataset
CC3M/webdataset/gcc3m_shards

Where can these datasets be downloaded? I noticed that the paper says: “We filtered the samples in these datasets based on image resolution, aspect ratio, and visual-textual similarity. We randomly place images or text at the forefront, in order to achieve the generation of captions based on images and vice versa.” If possible, could you release the training data? Thank you very much!

Feb 27 '24 13:02 APiaoG
