[Question] About the Hugging Face data
Required prerequisites
- [x] I have read the documentation https://align-anything.readthedocs.io.
- [x] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- [x] Consider asking first in a Discussion.
Questions
The text-image-to-text subset of align-anything (https://huggingface.co/datasets/PKU-Alignment/align-anything/tree/main/text-image-to-text) contains several train.parquet files. Which parquet file is actually used?
- https://huggingface.co/datasets/PKU-Alignment/align-anything/blob/main/text-image-to-text/train.parquet
- the files under https://huggingface.co/datasets/PKU-Alignment/align-anything/tree/main/text-image-to-text/new
- the files under https://huggingface.co/datasets/PKU-Alignment/align-anything/tree/main/text-image-to-text/train
The contents of the last two directories should be identical. What does https://huggingface.co/datasets/PKU-Alignment/align-anything/blob/main/text-image-to-text/train.parquet contain? Is it additional preference data that was constructed later, and was it constructed with the same method as the data in https://huggingface.co/datasets/PKU-Alignment/align-anything/tree/main/text-image-to-text/new?
@Gaiejj @XuyaoWang @yongzhemiaolegemi @cby-pku @htlou Looking forward to your reply!
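Thanks for your interest in the align-anything dataset! In the text-image-to-text subset, only https://huggingface.co/datasets/PKU-Alignment/align-anything/blob/main/text-image-to-text/train.parquet is used; it is the latest version of this subset. You can get the dataset directly with the following example code from the README: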
from datasets import load_dataset
train_dataset = load_dataset('PKU-Alignment/align-anything', name='text-image-to-text')['train']
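If you want to inspect that one train.parquet file on its own, without relying on how the subset config resolves its files, here is a minimal sketch. It assumes the repository is publicly readable and that the `datasets` package is installed; `parquet_url` and `load_single_parquet` are hypothetical helper names introduced here for illustration and are not part of align-anything.

```python
from urllib.parse import quote

REPO_ID = "PKU-Alignment/align-anything"
SUBSET = "text-image-to-text"


def parquet_url(relative_path: str, revision: str = "main") -> str:
    """Build the direct-download ("resolve") URL for a file in the dataset repo.

    Hypothetical helper: it only formats the standard Hub URL pattern
    https://huggingface.co/datasets/<repo>/resolve/<revision>/<path>.
    """
    return (
        f"https://huggingface.co/datasets/{REPO_ID}/resolve/"
        f"{quote(revision)}/{quote(relative_path)}"
    )


def load_single_parquet(relative_path: str):
    """Load exactly one parquet file from the repo as a `train` split."""
    # Imported lazily so the URL helper works even without `datasets` installed.
    from datasets import load_dataset

    # The generic "parquet" builder accepts remote URLs in `data_files`.
    return load_dataset(
        "parquet",
        data_files={"train": parquet_url(relative_path)},
    )["train"]


if __name__ == "__main__":
    # Assumption: this is the file named in the reply above.
    ds = load_single_parquet(f"{SUBSET}/train.parquet")
    print(ds.num_rows, ds.column_names)
```

Running the script downloads the file and prints its row count and column names, which you can compare against the result of the README snippet.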