
[Question] about the huggingface data

Open zhang123434 opened this issue 2 months ago • 1 comment

Required prerequisites

Questions

The text-image-to-text subset of align-anything (https://huggingface.co/datasets/PKU-Alignment/align-anything/tree/main/text-image-to-text) contains multiple train.parquet files. Which parquet file is actually used: https://huggingface.co/datasets/PKU-Alignment/align-anything/blob/main/text-image-to-text/train.parquet , the files under https://huggingface.co/datasets/PKU-Alignment/align-anything/tree/main/text-image-to-text/new , or the files under https://huggingface.co/datasets/PKU-Alignment/align-anything/tree/main/text-image-to-text/train ? The contents of the latter two directories appear to be identical. What does https://huggingface.co/datasets/PKU-Alignment/align-anything/blob/main/text-image-to-text/train.parquet contain? Is it additional preference data constructed later, and was it built with the same method as the data in https://huggingface.co/datasets/PKU-Alignment/align-anything/tree/main/text-image-to-text/new ?

@Gaiejj @XuyaoWang @yongzhemiaolegemi @cby-pku @htlou Looking forward to your reply!

zhang123434 avatar Oct 26 '25 03:10 zhang123434

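Thanks for your interest in the align-anything dataset! In the text-image-to-text subset, only https://huggingface.co/datasets/PKU-Alignment/align-anything/blob/main/text-image-to-text/train.parquet is used; it is the latest version of this subset. You can fetch the dataset directly with the following example code from the README: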

from datasets import load_dataset
train_dataset = load_dataset('PKU-Alignment/align-anything', name='text-image-to-text')['train']
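If you want to be explicit about which file is read, a minimal sketch is to point `load_dataset` at the top-level train.parquet through its `data_files` argument. The helper names `top_level_train_file` and `load_top_level_train` are hypothetical and only for illustration. Whether this yields exactly the same rows and columns as the README call above is an assumption that has not been verified here.

```python
REPO_ID = "PKU-Alignment/align-anything"
SUBSET = "text-image-to-text"


def top_level_train_file(subset: str) -> str:
    """Return the repo-relative path of the subset's top-level train.parquet."""
    return f"{subset}/train.parquet"


def load_top_level_train(subset: str = SUBSET):
    """Load only the top-level train.parquet, ignoring the new/ and train/ folders.

    Assumption: the file can be read as a standalone parquet dataset.
    """
    # Imported lazily so the path helper works without `datasets` installed.
    from datasets import load_dataset

    # data_files restricts loading to the named file inside the dataset repo.
    return load_dataset(
        REPO_ID,
        data_files=top_level_train_file(subset),
        split="train",
    )
```

Calling `load_top_level_train()` downloads the file from the Hub, so it needs network access and the `datasets` package.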

d4yz3ro avatar Nov 12 '25 07:11 d4yz3ro