[Question] about the huggingface data

Open zhang123434 opened this issue 2 months ago • 1 comment

Required prerequisites

[x] I have read the documentation https://align-anything.readthedocs.io.
[x] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
[x] Consider asking first in a Discussion.

Questions

In the text-image-to-text subset of align-anything (https://huggingface.co/datasets/PKU-Alignment/align-anything/tree/main/text-image-to-text), there are multiple train.parquet files. Which parquet file is actually used: https://huggingface.co/datasets/PKU-Alignment/align-anything/blob/main/text-image-to-text/train.parquet , the files under https://huggingface.co/datasets/PKU-Alignment/align-anything/tree/main/text-image-to-text/new , or the files under https://huggingface.co/datasets/PKU-Alignment/align-anything/tree/main/text-image-to-text/train ? The contents of the latter two directories should be the same. What does https://huggingface.co/datasets/PKU-Alignment/align-anything/blob/main/text-image-to-text/train.parquet contain? Is it additional preference data constructed later, and was it built with the same method as the data in https://huggingface.co/datasets/PKU-Alignment/align-anything/tree/main/text-image-to-text/new ?

@Gaiejj @XuyaoWang @yongzhemiaolegemi @cby-pku @htlou Looking forward to your answer!

Oct 26 '25 03:10 zhang123434

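Thanks for your interest in the align-anything dataset. In the text-image-to-text subset of align-anything, only https://huggingface.co/datasets/PKU-Alignment/align-anything/blob/main/text-image-to-text/train.parquet is used; it is the latest version of this subset. You can get the dataset directly with the following example code from the README: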

from datasets import load_dataset
train_dataset = load_dataset('PKU-Alignment/align-anything', name='text-image-to-text')['train']
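
If you want to inspect one specific parquet file from the repo rather than going through the subset name (for example, to compare it with what the README snippet returns), here is a minimal sketch. It assumes a recent version of the `datasets` library that accepts `hf://` paths in `data_files`. The helper names `hub_parquet_path` and `load_parquet_split` are hypothetical and only for illustration; they are not part of align-anything.

```python
REPO_ID = "PKU-Alignment/align-anything"
SUBSET = "text-image-to-text"


def hub_parquet_path(relative_path: str, repo_id: str = REPO_ID) -> str:
    """Build an hf:// path to a file inside a dataset repo (hypothetical helper)."""
    return f"hf://datasets/{repo_id}/{relative_path.lstrip('/')}"


def load_parquet_split(relative_path: str, repo_id: str = REPO_ID):
    """Load a single parquet file from the Hub as one split (hypothetical helper)."""
    # Imported lazily so the path helper works even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "parquet",
        data_files={"train": hub_parquet_path(relative_path, repo_id)},
        split="train",
    )


# Example usage (requires network access):
# top_level = load_parquet_split(f"{SUBSET}/train.parquet")
# print(top_level.num_rows, top_level.column_names)
```

Comparing `num_rows` and `column_names` of the file loaded this way against `train_dataset` from the README snippet is one way to check that both point at the same data.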

Nov 12 '25 07:11 d4yz3ro