
Custom calibration dataset: documentation doesn't match the code

Open xinyubai1209 opened this issue 9 months ago • 3 comments

The docs say a custom dataset should be a txt file, one text sample per line:

calib:
    name: custom
    download: False
    load_from_txt: True
    path: # Custom dataset, ending with txt as suffix
    n_samples: 128
    bs: -1
    seq_len: 512
    preproc: random_truncate_txt
    seed: *seed

Running with this config raises an error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/llmc/llmc/__main__.py", line 248, in <module>
[rank0]:     main(config)
[rank0]:   File "/home/llmc/llmc/__main__.py", line 51, in main
[rank0]:     dataset = BaseDataset(
[rank0]:   File "/home/llmc/llmc/data/dataset/base_dataset.py", line 45, in __init__
[rank0]:     self.build_calib_dataset()
[rank0]:   File "/home/llmc/llmc/data/dataset/base_dataset.py", line 77, in build_calib_dataset
[rank0]:     self.calib_dataset = load_from_disk(self.calib_dataset_path)
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/datasets/load.py", line 2638, in load_from_disk
[rank0]:     raise FileNotFoundError(
[rank0]: FileNotFoundError: Directory /home/datasets/ppo_train.txt is neither a `Dataset` directory nor a `DatasetDict` directory.

The relevant code block has no handling logic for "custom":

def build_calib_dataset(self):
    if self.download:
        if self.calib_dataset_name == 'pileval':
            self.calib_dataset = load_dataset(
                'mit-han-lab/pile-val-backup', split='validation'
            )
        elif self.calib_dataset_name == 'c4':
            self.calib_dataset = load_dataset(
                'allenai/c4',
                data_files={'train': 'en/c4-train.00000-of-01024.json.gz'},
                split='train',
            )
        elif self.calib_dataset_name == 'wikitext2':
            self.calib_dataset = load_dataset(
                'wikitext', 'wikitext-2-raw-v1', split='train'
            )
        elif self.calib_dataset_name == 'ptb':
            self.calib_dataset = load_dataset(
                'ptb_text_only', 'penn_treebank', split='train'
            )
        elif self.calib_dataset_name == 'ultrachat':
            self.calib_dataset = load_dataset(
                'HuggingFaceH4/ultrachat_200k', split='train_sft'
            )
        else:
            raise Exception(f'Not support {self.calib_dataset_name} dataset.')
    else:
        if self.calib_dataset_name == 'custom_txt' or self.calib_dataset_name == 'custom_mm' or self.calib_dataset_name == 'images':  # noqa
            self.calib_dataset = self.get_cutomdata(self.calib_dataset_path)
        else:
            self.calib_dataset = load_from_disk(self.calib_dataset_path)

xinyubai1209 avatar Apr 14 '25 08:04 xinyubai1209
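Judging from the `custom_txt` branch in `build_calib_dataset` above, one plausible (untested) workaround is to set `name: custom_txt` so the custom loader `get_cutomdata` is used instead of `load_from_disk`. This is a sketch based only on the code quoted in the issue; the exact keys and the expected file format still need to be confirmed against the current docs:

```yaml
calib:
    name: custom_txt       # 'custom' falls through to load_from_disk; 'custom_txt' does not
    download: False
    load_from_txt: True    # assumed: unclear whether the new loader still reads this key
    path: /path/to/custom_dataset   # path to the custom calibration file
    n_samples: 128
    bs: -1
    seq_len: 512
    preproc: random_truncate_txt
    seed: *seed
```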

Looking at the code, the new version's custom dataset is in json format now, and is further split into txt, mm, and so on. Please update the docs so others don't trip over this.

xinyubai1209 avatar Apr 14 '25 09:04 xinyubai1209

Looking at the code, the new version's custom dataset is in json format now, and is further split into txt, mm, and so on. Please update the docs so others don't trip over this.

@helloyongyang

gushiqiao avatar May 07 '25 08:05 gushiqiao

Looking at the code, the new version's custom dataset is in json format now, and is further split into txt, mm, and so on. Please update the docs so others don't trip over this.

I've already hit this pitfall. Could you tell us what the dataset format looks like? Does the filename also have to be fixed as samples.json?

gooooood-coder avatar Aug 06 '25 07:08 gooooood-coder
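Until the docs are updated, here is a minimal sketch for converting a one-sample-per-line txt file into a json file. The `[{"text": ...}]` schema is an assumption, not confirmed behavior; verify the actual expected fields against `get_cutomdata` in `llmc/data/dataset/base_dataset.py`:

```python
import json

def txt_to_json(txt_path: str, json_path: str) -> int:
    """Convert a one-sample-per-line txt file into a JSON list of records.

    ASSUMPTION: the loader expects [{"text": ...}, ...]; check the repo
    for the real field names before relying on this.
    """
    with open(txt_path, encoding='utf-8') as f:
        # Skip blank lines; each remaining line is one calibration sample.
        lines = [line.strip() for line in f if line.strip()]
    records = [{'text': line} for line in lines]
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return len(records)
```

Usage would be `txt_to_json('ppo_train.txt', 'ppo_train.json')`, then pointing the config's `path` at the json file.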