LightCompress
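Custom calibration dataset: documentation does not match the code
The docs require a custom dataset to be a txt file with one text sample per line, configured as follows: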
calib:
    name: custom
    download: False
    load_from_txt: True
    path: # Custom dataset, ending with txt as suffix
    n_samples: 128
    bs: -1
    seq_len: 512
    preproc: random_truncate_txt
    seed: *seed
Running with this config fails with the following error:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/llmc/llmc/__main__.py", line 248, in <module>
[rank0]:     main(config)
[rank0]:   File "/home/llmc/llmc/__main__.py", line 51, in main
[rank0]:     dataset = BaseDataset(
[rank0]:   File "/home/llmc/llmc/data/dataset/base_dataset.py", line 45, in __init__
[rank0]:     self.build_calib_dataset()
[rank0]:   File "/home/llmc/llmc/data/dataset/base_dataset.py", line 77, in build_calib_dataset
[rank0]:     self.calib_dataset = load_from_disk(self.calib_dataset_path)
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/datasets/load.py", line 2638, in load_from_disk
[rank0]:     raise FileNotFoundError(
[rank0]: FileNotFoundError: Directory /home/datasets/ppo_train.txt is neither a `Dataset` directory nor a `DatasetDict` directory.
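The error is raised because load_from_disk expects a directory previously written by the datasets library, not a plain txt file. A minimal pre-flight check is sketched below. The helper name looks_like_saved_dataset is hypothetical, and the marker file names (dataset_info.json, state.json, dataset_dict.json) are my assumption about what datasets writes when saving to disk, so treat this as a rough diagnostic rather than the library's own check.

```python
import os

# Assumption: marker files that `datasets` writes when a dataset is saved to disk.
DATASET_MARKERS = ('dataset_info.json', 'state.json')
DATASET_DICT_MARKER = 'dataset_dict.json'


def looks_like_saved_dataset(path: str) -> bool:
    """Return True if `path` looks like a saved Dataset or DatasetDict directory."""
    # A plain file such as ppo_train.txt can never qualify.
    if not os.path.isdir(path):
        return False
    has_dataset = all(
        os.path.isfile(os.path.join(path, marker)) for marker in DATASET_MARKERS
    )
    has_dataset_dict = os.path.isfile(os.path.join(path, DATASET_DICT_MARKER))
    return has_dataset or has_dataset_dict
```

Calling looks_like_saved_dataset on the txt path from the traceback would return False, which matches the branch that raises FileNotFoundError.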
The relevant code block has no handling for the name "custom":
def build_calib_dataset(self):
    if self.download:
        if self.calib_dataset_name == 'pileval':
            self.calib_dataset = load_dataset(
                'mit-han-lab/pile-val-backup', split='validation'
            )
        elif self.calib_dataset_name == 'c4':
            self.calib_dataset = load_dataset(
                'allenai/c4',
                data_files={'train': 'en/c4-train.00000-of-01024.json.gz'},
                split='train',
            )
        elif self.calib_dataset_name == 'wikitext2':
            self.calib_dataset = load_dataset(
                'wikitext', 'wikitext-2-raw-v1', split='train'
            )
        elif self.calib_dataset_name == 'ptb':
            self.calib_dataset = load_dataset(
                'ptb_text_only', 'penn_treebank', split='train'
            )
        elif self.calib_dataset_name == 'ultrachat':
            self.calib_dataset = load_dataset(
                'HuggingFaceH4/ultrachat_200k', split='train_sft'
            )
        else:
            raise Exception(f'Not support {self.calib_dataset_name} dataset.')
    else:
        if self.calib_dataset_name == 'custom_txt' or self.calib_dataset_name == 'custom_mm' or self.calib_dataset_name == 'images':  # noqa
            self.calib_dataset = self.get_cutomdata(self.calib_dataset_path)
        else:
            self.calib_dataset = load_from_disk(self.calib_dataset_path)
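The branch selection in the quoted method can be reproduced in isolation to show why name: custom ends up in load_from_disk. This is a standalone sketch, not llmc code: select_loader and the two name sets are hypothetical helpers that only mirror the conditions quoted above.

```python
# Standalone sketch of the branch selection in build_calib_dataset (not llmc code).
HUB_DATASETS = {'pileval', 'c4', 'wikitext2', 'ptb', 'ultrachat'}
CUSTOM_NAMES = {'custom_txt', 'custom_mm', 'images'}


def select_loader(name: str, download: bool) -> str:
    """Return the name of the loader the quoted code would call."""
    if download:
        if name in HUB_DATASETS:
            return 'load_dataset'
        raise ValueError(f'Not support {name} dataset.')
    if name in CUSTOM_NAMES:
        return 'get_cutomdata'  # spelled as in the quoted code
    # Every other name, including 'custom', falls through to here.
    return 'load_from_disk'


if __name__ == '__main__':
    for name in ('custom', 'custom_txt'):
        print(name, '->', select_loader(name, download=False))
```

With download: False, the documented name custom is not in the set of recognised custom names, so the txt path is handed to load_from_disk, which produces the error in the traceback.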
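From the code, it looks like custom datasets in the new version use JSON format and are further split into types such as txt and mm. Please update the documentation so that others do not run into the same problem.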
@helloyongyang
I have already run into this. Could you tell me what the dataset format looks like? Does the file name also have to be fixed as samples.json?