
Problem during data preprocessing

Open NARIKA3 opened this issue 7 months ago • 5 comments

[WeClone] I | 20:05:01 | Loading configuration from: ./settings.jsonc
[WeClone] I | 20:05:01 | Chat-log blocked words: ['例如 密码', '例如 姓名', '//.....']
[WeClone] I | 20:05:01 | Found 1 CSV file in total, starting processing
[WeClone] D | 20:05:01 | Processing CSV file: ./dataset/csv\55954793313@chatroom\55954793313@chatroom_0_64.csv
[WeClone] D | 20:05:01 | Finished processing: ./dataset/csv\55954793313@chatroom\55954793313@chatroom_0_64.csv, 51 messages loaded
[WeClone] S | 20:05:01 | Chat logs processed successfully, 0 entries in total, saved to ./dataset/res_csv/sft/sft-my.json
[WeClone] I | 20:05:09 | Starting cutoff_len calculation......
Setting num_proc from 16 back to 1 for the train split to disable multiprocessing as it only contains one shard.
Generating train split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "D:\weclone\Weclone\weclone\utils\length_cdf.py", line 73, in <module>
    fire.Fire(length_cdf)
  File "D:\weclone\Weclone.venv\Lib\site-packages\fire\core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "D:\weclone\Weclone.venv\Lib\site-packages\fire\core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "D:\weclone\Weclone.venv\Lib\site-packages\fire\core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "D:\weclone\Weclone\weclone\utils\length_cdf.py", line 56, in length_cdf
    trainset = get_dataset(template, model_args, data_args, training_args, "sft", **tokenizer_module)["train_dataset"]  # type: ignore
  File "D:\weclone\Weclone.venv\Lib\site-packages\llamafactory\data\loader.py", line 307, in get_dataset
    dataset = _get_merged_dataset(data_args.dataset, model_args, data_args, training_args, stage)
  File "D:\weclone\Weclone.venv\Lib\site-packages\llamafactory\data\loader.py", line 179, in _get_merged_dataset
    datasets[dataset_name] = _load_single_dataset(dataset_attr, model_args, data_args, training_args)
  File "D:\weclone\Weclone.venv\Lib\site-packages\llamafactory\data\loader.py", line 128, in _load_single_dataset
    dataset = load_dataset(
  File "D:\weclone\Weclone.venv\Lib\site-packages\datasets\load.py", line 2163, in load_dataset
    ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
  File "D:\weclone\Weclone.venv\Lib\site-packages\datasets\builder.py", line 1126, in as_dataset
    datasets = map_nested(
  File "D:\weclone\Weclone.venv\Lib\site-packages\datasets\utils\py_utils.py", line 484, in map_nested
    mapped = function(data_struct)
  File "D:\weclone\Weclone.venv\Lib\site-packages\datasets\builder.py", line 1156, in _build_single_dataset
    ds = self._as_dataset(
  File "D:\weclone\Weclone.venv\Lib\site-packages\datasets\builder.py", line 1230, in _as_dataset
    dataset_kwargs = ArrowReader(cache_dir, self.info).read(
  File "D:\weclone\Weclone.venv\Lib\site-packages\datasets\arrow_reader.py", line 251, in read
    raise ValueError(msg)
ValueError: Instruction "train" corresponds to no data!
[WeClone] E | 20:05:11 | Command 'D:\weclone\Weclone.venv\Scripts\python.exe weclone\utils\length_cdf.py --model_name_or_path="./modelQwen" --dataset="wechat-sft" --dataset_dir="./dataset/res_csv/sft" --template="qwen" --interval=256' failed with return code 1
[WeClone] S | 20:05:11 | Chat logs processed successfully, 0 entries in total, saved to ./dataset/res_csv/sft/sft-my.json
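The log shows that 51 messages were loaded from the CSV, but 0 entries survived into sft-my.json, so the `datasets` loader later finds an empty train split and raises the ValueError. A quick sanity check on the generated SFT file can surface this before `length_cdf` ever runs. A minimal sketch, assuming the SFT output is a JSON list at the path shown in the log (`count_sft_examples` is a hypothetical helper, not part of WeClone):

```python
import json
from pathlib import Path

# Path taken from the log above; adjust for your own setup.
SFT_FILE = Path("./dataset/res_csv/sft/sft-my.json")


def count_sft_examples(path: Path) -> int:
    """Return the number of examples in a generated SFT JSON file.

    Assumes the file holds a JSON list of examples; a missing file
    or any other JSON shape counts as zero usable examples.
    """
    if not path.exists():
        return 0
    data = json.loads(path.read_text(encoding="utf-8"))
    return len(data) if isinstance(data, list) else 0


if __name__ == "__main__":
    n = count_sft_examples(SFT_FILE)
    if n == 0:
        # An empty dataset is exactly what makes `datasets` raise
        # 'Instruction "train" corresponds to no data!' further down.
        print("No examples survived preprocessing; "
              "check blocked-word filters and CSV contents.")
    else:
        print(f"{n} examples ready for training.")
```

If this reports zero while the CSV step logged 51 loaded messages, the filtering stage (blocked words, message-type filters, and similar settings in settings.jsonc) is the likely place where everything was dropped.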

NARIKA3 avatar May 18 '25 12:05 NARIKA3

Probably too little data.

xming521 avatar May 18 '25 14:05 xming521

Roughly how much data is needed?

NARIKA3 avatar May 18 '25 14:05 NARIKA3

The more, the better.

xming521 avatar May 18 '25 15:05 xming521

Got it, thanks for the reply.

NARIKA3 avatar May 18 '25 15:05 NARIKA3

I'm hitting this same error with 130,000 messages.

PaulJiang-123 avatar May 30 '25 22:05 PaulJiang-123