CPM-Bee
Pretraining data format
I ran the pretrain_cpm_bee.sh script after changing the dataset argument to point to my own datasets.json:
[
    {
        "dataset_name": "pretrain",
        "task_name": "mlm",
        "weight": 1.0,
        "path": "/home/litao/ScienGU/CPM-Bee/sciengu/zhinan/bin_data",
        "transforms": [
            {
                "answer": "$answer",
                "document": "$source"
            },
            {
                "answer": "$answer",
                "query": "$source"
            },
            {
                "answer": "$answer",
                "input": "$source"
            }
        ]
    }
]
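I changed path so that it points to my own preprocessed data. I don't really understand the transforms field and would appreciate an explanation.
Here is a sample record from the data I used: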
{"answer": "当前现代医学的主要治疗甲状腺药物", "input": "当前现代医学的主要治疗甲状腺药物"}
Here is the error output:
Traceback (most recent call last):
File "/home/share/wuhkjdxue30509/home/wust30509/.conda/envs/bmtrain/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/share/wuhkjdxue30509/home/wust30509/.conda/envs/bmtrain/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 932, in _mixed_dataset_process
batch = packer.add_data(config[ds_id])
File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 638, in add_data
) = self.build_instance(config)
File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 439, in build_instance
inp = ds.read()
File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/dataset/distributed_dataset.py", line 554, in read
next_block_id = self._get_next_block()
File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/dataset/distributed_dataset.py", line 394, in _get_next_block
raise RuntimeError("Empty dataset {}".format(self._path))
RuntimeError: Empty dataset /home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/sciengu/zhinan/bin_data
Process Process-1:
Traceback (most recent call last):
File "/home/share/wuhkjdxue30509/home/wust30509/.conda/envs/bmtrain/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/share/wuhkjdxue30509/home/wust30509/.conda/envs/bmtrain/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 932, in _mixed_dataset_process
batch = packer.add_data(config[ds_id])
File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 638, in add_data
) = self.build_instance(config)
File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 440, in build_instance
inp = self.apply_transform(inp, transform)
File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 344, in apply_transform
_expand_mapping(data, [], src[1:].split("."), tgt.split("."))
File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 338, in _expand_mapping
_expand_mapping(data[path[0]], stars, path[1:], target)
KeyError: 'source'
Hi, did you manage to get this running?
No, nobody has replied yet.
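When you run preprocess_dataset.py, you need to set block_size to a smaller value in both build_dataset and shuffle_dataset, or else make your dataset larger. transforms is used to transform the data: {"document": "$source"} means that the "source" field of the raw data is mapped into the "document" field.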
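To make the transforms explanation concrete, here is a minimal Python sketch of the mapping idea. It is not the code in pretrain.py: apply_simple_transform is a hypothetical helper that only handles flat "$field" references, whereas the traceback suggests the real apply_transform also handles dotted paths. It shows why the sample record above, which has only "answer" and "input" fields, fails with KeyError: 'source', and that a transform referring to "$input" would resolve for such a record.

```python
def apply_simple_transform(record, transform):
    """Hypothetical, simplified stand-in for a transform step.

    Each entry maps a target field to "$name", meaning
    "copy the value of field `name` from the raw record".
    """
    out = {}
    for target, spec in transform.items():
        if isinstance(spec, str) and spec.startswith("$"):
            # Raises KeyError when the raw record has no such field.
            out[target] = record[spec[1:]]
        else:
            # Anything else is treated as a literal value (an assumption).
            out[target] = spec
    return out


# The sample record quoted in the question: it has no "source" field.
raw = {
    "answer": "当前现代医学的主要治疗甲状腺药物",
    "input": "当前现代医学的主要治疗甲状腺药物",
}

try:
    apply_simple_transform(raw, {"answer": "$answer", "document": "$source"})
    missing = None
except KeyError as err:
    missing = err.args[0]  # name of the field that could not be found

# A transform that only refers to fields the record actually has.
fixed = apply_simple_transform(raw, {"answer": "$answer", "input": "$input"})

print("missing field:", missing)
print("fields after transform:", sorted(fixed))
```

Under this reading, either the raw records should carry a "source" field, or the transforms in datasets.json should reference the fields that are really present.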
The reply above is correct; I have verified that it works. I changed DEFAULT_BLOCK_SIZE in cpm_live/dataset/distributed_dataset.py to 16<<10.
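For reference, 16<<10 is 16384 bytes (16 KiB). The sketch below illustrates one plausible reason why a smaller block size helps with a small dataset. It assumes that only whole blocks are counted, which is a guess at the behaviour and not taken from distributed_dataset.py; whole_blocks, the 16 MiB comparison value, and the 200 KiB dataset size are all hypothetical.

```python
def whole_blocks(nbytes, block_size):
    # Hypothetical rule: a dataset contributes only complete blocks.
    return nbytes // block_size


small_block = 16 << 10    # 16384 bytes (16 KiB), the value used in the thread
large_block = 16 << 20    # 16 MiB, a hypothetical larger block size for comparison
data_size = 200 * 1024    # hypothetical 200 KiB of preprocessed data

blocks_large = whole_blocks(data_size, large_block)
blocks_small = whole_blocks(data_size, small_block)

print(small_block)    # 16384
print(blocks_large)   # 0  -> no complete block, dataset would look empty
print(blocks_small)   # 12 -> several complete blocks available
```

If that assumption holds, either shrinking the block size or adding more data gets the block count above zero, which matches both options given in the reply.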