data-juicer
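Does data augmentation only work on single-field JSON, and not on multi-field JSON?
For example, this sample works:
{"text":"包含同音字替换测试:今天天气很好,我们去公园玩."}
but this one, which has an extra "info" field, does not:
{"info":"文本", "text":"包含同音字替换测试:今天天气很好,我们去公园玩."}
Running it fails with the following error: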
2025-11-11 05:33:10.943 | ERROR | data_juicer.core.data.dj_dataset:317 - An error occurred during Op [nlpcda_zh_mapper].
Traceback (most recent call last):
File "/data-juicer/data_juicer/core/data/dj_dataset.py", line 297, in process
dataset, resource_util_per_op = Monitor.monitor_func(op.run, args=run_args)
File "/data-juicer/data_juicer/core/monitor.py", line 225, in monitor_func
ret = func()
File "/data-juicer/data_juicer/ops/base_op.py", line 377, in run
new_dataset = dataset.map(
File "/data-juicer/data_juicer/core/data/dj_dataset.py", line 401, in map
new_ds = NestedDataset(super().map(*args, **kargs))
File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 560, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3318, in map
for rank, done, content in Dataset._map_single(**unprocessed_kwargs):
File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3689, in _map_single
writer.write_batch(batch, try_original_type=try_original_type)
File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py", line 630, in write_batch
pa_table = pa.Table.from_arrays(arrays, schema=schema)
File "pyarrow/table.pxi", line 4868, in pyarrow.lib.Table.from_arrays
File "pyarrow/table.pxi", line 4214, in pyarrow.lib.Table.validate
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 1 named text expected length 648 but got length 36
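The columns end up with inconsistent row counts: according to the error, the text column has 36 rows where 648 were expected.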
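To illustrate the kind of mismatch the error describes, here is a minimal pure-Python sketch of a batched augmentation step that keeps every column the same length. It is not data-juicer's implementation: `augment_batch`, `toy_augment`, and `column_lengths` are hypothetical names, and the toy augmenter only stands in for a real one. The only assumption taken from the traceback is that the output batch is written as an Arrow table, where every column must have the same number of rows.

```python
from typing import Callable, Dict, List

Batch = Dict[str, List[str]]


def augment_batch(batch: Batch, text_key: str,
                  augment: Callable[[str], List[str]]) -> Batch:
    """Expand each row into the original plus its augmented variants.

    Every non-text field is repeated once per emitted row, so all
    columns in the returned batch have the same length.
    """
    keys = list(batch.keys())
    out: Batch = {k: [] for k in keys}
    for i, text in enumerate(batch[text_key]):
        variants = [text] + augment(text)
        for variant in variants:
            for k in keys:
                # Replace the text field, copy the other fields unchanged.
                out[k].append(variant if k == text_key else batch[k][i])
    return out


def column_lengths(batch: Batch) -> Dict[str, int]:
    """Return the row count of each column, for a quick consistency check."""
    return {k: len(v) for k, v in batch.items()}


def toy_augment(text: str) -> List[str]:
    # Hypothetical stand-in for a real augmenter: returns two variants.
    return [text + " (variant 1)", text + " (variant 2)"]


batch = {"info": ["文本", "文本"], "text": ["今天天气很好", "我们去公园玩"]}
result = augment_batch(batch, "text", toy_augment)
lengths = column_lengths(result)
```

If only one column were expanded while the others kept their original length (or were expanded by a different factor), `column_lengths` would report different counts per column, which is the condition `pa.Table.from_arrays` rejects in the traceback above.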