data-juicer [Bug]:

Open FailedNamed opened this issue 1 year ago • 0 comments

Before Reporting

[X] I have pulled the latest code of the main branch and run it again, and the bug still exists.
[X] I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)

Search before reporting

[X] I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

source

Data-Juicer Version

v0.2.0

Python Version

3.9.19

Describe the bug

Running python -m tests.core.test_adapter fails with an error.

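To Reproduce

Run python -m tests.core.test_adapter from the project root directory.

An error is raised. After tracing it, the problem appears to be in this code in Filter.run:

    dataset = dataset.map(add_same_content_to_new_column, fn_kwargs={ 'new_column_name': Fields.stats, 'initial_value': {} }, num_proc=self.runtime_np(), batch_size=self.batch_size, desc='Adding new column for stats')

The initial_value here looks wrong: it is an empty dict. As a result, in compute_stats of the PerplexityFilter operator, the loop for idx, stat in enumerate(samples_stats) is never entered, the computation is not executed, and KeyError: 'perplexity' is raised later on. Following other test examples, I replaced 'initial_value': {} with 'initial_value': [{}] * dataset.num_rows (PS: I am not sure whether it should be multiplied by the number of rows). With that change, the PerplexityFilter operator no longer raises an error.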
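To make the suspected mechanism concrete, here is a minimal pure-Python sketch. It does not use Data-Juicer or datasets: STATS_KEY, compute_stats_batched and process are hypothetical stand-ins for Fields.stats, PerplexityFilter.compute_stats and PerplexityFilter.process, and the string length is only a placeholder for the real perplexity score. It assumes the stats function iterates over the stats column as a list while the filter step reads a single dict.

```python
STATS_KEY = "stats"  # hypothetical stand-in for Fields.stats


def compute_stats_batched(samples):
    """Batched-style stats: iterates over the stats column as a list of dicts."""
    for idx, stat in enumerate(samples[STATS_KEY]):
        if "perplexity" in stat:
            continue
        # Placeholder metric (text length), not a real perplexity score.
        stat["perplexity"] = float(len(samples["text"][idx]))
    return samples


def process(sample, max_ppl=1500.0):
    """Single-sample filter step: reads the stat computed earlier."""
    return sample[STATS_KEY]["perplexity"] <= max_ppl


# Case 1: stats is an empty dict, so enumerate({}) yields nothing
# and no stat is ever written.
single = {"text": "hello world", STATS_KEY: {}}
compute_stats_batched(single)
try:
    process(single)
    error = None
except KeyError as exc:
    error = exc  # the missing 'perplexity' key surfaces here

# Case 2: stats is a list with one dict per row, so the loop body runs.
batch = {"text": ["hello world"], STATS_KEY: [{}]}
compute_stats_batched(batch)
```

In this sketch the empty dict makes the loop a silent no-op, so the failure only shows up later in the filter step, which would match the KeyError in the log. Whether the real cause is the initial value or a mismatch between batched and single-sample processing is something the maintainers would need to confirm.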
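Continuing the run, the PerplexityFilter operator no longer fails, but the DocumentDeduplicator operator does. The error is roughly:

    File "/root/data-juicer/data-juicer/data_juicer/ops/deduplicator/document_deduplicator.py", line 63, in _get_hash
        return hashlib.md5(txt.strip().encode('utf-8')).hexdigest()
    AttributeError: 'list' object has no attribute 'strip'

Looking at the code, this happens because the preceding FixUnicodeMapper operator leaves samples[self.text_key] as a list after processing the data:

    samples[self.text_key] = list( map( lambda text: ftfy.fix_text(text, normalization=self.normalization), samples[self.text_key]))

DocumentDeduplicator then fails when _get_hash handles that list. Looking at other mapper operators, the output samples[self.text_key] seems to come in several forms (list, dict, and string), but strip only works on strings. Is the compatibility between these operators not handled well enough, and do other operators have similar problems?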
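The list-versus-string mismatch can be reproduced without Data-Juicer. The sketch below is an illustration only: fix_text_batched is a hypothetical stand-in for a batched mapper (it replaces non-breaking spaces instead of calling ftfy), get_hash mirrors the _get_hash line from the traceback, and get_hash_tolerant is a hypothetical guard, not something the library provides.

```python
import hashlib


def fix_text_batched(samples, text_key="text"):
    """Batched-style mapper: the text column is a list of strings."""
    samples[text_key] = list(
        map(lambda text: text.replace("\u00a0", " "), samples[text_key]))
    return samples


def get_hash(txt):
    """Single-sample hashing: only valid when txt is a str."""
    return hashlib.md5(txt.strip().encode("utf-8")).hexdigest()


def get_hash_tolerant(txt):
    """Hypothetical guard: hash each item when handed a whole batch."""
    if isinstance(txt, list):
        return [get_hash(item) for item in txt]
    return get_hash(txt)


out = fix_text_batched({"text": ["a\u00a0b ", "c"]})

try:
    get_hash(out["text"])  # passes a list where a str is expected
    error = None
except AttributeError as exc:
    error = exc  # 'list' object has no attribute 'strip'

hashes = get_hash_tolerant(out["text"])  # one digest per row
```

A guard like this only hides the symptom. If the operators are meant to run in different modes (batched versus per-sample), the caller probably needs to feed each one the shape it expects.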
Could you please take a look when you have time? Thanks!

Configs

No response

Logs

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/data-juicer/data-juicer/data_juicer/core/monitor.py", line 17, in resource_monitor
    if mdict['stop']:
  File "<string>", line 2, in __getitem__
  File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

ERROR: test_execute_and_probe (__main__.AdapterTest)

Traceback (most recent call last):
  File "/root/data-juicer/data-juicer/tests/core/test_adapter.py", line 126, in test_execute_and_probe
    resource_util_list = Adapter.execute_and_probe(ds, ops)
  File "/root/data-juicer/data-juicer/data_juicer/core/adapter.py", line 42, in execute_and_probe
    dataset, resource_util_per_op = Monitor.monitor_func(
  File "/root/data-juicer/data-juicer/data_juicer/core/monitor.py", line 201, in monitor_func
    ret = func()
  File "/root/data-juicer/data-juicer/data_juicer/ops/base_op.py", line 318, in run
    new_dataset = dataset.filter(self.process,
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/fingerprint.py", line 482, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3709, in filter
    indices = self.map(
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3156, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3547, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 6477, in get_indices_from_mask_function
    mask.append(function(example, *additional_args, **fn_kwargs))
  File "/root/data-juicer/data-juicer/data_juicer/core/data.py", line 72, in wrapped_f
    return f(*args, **kargs)
  File "/root/data-juicer/data-juicer/data_juicer/ops/filter/perplexity_filter.py", line 87, in process
    return samples[Fields.stats][StatsKeys.perplexity] <= self.max_ppl
KeyError: 'perplexity'

Screenshots

No response

Additional

No response

Sep 29 '24 09:09 FailedNamed
