data-juicer
[Bug]: The validator's field-type check reads `field_types` from the YAML as strings, so `isinstance` raises an exception during field-type validation
Before Reporting
- [x] I have pulled the latest code of main branch to run again and the bug still exists.
- [x] I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template)
Search before reporting
- [x] I have searched the Data-Juicer issues and found no similar bugs.
OS
Linux
Installation Method
pip
Data-Juicer Version
latest
Python Version
3.10
Describe the bug
The YAML file configures the validator's `field_types` (taken from the official example config):
`validators: # validators are a list of validators to be applied when loading a dataset
# it checks a sample of the dataset for each validator
# check data_juicer/core/data/data_validator.py for more validator options
  - type: 'required_fields' # required_fields is a validator to check the required fields in the dataset.
    required_fields: # required_fields is a list of required fields.
      - "text"
    field_types: # field_types is a dictionary of field types.
      text: 'str'`
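In data_juicer/core/data/data_validator.py, the expected type for each field is looked up with `expected_type = self.field_types.get(field)`.
Because the values come straight from the YAML, `expected_type` is a string such as `'str'` or `'list'`, not a type object.
The check `invalid_types = [type(v) for v in sample_values if v is not None and not isinstance(v, expected_type)]` passes that string to `isinstance` without converting it to a type, which raises: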
TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union
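The failure can be reproduced without Data-Juicer. The snippet below is a minimal sketch, not the project's code: `field_types`, `sample_values`, and `find_invalid_types` are hypothetical stand-ins for what the validator holds after the YAML has been parsed, and only the list comprehension is taken from the report.

```python
# Stand-in for the parsed YAML: the value is the string 'str', not the type str.
field_types = {"text": "str"}
sample_values = ["hello", None, "world"]

expected_type = field_types.get("text")


def find_invalid_types(values, expected):
    # Same comprehension as quoted in the report (hypothetical wrapper function).
    return [type(v) for v in values if v is not None and not isinstance(v, expected)]


try:
    find_invalid_types(sample_values, expected_type)
    error_message = None
except TypeError as exc:
    # isinstance() rejects a string as its second argument.
    error_message = str(exc)

print(error_message)
```

Passing the real type `str` instead of the string `'str'` makes the same comprehension run without error.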
To Reproduce
Setting `field_types` in the validator section of the YAML config is enough to reproduce the issue.
Configs
No response
Logs
TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union
Screenshots
No response
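Additional
Converting `expected_type` from its string name to the corresponding type before the `isinstance` check is enough to resolve the issue.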
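One possible shape for that conversion is sketched below. It is an illustration under assumptions, not the project's actual fix: `TYPE_NAMES`, `resolve_type`, and `find_invalid_types` are hypothetical names, and the set of accepted type names is a guess, since the report does not say which names Data-Juicer should support.

```python
# Assumed whitelist of type names; the names the validator should accept
# are not specified in the report.
TYPE_NAMES = {
    "str": str,
    "int": int,
    "float": float,
    "bool": bool,
    "list": list,
    "dict": dict,
}


def resolve_type(expected):
    """Return a type object for a type or a known type name (hypothetical helper)."""
    if isinstance(expected, type):
        return expected
    try:
        return TYPE_NAMES[expected]
    except (KeyError, TypeError):
        # Unknown name, or a value that cannot be used as a dictionary key.
        raise ValueError(f"Unsupported field type: {expected!r}") from None


def find_invalid_types(values, expected):
    # Convert once, then run the same check as in the report.
    expected = resolve_type(expected)
    return [type(v) for v in values if v is not None and not isinstance(v, expected)]


print(find_invalid_types(["hello", None, 3], "str"))
```

An explicit mapping is used here rather than `eval` or a lookup in `builtins`, so that an arbitrary string in the config cannot resolve to something unintended. Unknown names fail early with a `ValueError` instead of surfacing later as a `TypeError` from `isinstance`.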