Config validation for PreProcessor's split_by parameter fails when set to null
Describe the bug
Schema validation fails for a PreProcessor component, whether it is defined in a YAML config or in a config dictionary, when the parameter split_by is set to null/None.
The parameter split_by is meant to accept the values "word", "sentence", "passage", or None (to disable splitting).
Error message
ValidationError: {'name': 'my_preprocessor', 'type': 'PreProcessor', 'params': {'split_by': None}} is not valid under any of the given schemas
Expected behavior
No validation error should occur when split_by is set to null/None.
Additional context
I fixed the problem locally by adding null to the split_by param's enum:
"split_by": {
    "title": "Split By",
    "default": "word",
    "enum": [
        "word",
        "sentence",
        "passage",
        null
    ],
    "anyOf": [
        {
            "type": "string"
        },
        {
            "type": "null"
        }
    ]
}
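To illustrate why adding null to the enum matters, here is a minimal sketch of how the two constraints interact. It is a hand-rolled check that covers only the `enum` and `anyOf`/`type` keywords, not Haystack's actual validation code. The `is_valid` helper and the `before_fix` schema are assumptions made for illustration, since the schema as shipped is not shown in this report.

```python
# Hand-rolled check covering only the "enum" and "anyOf"/"type" keywords.
# This is a stand-in for a real JSON Schema validator, for illustration only.
JSON_TYPES = {"string": str, "null": type(None)}


def is_valid(value, schema):
    """Return True if value satisfies both the enum and the anyOf constraints."""
    # "enum" restricts the value to an explicit list of allowed values.
    if "enum" in schema and value not in schema["enum"]:
        return False
    # "anyOf" requires at least one of the listed type subschemas to match.
    if "anyOf" in schema:
        return any(
            isinstance(value, JSON_TYPES[sub["type"]]) for sub in schema["anyOf"]
        )
    return True


any_of = [{"type": "string"}, {"type": "null"}]

# Assumed shape of the schema before the local fix: no null in the enum.
before_fix = {"enum": ["word", "sentence", "passage"], "anyOf": any_of}
# Shape of the schema after the local fix shown above.
after_fix = {"enum": ["word", "sentence", "passage", None], "anyOf": any_of}

print(is_valid(None, before_fix))    # False
print(is_valid(None, after_fix))     # True
print(is_valid("word", after_fix))   # True
print(is_valid("token", after_fix))  # False
```

Both keywords must pass, so allowing the null type in anyOf is not enough on its own: the enum also has to list null before None is accepted.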
I've briefly looked but haven't found other parameters on other components where this bug exists.
To Reproduce
Run this code:
from haystack.pipelines.utils import validate_schema

CONF = {
    'version': '1.17.2',
    'components': [
        {'name': 'document_store', 'type': 'InMemoryDocumentStore'},
        {'name': 'json_converter', 'type': 'JsonConverter'},
        {
            'name': 'my_preprocessor',
            'type': 'PreProcessor',
            'params': {'split_by': None}
        }
    ],
    'pipelines': [
        {
            'name': 'indexing',
            'nodes': [
                {'name': 'json_converter', 'inputs': ['File']},
                {'name': 'my_preprocessor', 'inputs': ['json_converter']},
                {'name': 'document_store', 'inputs': ['my_preprocessor']}
            ]
        }
    ]
}

validate_schema(CONF)
FAQ Check
- [x] Have you had a look at our new FAQ page?
System:
- OS: Linux
- GPU/CPU:
- Haystack version (commit or version number): 1.17.2
- DocumentStore: InMemoryDocumentStore
- Reader: NA
- Retriever: NA
Hey @E-dC, could you try to reproduce this with the latest version? If it works, I'd consider it fixed.