haystack
haystack copied to clipboard
Schema validation does not allow for null values in case of `Optional` typed params
Describe the bug
Setting optional params to null
in YAML (or to None
when it's been converted to a python dict) will cause the schema validation to fail: see https://github.com/deepset-ai/haystack/issues/2586
Error message
PipelineSchemaError: Node of type XYZ found, but it failed validation. Possible causes:
- The node is missing some mandatory parameter
- Wrong indentation of some parameter in YAML
See the stacktrace for more information.```
Expected behavior
null
values are accepted for Optional
typed params.
To Reproduce Load following pipeline:
version: ignore
components:
- name: ESRetriever
type: BM25Retriever
params:
document_store: DocumentStore
custom_query: null
- name: DocumentStore
type: ElasticsearchDocumentStore
params:
index: haystack_test
label_index: haystack_test_label
pipelines:
- name: query_pipeline
nodes:
- name: ESRetriever
inputs: [Query]
Extended stacktrace:
>>> Pipeline.load_from_yaml('/home/tstad/git/haystack/test/samples/pipeline/bad-pipeline.yaml')
<haystack.pipelines.base.Pipeline object at 0x7fcfcd78d810>
>>> Pipeline.load_from_yaml('/home/tstad/git/haystack/test/samples/pipeline/bad-pipeline.yaml')
INFO - haystack.pipelines.config - Missing definition for node of type BM25Retriever. Looking into local classes...
INFO - haystack.nodes._json_schema - Creating schema for 'BM25Retriever'
INFO - haystack.nodes._json_schema - Added definition for BM25Retriever
Traceback (most recent call last):
File "/home/tstad/git/haystack/haystack/pipelines/config.py", line 259, in validate_schema
Draft7Validator(schema).validate(instance=pipeline_config)
File "/home/tstad/miniconda3/envs/haystack-dev/lib/python3.7/site-packages/jsonschema/validators.py", line 254, in validate
raise error
jsonschema.exceptions.ValidationError: {'name': 'ESRetriever', 'type': 'BM25Retriever', 'params': {'document_store': 'DocumentStore', 'custom_query': None}} is not valid under any of the given schemas
Failed validating 'anyOf' in schema['properties']['components']['items']:
{'anyOf': [{'$ref': '#/definitions/DeepsetCloudDocumentStoreComponent'},
{'$ref': '#/definitions/ElasticsearchDocumentStoreComponent'},
{'$ref': '#/definitions/FAISSDocumentStoreComponent'},
{'$ref': '#/definitions/GraphDBKnowledgeGraphComponent'},
{'$ref': '#/definitions/InMemoryDocumentStoreComponent'},
{'$ref': '#/definitions/Milvus2DocumentStoreComponent'},
{'$ref': '#/definitions/OpenDistroElasticsearchDocumentStoreComponent'},
{'$ref': '#/definitions/OpenSearchDocumentStoreComponent'},
{'$ref': '#/definitions/PineconeDocumentStoreComponent'},
{'$ref': '#/definitions/SQLDocumentStoreComponent'},
{'$ref': '#/definitions/WeaviateDocumentStoreComponent'},
{'$ref': '#/definitions/AzureConverterComponent'},
{'$ref': '#/definitions/BM25RetrieverComponent'},
{'$ref': '#/definitions/CrawlerComponent'},
{'$ref': '#/definitions/DensePassageRetrieverComponent'},
{'$ref': '#/definitions/Docs2AnswersComponent'},
{'$ref': '#/definitions/DocxToTextConverterComponent'},
{'$ref': '#/definitions/ElasticsearchFilterOnlyRetrieverComponent'},
{'$ref': '#/definitions/ElasticsearchRetrieverComponent'},
{'$ref': '#/definitions/EmbeddingRetrieverComponent'},
{'$ref': '#/definitions/EntityExtractorComponent'},
{'$ref': '#/definitions/EvalAnswersComponent'},
{'$ref': '#/definitions/EvalDocumentsComponent'},
{'$ref': '#/definitions/FARMReaderComponent'},
{'$ref': '#/definitions/FileTypeClassifierComponent'},
{'$ref': '#/definitions/FilterRetrieverComponent'},
{'$ref': '#/definitions/ImageToTextConverterComponent'},
{'$ref': '#/definitions/JoinAnswersComponent'},
{'$ref': '#/definitions/JoinDocumentsComponent'},
{'$ref': '#/definitions/MarkdownConverterComponent'},
{'$ref': '#/definitions/PDFToTextConverterComponent'},
{'$ref': '#/definitions/PDFToTextOCRConverterComponent'},
{'$ref': '#/definitions/ParsrConverterComponent'},
{'$ref': '#/definitions/PreProcessorComponent'},
{'$ref': '#/definitions/QuestionGeneratorComponent'},
{'$ref': '#/definitions/RAGeneratorComponent'},
{'$ref': '#/definitions/RCIReaderComponent'},
{'$ref': '#/definitions/RouteDocumentsComponent'},
{'$ref': '#/definitions/SentenceTransformersRankerComponent'},
{'$ref': '#/definitions/Seq2SeqGeneratorComponent'},
{'$ref': '#/definitions/SklearnQueryClassifierComponent'},
{'$ref': '#/definitions/TableReaderComponent'},
{'$ref': '#/definitions/TableTextRetrieverComponent'},
{'$ref': '#/definitions/Text2SparqlRetrieverComponent'},
{'$ref': '#/definitions/TextConverterComponent'},
{'$ref': '#/definitions/TfidfRetrieverComponent'},
{'$ref': '#/definitions/TikaConverterComponent'},
{'$ref': '#/definitions/TransformersDocumentClassifierComponent'},
{'$ref': '#/definitions/TransformersQueryClassifierComponent'},
{'$ref': '#/definitions/TransformersReaderComponent'},
{'$ref': '#/definitions/TransformersSummarizerComponent'},
{'$ref': '#/definitions/TransformersTranslatorComponent'},
{'$ref': '#/definitions/BM25RetrieverComponent'}]}
On instance['components'][0]:
{'name': 'ESRetriever',
'params': {'custom_query': None, 'document_store': 'DocumentStore'},
'type': 'BM25Retriever'}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/tstad/git/haystack/haystack/pipelines/base.py", line 1334, in load_from_yaml
strict_version_check=strict_version_check,
File "/home/tstad/git/haystack/haystack/pipelines/base.py", line 1391, in load_from_config
validate_config(pipeline_config, strict_version_check=strict_version_check)
File "/home/tstad/git/haystack/haystack/pipelines/config.py", line 203, in validate_config
validate_schema(pipeline_config=pipeline_config, strict_version_check=strict_version_check)
File "/home/tstad/git/haystack/haystack/pipelines/config.py", line 283, in validate_schema
) from validation
haystack.errors.PipelineSchemaError: Node of type BM25Retriever found, but it failed validation. Possible causes:
- The node is missing some mandatory parameter
- Wrong indentation of some parameter in YAML
See the stacktrace for more information.
Maybe extending the param types with an anyOf like in https://stackoverflow.com/questions/22565005/json-schema-validate-a-number-or-null-value could help.
Note: the suggestion above is valid, so in order to make optional parameter accept a null
, we should write the schema as:
"anyOf": [
{"type": "string"},
{"type": "null"}
]
However, the node's parameter's schema is currently build entirely by Pydantic: https://github.com/deepset-ai/haystack/blob/0e83535108d0742735ecc4a8c8d93f555839a54b/haystack/nodes/_json_schema.py#L179
So at a first glance it seems like we can't easily change the output by just staying within Haystack's code.
Next steps:
- Dig more into this method call to see if we can influence it in any way to return the above signature, or modify the output downstream somehow,
- See if this is worth a PR on Pydantic in case the issue can only be solved with monkeypatching.
I tried to investigate this problem, but it's not easy :smile:.
- First, in the YAML example reported by @tstadel,
custom_query
isnull
but now, after #2789, this parameter has a special validation, which doesn't raise thePipelineSchemaError
. So, I propose another YAML configuration for experimenting (I hope that it is correct):
version: ignore
components:
- name: MyReader
type: FARMReader
params:
model_name_or_path: deepset/tinyroberta-squad2
model_version: null
pipelines:
- name: example_pipeline
nodes:
- name: MyReader
inputs: [Query]
- In the pydantic community, this problem is old and difficult; they want to solve this issue in v2 (probably by the end of 2022). More information and some non-definitive monkey patches here: https://github.com/samuelcolvin/pydantic/issues/1270
Hey @anakin87! Thanks for jumping it! :blush:
About point 1 you're right: custom_query
is no more a good experiment subject here. Any other Optional[Any]
value should reproduce this bug though, so your example seems to match the issue.
About point 2: Wow the issue you link is really deep. I did not expect this to be so rooted into Pydantic... I think we could live with the monkeypatch while waiting for v2 to be out. The JSON schema generation in Haystack is not pretty anyway :smile: So I'd propose the following course of action, if you're up for it:
- Write a test on this issue that fails with the current code
- Add the monkeypatch in
haystack/nodes/_json_schema.py
(you'll notice is not the first one :sweat_smile:) - See if the test pass and adjust at need.
- Once it's all green again, add a big disclaimed over the monkeypatch linking to the issue, and one more comment over the test mentioning that it should be removed (or at least reviewed) during the migration to Pydantic 2
- If you want to make it really fancy, you can even make the test fail if the Pydantic version is above 2.0, so we will be sure not to forget that :grin: But that's not necessary, a comment would suffice too
If you get stuck do not insist too much though. For as weird as it is, this issue is not mission critical. If we have to wait for Pydantic v2, so be it.
Have fun!