haystack icon indicating copy to clipboard operation
haystack copied to clipboard

Schema validation does not allow for null values in case of `Optional` typed params

Open tstadel opened this issue 2 years ago • 5 comments

Describe the bug Setting optional params to null in YAML (or to None when it's been converted to a python dict) will cause the schema validation to fail: see https://github.com/deepset-ai/haystack/issues/2586

Error message

PipelineSchemaError: Node of type XYZ found, but it failed validation. Possible causes:
 - The node is missing some mandatory parameter
 - Wrong indentation of some parameter in YAML
See the stacktrace for more information.```

Expected behavior null values are accepted for Optional typed params.

To Reproduce Load following pipeline:

version: ignore

components:
  - name: ESRetriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      custom_query: null
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      index: haystack_test
      label_index: haystack_test_label


pipelines:
  - name: query_pipeline
    nodes:
      - name: ESRetriever
        inputs: [Query]

tstadel avatar May 23 '22 11:05 tstadel

Extended stacktrace:

>>> Pipeline.load_from_yaml('/home/tstad/git/haystack/test/samples/pipeline/bad-pipeline.yaml')
<haystack.pipelines.base.Pipeline object at 0x7fcfcd78d810>
>>> Pipeline.load_from_yaml('/home/tstad/git/haystack/test/samples/pipeline/bad-pipeline.yaml')
INFO - haystack.pipelines.config -  Missing definition for node of type BM25Retriever. Looking into local classes...
INFO - haystack.nodes._json_schema -  Creating schema for 'BM25Retriever'
INFO - haystack.nodes._json_schema -  Added definition for BM25Retriever
Traceback (most recent call last):
  File "/home/tstad/git/haystack/haystack/pipelines/config.py", line 259, in validate_schema
    Draft7Validator(schema).validate(instance=pipeline_config)
  File "/home/tstad/miniconda3/envs/haystack-dev/lib/python3.7/site-packages/jsonschema/validators.py", line 254, in validate
    raise error
jsonschema.exceptions.ValidationError: {'name': 'ESRetriever', 'type': 'BM25Retriever', 'params': {'document_store': 'DocumentStore', 'custom_query': None}} is not valid under any of the given schemas

Failed validating 'anyOf' in schema['properties']['components']['items']:
    {'anyOf': [{'$ref': '#/definitions/DeepsetCloudDocumentStoreComponent'},
               {'$ref': '#/definitions/ElasticsearchDocumentStoreComponent'},
               {'$ref': '#/definitions/FAISSDocumentStoreComponent'},
               {'$ref': '#/definitions/GraphDBKnowledgeGraphComponent'},
               {'$ref': '#/definitions/InMemoryDocumentStoreComponent'},
               {'$ref': '#/definitions/Milvus2DocumentStoreComponent'},
               {'$ref': '#/definitions/OpenDistroElasticsearchDocumentStoreComponent'},
               {'$ref': '#/definitions/OpenSearchDocumentStoreComponent'},
               {'$ref': '#/definitions/PineconeDocumentStoreComponent'},
               {'$ref': '#/definitions/SQLDocumentStoreComponent'},
               {'$ref': '#/definitions/WeaviateDocumentStoreComponent'},
               {'$ref': '#/definitions/AzureConverterComponent'},
               {'$ref': '#/definitions/BM25RetrieverComponent'},
               {'$ref': '#/definitions/CrawlerComponent'},
               {'$ref': '#/definitions/DensePassageRetrieverComponent'},
               {'$ref': '#/definitions/Docs2AnswersComponent'},
               {'$ref': '#/definitions/DocxToTextConverterComponent'},
               {'$ref': '#/definitions/ElasticsearchFilterOnlyRetrieverComponent'},
               {'$ref': '#/definitions/ElasticsearchRetrieverComponent'},
               {'$ref': '#/definitions/EmbeddingRetrieverComponent'},
               {'$ref': '#/definitions/EntityExtractorComponent'},
               {'$ref': '#/definitions/EvalAnswersComponent'},
               {'$ref': '#/definitions/EvalDocumentsComponent'},
               {'$ref': '#/definitions/FARMReaderComponent'},
               {'$ref': '#/definitions/FileTypeClassifierComponent'},
               {'$ref': '#/definitions/FilterRetrieverComponent'},
               {'$ref': '#/definitions/ImageToTextConverterComponent'},
               {'$ref': '#/definitions/JoinAnswersComponent'},
               {'$ref': '#/definitions/JoinDocumentsComponent'},
               {'$ref': '#/definitions/MarkdownConverterComponent'},
               {'$ref': '#/definitions/PDFToTextConverterComponent'},
               {'$ref': '#/definitions/PDFToTextOCRConverterComponent'},
               {'$ref': '#/definitions/ParsrConverterComponent'},
               {'$ref': '#/definitions/PreProcessorComponent'},
               {'$ref': '#/definitions/QuestionGeneratorComponent'},
               {'$ref': '#/definitions/RAGeneratorComponent'},
               {'$ref': '#/definitions/RCIReaderComponent'},
               {'$ref': '#/definitions/RouteDocumentsComponent'},
               {'$ref': '#/definitions/SentenceTransformersRankerComponent'},
               {'$ref': '#/definitions/Seq2SeqGeneratorComponent'},
               {'$ref': '#/definitions/SklearnQueryClassifierComponent'},
               {'$ref': '#/definitions/TableReaderComponent'},
               {'$ref': '#/definitions/TableTextRetrieverComponent'},
               {'$ref': '#/definitions/Text2SparqlRetrieverComponent'},
               {'$ref': '#/definitions/TextConverterComponent'},
               {'$ref': '#/definitions/TfidfRetrieverComponent'},
               {'$ref': '#/definitions/TikaConverterComponent'},
               {'$ref': '#/definitions/TransformersDocumentClassifierComponent'},
               {'$ref': '#/definitions/TransformersQueryClassifierComponent'},
               {'$ref': '#/definitions/TransformersReaderComponent'},
               {'$ref': '#/definitions/TransformersSummarizerComponent'},
               {'$ref': '#/definitions/TransformersTranslatorComponent'},
               {'$ref': '#/definitions/BM25RetrieverComponent'}]}

On instance['components'][0]:
    {'name': 'ESRetriever',
     'params': {'custom_query': None, 'document_store': 'DocumentStore'},
     'type': 'BM25Retriever'}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tstad/git/haystack/haystack/pipelines/base.py", line 1334, in load_from_yaml
    strict_version_check=strict_version_check,
  File "/home/tstad/git/haystack/haystack/pipelines/base.py", line 1391, in load_from_config
    validate_config(pipeline_config, strict_version_check=strict_version_check)
  File "/home/tstad/git/haystack/haystack/pipelines/config.py", line 203, in validate_config
    validate_schema(pipeline_config=pipeline_config, strict_version_check=strict_version_check)
  File "/home/tstad/git/haystack/haystack/pipelines/config.py", line 283, in validate_schema
    ) from validation
haystack.errors.PipelineSchemaError: Node of type BM25Retriever found, but it failed validation. Possible causes:
 - The node is missing some mandatory parameter
 - Wrong indentation of some parameter in YAML
See the stacktrace for more information.

tstadel avatar May 23 '22 11:05 tstadel

Maybe extending the param types with an anyOf like in https://stackoverflow.com/questions/22565005/json-schema-validate-a-number-or-null-value could help.

tstadel avatar May 23 '22 11:05 tstadel

Note: the suggestion above is valid, so in order to make optional parameter accept a null, we should write the schema as:

"anyOf": [
  {"type": "string"},
  {"type": "null"}
]

However, the node's parameter's schema is currently build entirely by Pydantic: https://github.com/deepset-ai/haystack/blob/0e83535108d0742735ecc4a8c8d93f555839a54b/haystack/nodes/_json_schema.py#L179

So at a first glance it seems like we can't easily change the output by just staying within Haystack's code.

Next steps:

  1. Dig more into this method call to see if we can influence it in any way to return the above signature, or modify the output downstream somehow,
  2. See if this is worth a PR on Pydantic in case the issue can only be solved with monkeypatching.

ZanSara avatar May 23 '22 12:05 ZanSara

I tried to investigate this problem, but it's not easy :smile:.

  1. First, in the YAML example reported by @tstadel, custom_query is null but now, after #2789, this parameter has a special validation, which doesn't raise the PipelineSchemaError. So, I propose another YAML configuration for experimenting (I hope that it is correct):
version: ignore

components:
  - name: MyReader
    type: FARMReader
    params:
      model_name_or_path: deepset/tinyroberta-squad2
      model_version: null         

pipelines:
  - name: example_pipeline
    nodes:
      - name: MyReader
        inputs: [Query]  
  1. In the pydantic community, this problem is old and difficult; they want to solve this issue in v2 (probably by the end of 2022). More information and some non-definitive monkey patches here: https://github.com/samuelcolvin/pydantic/issues/1270

anakin87 avatar Jul 20 '22 22:07 anakin87

Hey @anakin87! Thanks for jumping it! :blush:

About point 1 you're right: custom_query is no more a good experiment subject here. Any other Optional[Any] value should reproduce this bug though, so your example seems to match the issue.

About point 2: Wow the issue you link is really deep. I did not expect this to be so rooted into Pydantic... I think we could live with the monkeypatch while waiting for v2 to be out. The JSON schema generation in Haystack is not pretty anyway :smile: So I'd propose the following course of action, if you're up for it:

  1. Write a test on this issue that fails with the current code
  2. Add the monkeypatch in haystack/nodes/_json_schema.py (you'll notice is not the first one :sweat_smile:)
  3. See if the test pass and adjust at need.
  4. Once it's all green again, add a big disclaimed over the monkeypatch linking to the issue, and one more comment over the test mentioning that it should be removed (or at least reviewed) during the migration to Pydantic 2
    • If you want to make it really fancy, you can even make the test fail if the Pydantic version is above 2.0, so we will be sure not to forget that :grin: But that's not necessary, a comment would suffice too

If you get stuck do not insist too much though. For as weird as it is, this issue is not mission critical. If we have to wait for Pydantic v2, so be it.

Have fun!

ZanSara avatar Jul 21 '22 10:07 ZanSara