
return_unexpected_index_query returns a broken query: double quotes around the filter expression are missing for SparkDFExecutionEngine

Open jyoti-thakkar opened this issue 1 month ago • 0 comments

**Describe the bug**
We set `result_format` to `"COMPLETE"` and `return_unexpected_index_query` to `true`, intending to use `return_unexpected_index_query` to retrieve the error records from the DataFrame. However, GX returns a broken query: the double quotes around the filter expression are missing.

Example:
- query returned by GX: `df.filter(F.expr(NOT(city IS NOT NULL)))`
- working query: `df.filter(F.expr("NOT(city IS NOT NULL)"))`
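Until this is fixed, the returned string can be patched after the fact by re-quoting the condition inside `F.expr(...)`. This is a hypothetical workaround sketch (the helper name and the regex are mine, not GX code), assuming the query always has the shape `df.filter(F.expr(<condition>))`:

```python
import re

def repair_unexpected_index_query(query: str) -> str:
    """Re-quote the condition inside F.expr(...) so the query is valid.

    GX emits e.g.   df.filter(F.expr(NOT(city IS NOT NULL)))
    but F.expr takes a string, so the working form is
                    df.filter(F.expr("NOT(city IS NOT NULL)"))

    Hypothetical helper for illustration only; assumes the query has
    exactly the shape df.filter(F.expr(<condition>)).
    """
    match = re.fullmatch(r"df\.filter\(F\.expr\((.*)\)\)", query)
    if match is None:
        raise ValueError(f"unrecognized query shape: {query!r}")
    condition = match.group(1)
    if condition.startswith('"') and condition.endswith('"'):
        return query  # already quoted, nothing to repair
    return f'df.filter(F.expr("{condition}"))'

broken = "df.filter(F.expr(NOT(city IS NOT NULL)))"
print(repair_unexpected_index_query(broken))
# df.filter(F.expr("NOT(city IS NOT NULL)"))
```

The repaired string can then be evaluated in a context where `df` and `pyspark.sql.functions as F` are in scope.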

**To Reproduce**

```python
import great_expectations as ge
from great_expectations.core import ExpectationSuite
from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import (
    DataContextConfig,
    DatasourceConfig,
    FilesystemStoreBackendDefaults,
)

expectation_suite_config = {
    "expectation_suite_name": "my_expectation_suite",
    "expectations": [
        # List of expectations
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {
                "column": "my_column",
                "result_format": {"result_format": "COMPLETE"},
            },
        }
    ],
}

my_expectation_suite = ExpectationSuite(**expectation_suite_config)

# Define the DataContext configuration
data_context_config = DataContextConfig(
    plugins_directory=None,
    config_variables_file_path=None,
    datasources={
        "my_spark_datasource": DatasourceConfig(
            class_name="Datasource",
            execution_engine={
                "class_name": "SparkDFExecutionEngine",
                "force_reuse_spark_context": True,
            },
            data_connectors={
                "spark_runtime_dataconnector": {
                    "class_name": "RuntimeDataConnector",
                    "module_name": "great_expectations.datasource.data_connector",
                    "batch_identifiers": ["batch_name"],
                },
            },
        )
    },
    store_backend_defaults=FilesystemStoreBackendDefaults(root_directory="/"),
)

batch_request = RuntimeBatchRequest(
    datasource_name="my_spark_datasource",
    data_connector_name="spark_runtime_dataconnector",
    data_asset_name="my_asset",
    runtime_parameters={"batch_data": df},
    batch_identifiers={"batch_name": "batch_run"},
)

context = ge.get_context(project_config=data_context_config)
batch_validator = context.get_validator(
    batch_request=batch_request, expectation_suite=my_expectation_suite
)
validation_result = batch_validator.validate()
print(validation_result)
```

`validation_result` contains an `unexpected_index_query` value of `"df.filter(F.expr(NOT(city IS NOT NULL)))"`.

When I execute this query, it fails with a syntax error (invalid syntax).
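The syntax error can be reproduced without Spark at all: the unquoted condition is not valid Python, while the quoted form parses fine. A minimal check using only the built-in `compile` (the names `df` and `F` would still need to exist at actual execution time):

```python
# The query GX returns vs. the form that actually works.
broken = "df.filter(F.expr(NOT(city IS NOT NULL)))"
working = 'df.filter(F.expr("NOT(city IS NOT NULL)"))'

def parses(src: str) -> bool:
    """Return True if src is syntactically valid Python."""
    try:
        compile(src, "<unexpected_index_query>", "eval")
        return True
    except SyntaxError:
        return False

print(parses(broken))   # False: bare NOT(city IS NOT NULL) is invalid syntax
print(parses(working))  # True: F.expr receives a proper string literal
```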

**Expected behavior**
Executing the returned query should yield the error records from the DataFrame, i.e. the query should be emitted as `df.filter(F.expr("NOT(city IS NOT NULL)"))`.

**Environment**

  • Operating System: Windows
  • Great Expectations Version: 0.18.13
  • Data Source: Spark DataFrame
  • Cloud environment: Databricks


jyoti-thakkar · May 17 '24 10:05