soda-core
spark/databricks execution error AttributeError: 'NoneType' object has no attribute 'sql'
Environment:
$ poetry show | grep soda [10:26:29]
soda-core 3.0.38 Soda Core library & CLI
soda-core-spark 3.0.38
soda-core-spark-df 3.0.38
Configuration (specifics redacted):
data_source databricks:
  type: spark_df
  catalog: spark_catalog
  schema: <schema>
  method: databricks
  host: ${DATABRICKS_HOST}
  http_path: <http_path>
  token: ${DATABRICKS_TOKEN}
Check file:
checks for <table>:
  - row_count > 0:
      name: The table is not empty
  - missing_count(column_name) = 0:
      name: Ensure there are no null values in the column_name column
I'm running the following command:
poetry run soda scan -V -d databricks -c soda/config/databricks.yml soda/checks/<table>/check.yml
And I'm getting the following error:
[10:30:27] Soda Core 3.0.38
[10:30:27] Reading configuration file "soda/config/databricks.yml"
[10:30:27] Reading SodaCL file "soda/checks/<table>/check.yml"
[10:30:27] Scan execution starts
[10:30:27] Query databricks.<table>.aggregation[0]:
SELECT
COUNT(*),
COUNT(CASE WHEN <column_name> IS NULL THEN 1 END)
FROM <schema>.<table>
[10:30:27] Query execution error in databricks.<table>.aggregation[0]: 'NoneType' object has no attribute 'sql'
SELECT
COUNT(*),
COUNT(CASE WHEN column_name IS NULL THEN 1 END)
FROM <schema>.<table>
| 'NoneType' object has no attribute 'sql'
| Stacktrace:
| Traceback (most recent call last):
| File "<path>/.venv/lib/python3.10/site-packages/soda/execution/query/query.py", line 122, in fetchone
| cursor.execute(self.sql)
| File "<path>/.venv/lib/python3.10/site-packages/soda/data_sources/spark_df_cursor.py", line 15, in execute
| self.df = self.spark_session.sql(sqlQuery=sql)
| AttributeError: 'NoneType' object has no attribute 'sql'
[10:30:27] Metrics 'row_count' were not computed for check 'row_count > 0'
[10:30:27] Metrics 'missing_count' were not computed for check 'missing_count(column_name) = 0'
[10:30:27] Scan summary:
[10:30:27] 1/1 query ERROR
[10:30:27] databricks.<table>.aggregation[0] [ERROR] 0:00:00.000738
SELECT
COUNT(*),
COUNT(CASE WHEN column_name IS NULL THEN 1 END)
FROM <schema>.<table>
[10:30:27] 'NoneType' object has no attribute 'sql'
[10:30:27] 2/2 checks NOT EVALUATED:
[10:30:27] <table> in databricks
[10:30:27] The table is not empty [soda/checks/<table>/check.yml] [NOT EVALUATED]
[10:30:27] check_value: None
[10:30:27] Ensure there are no null values in the column_name column [soda/checks/<table>/check.yml] [NOT EVALUATED]
[10:30:27] check_value: None
[10:30:27] 2 checks not evaluated.
[10:30:27] 3 errors.
[10:30:27] Oops! 3 errors. 0 failures. 0 warnings. 0 pass.
ERRORS:
[10:30:27] Query execution error in databricks.<table>.aggregation[0]: 'NoneType' object has no attribute 'sql'
SELECT
COUNT(*),
COUNT(CASE WHEN column_name IS NULL THEN 1 END)
FROM <schema>.<table>
| 'NoneType' object has no attribute 'sql'
| Stacktrace:
| Traceback (most recent call last):
| File "<path>/.venv/lib/python3.10/site-packages/soda/execution/query/query.py", line 122, in fetchone
| cursor.execute(self.sql)
| File "<path>/.venv/lib/python3.10/site-packages/soda/data_sources/spark_df_cursor.py", line 15, in execute
| self.df = self.spark_session.sql(sqlQuery=sql)
| AttributeError: 'NoneType' object has no attribute 'sql'
[10:30:27] Metrics 'row_count' were not computed for check 'row_count > 0'
[10:30:27] Metrics 'missing_count' were not computed for check 'missing_count(column_name) = 0'
After doing some debugging, it looks like the SparkDfCursor object is not being initialized properly: its spark_session attribute is None, which is what causes the error above.
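To illustrate the failure mode, here is a minimal, self-contained sketch (a paraphrase, not the actual soda-core source) of what the cursor in the stack trace does when no Spark session has been attached:

```python
# Paraphrase of the spark_df_cursor.py behavior from the stack trace above.
# In soda-core the session is normally supplied via Scan.add_spark_session();
# the CLI code path apparently never sets it, so it stays None.
class SparkDfCursor:
    def __init__(self, spark_session=None):
        self.spark_session = spark_session

    def execute(self, sql: str) -> None:
        # Raises AttributeError when spark_session is None.
        self.df = self.spark_session.sql(sqlQuery=sql)

cursor = SparkDfCursor()  # no session injected, as in the CLI scan
try:
    cursor.execute("SELECT COUNT(*) FROM some_table")
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'sql'
```

This reproduces the exact AttributeError from the scan output, which is consistent with spark_df being designed for a programmatic scan that injects a live SparkSession rather than a CLI-driven connection.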
Is this a bug or is there something wrong with my configuration (user error)?
SODA-1742
Hey @drsnyder, the spark_df type is for use with python/pyspark. Can you try specifying type: spark? Please see https://docs.soda.io/soda/connect-spark.html#connect-to-spark-for-databricks-sql
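For reference, that suggestion would amount to the following configuration (all keys except type copied from the reporter's config above; the exact fields supported for type: spark are in the linked docs):

```yaml
data_source databricks:
  type: spark            # was: spark_df
  catalog: spark_catalog
  schema: <schema>
  method: databricks
  host: ${DATABRICKS_HOST}
  http_path: <http_path>
  token: ${DATABRICKS_TOKEN}
```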
If I use type: spark, the client is unable to connect to the databricks server:
[11:04:13] Could not connect to data source "databricks": Encountered a problem while trying to connect to spark: Error during request to server
| Encountered a problem while trying to connect to spark: Error during request to server
I'm not sure if it's related, but the code seems to suggest that spark_df is the expected value. I've also tried using soda directly in a databricks notebook, and in that context the following fails:
from soda.scan import Scan

scan = Scan()
scan.set_scan_definition_name("test")
scan.set_data_source_name("spark")  # <--- fails
scan.add_spark_session(spark)  # `spark` is the notebook's SparkSession
But changing that call to scan.set_data_source_name("spark_df") works.
If you are using with Python/Notebook then indeed you should use spark_df, the spark type is for connecting to Databricks via Databricks spark connector.
@vijaykiran That doesn't work for me inside databricks. As mentioned above, I have to use "spark".
This specific issue is separate: I'm trying to use soda from the command line to validate a table in databricks. It could be that I'm doing something wrong with the configuration (help!), but there could also be an issue with the code.