spark/databricks execution error AttributeError: 'NoneType' object has no attribute 'sql'

Open drsnyder opened this issue 1 year ago • 5 comments

Environment:

$ poetry show | grep soda
soda-core                              3.0.38        Soda Core library & CLI
soda-core-spark                        3.0.38
soda-core-spark-df                     3.0.38

Configuration (specifics redacted):

data_source databricks:
  type: spark_df
  catalog: spark_catalog
  schema: <schema>
  method: databricks
  host: ${DATABRICKS_HOST}
  http_path: <http_path>
  token: ${DATABRICKS_TOKEN}

Check file:

checks for <table>:
  - row_count > 0:
      name: The table is not empty
  - missing_count(column_name) = 0:
      name: Ensure there are no null values in the column_name column

I'm running the following command:

poetry run soda scan -V -d databricks -c soda/config/databricks.yml soda/checks/<table>/check.yml

And I'm getting the following error:

[10:30:27] Soda Core 3.0.38
[10:30:27] Reading configuration file "soda/config/databricks.yml"
[10:30:27] Reading SodaCL file "soda/checks/<table>/check.yml"
[10:30:27] Scan execution starts
[10:30:27] Query databricks.<table>.aggregation[0]:
SELECT
  COUNT(*),
  COUNT(CASE WHEN <column_name> IS NULL THEN 1 END)
FROM <schema>.<table>
[10:30:27] Query execution error in databricks.<table>.aggregation[0]: 'NoneType' object has no attribute 'sql'
SELECT
  COUNT(*),
  COUNT(CASE WHEN column_name IS NULL THEN 1 END)
FROM <schema>.<table>
  | 'NoneType' object has no attribute 'sql'
  | Stacktrace:
  | Traceback (most recent call last):
  |   File "<path>/.venv/lib/python3.10/site-packages/soda/execution/query/query.py", line 122, in fetchone
  |     cursor.execute(self.sql)
  |   File "<path>/.venv/lib/python3.10/site-packages/soda/data_sources/spark_df_cursor.py", line 15, in execute
  |     self.df = self.spark_session.sql(sqlQuery=sql)
  | AttributeError: 'NoneType' object has no attribute 'sql'

[10:30:27] Metrics 'row_count' were not computed for check 'row_count > 0'
[10:30:27] Metrics 'missing_count' were not computed for check 'missing_count(column_name) = 0'
[10:30:27] Scan summary:
[10:30:27] 1/1 query ERROR
[10:30:27]   databricks.<table>.aggregation[0] [ERROR] 0:00:00.000738
SELECT
  COUNT(*),
  COUNT(CASE WHEN column_name IS NULL THEN 1 END)
FROM <schema>.<table>
[10:30:27]     'NoneType' object has no attribute 'sql'
[10:30:27] 2/2 checks NOT EVALUATED:
[10:30:27]     <table> in databricks
[10:30:27]       The table is not empty [soda/checks/<table>/check.yml] [NOT EVALUATED]
[10:30:27]         check_value: None
[10:30:27]       Ensure there are no null values in the column_name column [soda/checks/<table>/check.yml] [NOT EVALUATED]
[10:30:27]         check_value: None
[10:30:27] 2 checks not evaluated.
[10:30:27] 3 errors.
[10:30:27] Oops! 3 errors. 0 failures. 0 warnings. 0 pass.
ERRORS:
[10:30:27] Query execution error in databricks.<table>.aggregation[0]: 'NoneType' object has no attribute 'sql'
SELECT
  COUNT(*),
  COUNT(CASE WHEN column_name IS NULL THEN 1 END)
FROM <schema>.<table>
  | 'NoneType' object has no attribute 'sql'
  | Stacktrace:
  | Traceback (most recent call last):
  |   File "<path>/.venv/lib/python3.10/site-packages/soda/execution/query/query.py", line 122, in fetchone
  |     cursor.execute(self.sql)
  |   File "<path>/.venv/lib/python3.10/site-packages/soda/data_sources/spark_df_cursor.py", line 15, in execute
  |     self.df = self.spark_session.sql(sqlQuery=sql)
  | AttributeError: 'NoneType' object has no attribute 'sql'

[10:30:27] Metrics 'row_count' were not computed for check 'row_count > 0'
[10:30:27] Metrics 'missing_count' were not computed for check 'missing_count(column_name) = 0'

After some debugging, it looks like the SparkDfCursor object is not being initialized properly. Its spark_session attribute is None, which causes the AttributeError raised in spark_df_cursor.py (line 15 in the traceback above).
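The failure mode can be reproduced without Spark or Soda installed. This is a minimal sketch that assumes a hypothetical stand-in class, CursorSketch (not Soda's actual SparkDfCursor), which makes the same call as the line shown in the traceback. The helper try_execute is also hypothetical.

```python
class CursorSketch:
    """Hypothetical stand-in for a cursor that wraps a Spark session."""

    def __init__(self, spark_session):
        # In the failing run, this is None because no session was ever supplied.
        self.spark_session = spark_session
        self.df = None

    def execute(self, sql):
        # Same call shape as the line in the traceback.
        self.df = self.spark_session.sql(sqlQuery=sql)


def try_execute(cursor, sql):
    """Run the query and return the error text, or None on success."""
    try:
        cursor.execute(sql)
    except AttributeError as exc:
        return str(exc)
    return None


error_message = try_execute(CursorSketch(spark_session=None), "SELECT COUNT(*) FROM t")
print(error_message)
```

If this reading is right, the error says nothing about the SQL itself. It only shows that no Spark session was attached before the query ran.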

Is this a bug or is there something wrong with my configuration (user error)?

drsnyder, May 25 '23 15:05

SODA-1742

jmarien, May 25 '23 15:05

Hey @drsnyder, the spark_df type is meant for use with Python/PySpark. Can you try specifying type: spark instead? Please see https://docs.soda.io/soda/connect-spark.html#connect-to-spark-for-databricks-sql

vijaykiran, May 25 '23 16:05

Hey @drsnyder, the spark_df type is meant for use with Python/PySpark. Can you try specifying type: spark instead? Please see https://docs.soda.io/soda/connect-spark.html#connect-to-spark-for-databricks-sql

If I use type: spark the client is unable to connect to the databricks server.

[11:04:13] Could not connect to data source "databricks": Encountered a problem while trying to connect to spark: Error during request to server
  | Encountered a problem while trying to connect to spark: Error during request to server

I'm not sure if it's related but the code seems to suggest that spark_df is the expected value. I've also tried using soda directly in a databricks notebook and in that context the following fails:

scan = Scan()
scan.set_scan_definition_name("test")
scan.set_data_source_name(f"spark") # <---
scan.add_spark_session(spark)

But changing the 3rd line to scan.set_data_source_name(f"spark_df") works.
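For reference, a fuller notebook-side version of that working variant might look like the sketch below. It assumes soda-core-spark-df is installed and that `spark` is the SparkSession the notebook provides. The function name run_notebook_scan and the CHECKS_TEMPLATE constant are hypothetical, and the check is copied from the check file above.

```python
# Check definition taken from the check file in this issue; {table} is filled in at call time.
CHECKS_TEMPLATE = """
checks for {table}:
  - row_count > 0:
      name: The table is not empty
"""


def run_notebook_scan(spark, table):
    """Run a Soda scan against an in-process SparkSession (hypothetical helper)."""
    # Imported lazily so this module still loads where soda-core-spark-df is absent.
    from soda.scan import Scan

    scan = Scan()
    scan.set_scan_definition_name("test")
    scan.set_data_source_name("spark_df")
    # Attach the live session under the same data source name.
    scan.add_spark_session(spark, data_source_name="spark_df")
    scan.add_sodacl_yaml_str(CHECKS_TEMPLATE.format(table=table))
    exit_code = scan.execute()
    print(scan.get_logs_text())
    return exit_code
```

In a notebook this would be called as run_notebook_scan(spark, "<table>"). It does not address the command-line case, where there is no in-process session to pass in.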

drsnyder, May 25 '23 16:05

If you are using it with Python or a notebook, then you should indeed use spark_df. The spark type is for connecting to Databricks via the Databricks Spark connector.
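That distinction can be turned into a quick pre-flight check on a data source configuration. The sketch below is illustrative only: diagnose_data_source is a hypothetical helper, not part of Soda, and the list of connection keys is an assumption taken from the configuration posted in this issue.

```python
# Connection keys assumed from the configuration shown in this issue.
REMOTE_KEYS = ("host", "http_path", "token")


def diagnose_data_source(config, has_spark_session):
    """Return a short hint about whether a data source config matches its usage."""
    ds_type = config.get("type")
    if ds_type == "spark_df":
        if not has_spark_session:
            # spark_df relies on a session supplied from Python code.
            return "spark_df: no SparkSession attached; queries will fail"
        return "spark_df: ok"
    if ds_type == "spark":
        missing = [key for key in REMOTE_KEYS if not config.get(key)]
        if missing:
            return "spark: missing connection keys: " + ", ".join(missing)
        return "spark: ok"
    return "unknown type: " + repr(ds_type)


# The configuration from this issue, run from the CLI with no session available.
issue_config = {
    "type": "spark_df",
    "method": "databricks",
    "host": "example-host",
    "http_path": "example-path",
    "token": "example-token",
}
cli_hint = diagnose_data_source(issue_config, has_spark_session=False)
print(cli_hint)
```

Under these assumptions, the configuration in this issue falls into the first branch: it declares spark_df but is run from the CLI, where nothing calls add_spark_session.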

vijaykiran, May 25 '23 16:05

If you are using it with Python or a notebook, then you should indeed use spark_df. The spark type is for connecting to Databricks via the Databricks Spark connector.

@vijaykiran That doesn't work for me inside databricks. As mentioned above, I have to use "spark".

This specific issue is separate: I'm trying to use Soda from the command line to validate a table in Databricks. It could be that I'm doing something wrong with the configuration (help!). There could also be an issue with the code.

drsnyder, May 30 '23 21:05