dbt-spark icon indicating copy to clipboard operation
dbt-spark copied to clipboard

[ADAP-931] [Bug] Values in seeds that should convert to `null` aren't working for `session` connection method

Open vakarisbk opened this issue 1 year ago • 3 comments

Is this a new bug in dbt-spark?

  • [X] I believe this is a new bug in dbt-spark
  • [X] I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Null values in csv seeds appear as string "None" or empty string in the materialized table when using Spark Session connection or the Spark Connect method in development on #899

Expected Behavior

Null values in seeds should be materialized as null values.

Steps To Reproduce

  1. Create a seed with null values

seed_null.csv:

type,value1,value2
empty,,
doublequotes,""," "
ticks,'',''
spaces, ,
stringnone,None,None
stringnull,Null,Null
nothing
  1. Create a model for testing null values:

test_null.sql:

SELECT
  type,
  value1,
  CASE
    WHEN value1 IS NULL THEN 'Actual NULL'
    WHEN value1 = 'None' THEN 'String "None"'
    WHEN value1 = '' THEN 'Empty String'
    ELSE 'Other'
  END AS value1_type,
  value2,
  CASE
    WHEN value2 IS NULL THEN 'Actual NULL'
    WHEN value2 = 'None' THEN 'String "None"'
    WHEN value2 = '' THEN 'Empty String'
    ELSE 'Other'
  END as value2_type
FROM {{ ref('seed_null') }}
  1. Run the model:
  2. Query the created table:
spark.sql("SELECT * from test_null").show()

+------------+------+-------------+------+-------------+
|        type|value1|  value1_type|value2|  value2_type|
+------------+------+-------------+------+-------------+
|     nothing|  None|String "None"|  None|String "None"|
|doublequotes|  None|String "None"|  None|String "None"|
|       empty|  None|String "None"|  None|String "None"|
|  stringnone|  None|String "None"|  None|String "None"|
|  stringnull|  None|String "None"|  None|String "None"|
|      spaces|  None|String "None"|  None|String "None"|
|       ticks|      | Empty String|      | Empty String|
+------------+------+-------------+------+-------------+

Relevant log output

No response

Environment

- OS:
- Python: 3.11.5
- dbt-core: 1.6.5
- dbt-spark: 1.6.0

Additional Context

No response

vakarisbk avatar Oct 04 '23 16:10 vakarisbk

Thanks for reaching out @vakarisbk ! And thanks for providing such a well-constructed example 🏆

Could you try this?

dbt build -s +test_null --full-refresh
dbt show --inline "select * from {{ ref('test_null') }} order by type" --limit 10

This worked for me, and here's the output that I got:

16:35:50  Running with dbt=1.6.5
16:35:50  Registered adapter: spark=1.6.2
16:35:50  Unable to do partial parsing because a project config has changed
16:35:51  Found 1 model, 1 seed, 0 sources, 0 exposures, 0 metrics, 398 macros, 0 groups, 0 semantic models
16:35:51  
16:35:55  Concurrency: 1 threads (target='http')
16:35:55  
16:35:55  1 of 2 START seed file dbt_dbeatty.seed_null ................................... [RUN]
16:36:03  1 of 2 OK loaded seed file dbt_dbeatty.seed_null ............................... [CREATE 7 in 7.76s]
16:36:03  2 of 2 START sql view model dbt_dbeatty.test_null .............................. [RUN]
16:36:05  2 of 2 OK created sql view model dbt_dbeatty.test_null ......................... [OK in 2.42s]
16:36:06  
16:36:06  Finished running 1 seed, 1 view model in 0 hours 0 minutes and 15.45 seconds (15.45s).
16:36:06  
16:36:06  Completed successfully
16:36:06  
16:36:06  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
16:36:08  Running with dbt=1.6.5
16:36:08  Registered adapter: spark=1.6.2
16:36:08  Found 1 model, 1 seed, 0 sources, 0 exposures, 0 metrics, 398 macros, 0 groups, 0 semantic models
16:36:08  
16:36:11  Concurrency: 1 threads (target='http')
16:36:11  
16:36:13  Previewing inline node:
| type         | value1 | value1_type   | value2 | value2_type   |
| ------------ | ------ | ------------- | ------ | ------------- |
| doublequotes |        | Actual NULL   |        | Actual NULL   |
| empty        |        | Actual NULL   |        | Actual NULL   |
| nothing      |        | Actual NULL   |        | Actual NULL   |
| spaces       |        | Actual NULL   |        | Actual NULL   |
| stringnone   | None   | String "None" | None   | String "None" |
| stringnull   |        | Actual NULL   |        | Actual NULL   |
| ticks        | ''     | Other         | ''     | Other         |

Since this looks like it's working, I'm going to close this as "can't reproduce". But please let me know if you still think this is an issue and we can consider re-opening this.

dbeatty10 avatar Feb 07 '24 16:02 dbeatty10

@dbeatty10 thanks for your response. I can see from your logs that you're using target=http which I assume may be connection to a Databricks cluster. This issue only appears in the session connection (local pyspark).

I just re-ran the tests and got my originally reported results back.

vakarisbk avatar Feb 07 '24 17:02 vakarisbk

Thank you for that insight @vakarisbk -- will re-open this!

Nice eyes on http vs. session 🦅

Reprex

Using the example here, the following commands do give the expected output when using the ✅ http connection method, but give unexpected output when using the ❌ session method:

dbt build -s +test_null --full-refresh
dbt show --inline "select * from {{ ref('test_null') }} order by type" --limit 10

Acceptance criteria

  • There are tests within dbt-adapters for various seed values that are expected to translate to null within the database.
  • These tests are inherited by the dbt-spark adapter.
  • These tests pass when the session connection method is used.

Summary

We are unable to prioritize this ourselves, so I'm going to lable this as "help wanted".

dbeatty10 avatar Feb 09 '24 15:02 dbeatty10