dbt-spark
dbt-spark copied to clipboard
[ADAP-931] [Bug] Values in seeds that should convert to `null` aren't working for `session` connection method
Is this a new bug in dbt-spark?
- [X] I believe this is a new bug in dbt-spark
- [X] I have searched the existing issues, and I could not find an existing issue for this bug
Current Behavior
Null values in csv seeds appear as string "None" or empty string in the materialized table when using Spark Session connection or the Spark Connect method in development on #899
Expected Behavior
Null values in seeds should be materialized as null values.
Steps To Reproduce
- Create a seed with null values
seed_null.csv
:
type,value1,value2
empty,,
doublequotes,""," "
ticks,'',''
spaces, ,
stringnone,None,None
stringnull,Null,Null
nothing
- Create a model for testing null values:
test_null.sql
:
SELECT
type,
value1,
CASE
WHEN value1 IS NULL THEN 'Actual NULL'
WHEN value1 = 'None' THEN 'String "None"'
WHEN value1 = '' THEN 'Empty String'
ELSE 'Other'
END AS value1_type,
value2,
CASE
WHEN value2 IS NULL THEN 'Actual NULL'
WHEN value2 = 'None' THEN 'String "None"'
WHEN value2 = '' THEN 'Empty String'
ELSE 'Other'
END as value2_type
FROM {{ ref('seed_null') }}
- Run the model:
- Query the created table:
spark.sql("SELECT * from test_null").show()
+------------+------+-------------+------+-------------+
| type|value1| value1_type|value2| value2_type|
+------------+------+-------------+------+-------------+
| nothing| None|String "None"| None|String "None"|
|doublequotes| None|String "None"| None|String "None"|
| empty| None|String "None"| None|String "None"|
| stringnone| None|String "None"| None|String "None"|
| stringnull| None|String "None"| None|String "None"|
| spaces| None|String "None"| None|String "None"|
| ticks| | Empty String| | Empty String|
+------------+------+-------------+------+-------------+
Relevant log output
No response
Environment
- OS:
- Python: 3.11.5
- dbt-core: 1.6.5
- dbt-spark: 1.6.0
Additional Context
No response
Thanks for reaching out @vakarisbk ! And thanks for providing such a well-constructed example 🏆
Could you try this?
dbt build -s +test_null --full-refresh
dbt show --inline "select * from {{ ref('test_null') }} order by type" --limit 10
This worked for me, and here's the output that I got:
16:35:50 Running with dbt=1.6.5
16:35:50 Registered adapter: spark=1.6.2
16:35:50 Unable to do partial parsing because a project config has changed
16:35:51 Found 1 model, 1 seed, 0 sources, 0 exposures, 0 metrics, 398 macros, 0 groups, 0 semantic models
16:35:51
16:35:55 Concurrency: 1 threads (target='http')
16:35:55
16:35:55 1 of 2 START seed file dbt_dbeatty.seed_null ................................... [RUN]
16:36:03 1 of 2 OK loaded seed file dbt_dbeatty.seed_null ............................... [CREATE 7 in 7.76s]
16:36:03 2 of 2 START sql view model dbt_dbeatty.test_null .............................. [RUN]
16:36:05 2 of 2 OK created sql view model dbt_dbeatty.test_null ......................... [OK in 2.42s]
16:36:06
16:36:06 Finished running 1 seed, 1 view model in 0 hours 0 minutes and 15.45 seconds (15.45s).
16:36:06
16:36:06 Completed successfully
16:36:06
16:36:06 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
16:36:08 Running with dbt=1.6.5
16:36:08 Registered adapter: spark=1.6.2
16:36:08 Found 1 model, 1 seed, 0 sources, 0 exposures, 0 metrics, 398 macros, 0 groups, 0 semantic models
16:36:08
16:36:11 Concurrency: 1 threads (target='http')
16:36:11
16:36:13 Previewing inline node:
| type | value1 | value1_type | value2 | value2_type |
| ------------ | ------ | ------------- | ------ | ------------- |
| doublequotes | | Actual NULL | | Actual NULL |
| empty | | Actual NULL | | Actual NULL |
| nothing | | Actual NULL | | Actual NULL |
| spaces | | Actual NULL | | Actual NULL |
| stringnone | None | String "None" | None | String "None" |
| stringnull | | Actual NULL | | Actual NULL |
| ticks | '' | Other | '' | Other |
Since this looks like it's working, I'm going to close this as "can't reproduce". But please let me know if you still think this is an issue and we can consider re-opening this.
@dbeatty10 thanks for your response.
I can see from your logs that you're using target=http
which I assume may be connection to a Databricks cluster. This issue only appears in the session
connection (local pyspark).
I just re-ran the tests and got my originally reported results back.
Thank you for that insight @vakarisbk -- will re-open this!
Nice eyes on http
vs. session
🦅
Reprex
Using the example here, the following commands do give the expected output when using the ✅ http
connection method, but give unexpected output when using the ❌ session
method:
dbt build -s +test_null --full-refresh
dbt show --inline "select * from {{ ref('test_null') }} order by type" --limit 10
Acceptance criteria
- There are tests within dbt-adapters for various seed values that are expected to translate to
null
within the database. - These tests are inherited by the dbt-spark adapter.
- These tests pass when the
session
connection method is used.
Summary
We are unable to prioritize this ourselves, so I'm going to lable this as "help wanted".