dbt-databricks
Databricks truncates datatypes returned via `DESCRIBE EXTENDED` which is used by get_columns_in_relation()
Describe the bug
I can't say what the full impact of this bug is, but I encountered it while using on_schema_change="sync_all_columns".
The truncated data types are fed into the queries that build ALTER statements when a column's data type changes in a dataset.
Current Behaviour
This happens because running the command below truncates long data types:
DESCRIBE EXTENDED <catalog>.<schema>.<table>
Truncated field example using DESCRIBE EXTENDED
struct<_info:struct<fieldA:string,fieldB:string>,fieldC:bigint,fieldD:string>,... 78 more fields>
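A minimal sketch of how the truncation could be detected in a type string. The helper name `is_truncated_type` is hypothetical and not part of dbt-databricks, and the regex is inferred only from the example above ("... 78 more fields"), so other truncation formats may exist:

```python
import re

# Marker seen at the end of a shortened type string, e.g. "... 78 more fields".
# Assumption: the pattern is inferred from the single example in this issue.
_TRUNCATION_MARKER = re.compile(r"\.\.\.\s*\d+\s+more\s+fields?")


def is_truncated_type(data_type: str) -> bool:
    """Return True if the type string appears to have been truncated."""
    return _TRUNCATION_MARKER.search(data_type) is not None


example = (
    "struct<_info:struct<fieldA:string,fieldB:string>,"
    "fieldC:bigint,fieldD:string>,... 78 more fields>"
)
print(is_truncated_type(example))          # True
print(is_truncated_type("struct<a:int>"))  # False
```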
Requested Behaviour
Ideally, dbt-databricks would instead use the query below to get that information, since it does not truncate data types:
select
  column_name,
  full_data_type,
  comment
from <catalog>.information_schema.columns
where table_schema = '<schema>' and table_name = '<table>'
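For illustration, the query above could be assembled from a relation's parts roughly like this. `build_columns_query` is a hypothetical helper, not an existing dbt-databricks function, and the name check is a deliberately conservative assumption (plain alphanumeric and underscore names only) to avoid dealing with quoting rules:

```python
import re

# Assumption: only simple names are accepted, so no escaping is needed.
_SAFE_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def _check_name(name: str) -> str:
    """Reject names that would need quoting or escaping."""
    if not _SAFE_NAME.match(name):
        raise ValueError(f"unsupported name: {name!r}")
    return name


def build_columns_query(catalog: str, schema: str, table: str) -> str:
    """Build the information_schema lookup for one table's columns."""
    catalog, schema, table = (_check_name(n) for n in (catalog, schema, table))
    return (
        "select column_name, full_data_type, comment\n"
        f"from {catalog}.information_schema.columns\n"
        f"where table_schema = '{schema}' and table_name = '{table}'"
    )


print(build_columns_query("main", "analytics", "events"))
```

Names containing quotes, backticks, or dots raise `ValueError` here rather than being escaped; a real implementation would need proper identifier and literal quoting.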
Steps To Reproduce
- Have a very complex (long datatype) struct field in your dataset
- Run any operation in `dbt-databricks` that looks up the datatype of that field via `get_columns_in_relation()`
- You will observe the struct field you created has a truncated datatype
Expected Behaviour
- Have a very complex (long datatype) struct field in your dataset
- Run any operation in `dbt-databricks` that looks up the datatype of that field via `get_columns_in_relation()`
- You will observe the struct field you created does not have a truncated datatype
System information
Core:
- installed: 1.8.5
- latest: 1.8.5 - Up to date!
Plugins:
- databricks: 1.8.5 - Up to date!
- spark: 1.8.0 - Up to date!
Additional context
- Also causes problems when using dbt codegen, as it uses the adapter to look up the datatypes of columns
Thanks for reporting, will investigate
I need to reopen this issue. I tried to implement the suggested fix and discovered that there is often sync latency between UC and Delta that causes information_schema to be out of date. I can fix that by forcing a sync, but only if the table is Delta. The fix is more complicated than what I originally implemented, so I am reopening this issue.
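To make the constraint in the comment above concrete, here is one possible shape for the decision. This is a sketch under stated assumptions, not the maintainers' implementation: `ColumnSource` and `choose_column_source` are hypothetical names, and falling back to `DESCRIBE EXTENDED` for non-Delta tables is an assumption, since the thread does not say what happens in that case.

```python
from enum import Enum


class ColumnSource(Enum):
    DESCRIBE_EXTENDED = "describe_extended"
    INFORMATION_SCHEMA = "information_schema"


def choose_column_source(is_delta: bool, has_truncated_types: bool) -> ColumnSource:
    """Pick where to read column types from (hypothetical policy)."""
    if not has_truncated_types:
        # DESCRIBE EXTENDED already returned complete types.
        return ColumnSource.DESCRIBE_EXTENDED
    if is_delta:
        # Per the thread, a sync can be forced for Delta tables, so
        # information_schema can be brought up to date before it is queried.
        return ColumnSource.INFORMATION_SCHEMA
    # Assumption: without a way to force a sync, information_schema may be
    # stale, so keep the (truncated) DESCRIBE EXTENDED result.
    return ColumnSource.DESCRIBE_EXTENDED


print(choose_column_source(is_delta=True, has_truncated_types=True))
```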