dbt-databricks icon indicating copy to clipboard operation
dbt-databricks copied to clipboard

Persist column `meta` data in delta tables

Open CICDamen opened this issue 2 years ago • 13 comments

Describe the feature

The meta key/value pairs should be persisted and pushed to the delta table when materializing dbt models.

image

Describe alternatives you've considered

The alternative is to add a custom Python model and leverage PySpark to add this metadata to a specific column df.withMetadata("column_name", {"key_1": "value_2", "key_2": value_2}). However, I would like to keep this kind of details in the .yml configuration files and avoid using additional Python models to only add metadata to columns.

Additional context

I've tried to add it to the description field since I've noticed this is persisted as a comment. However, the key name of the metadata for the column will then always be comment and all the information needs to be inputed in that same key. I would like to have several metadata keys holding different types of information.

image image

Who will this benefit?

All dbt users that leverage the power of metadata on delta tables in Databricks.

CICDamen avatar May 10 '23 15:05 CICDamen

@andrefurlan-db would appreciate your thoughts on this one.

benc-db avatar Sep 13 '23 16:09 benc-db

@casperdamen123 did some research and one option is to specify your metadata as table properties: https://docs.databricks.com/en/delta/custom-metadata.html#store-custom-tags-in-table-properties

they won't be column scoped in your yml file but you could do something like this:

models:
  - name: model_name
    config:
       tblproperties:
         - "key": "value"

where key could be something like columnName.propertyName

benc-db avatar Sep 29 '23 18:09 benc-db

@benc-db that looks promising. This is already implemented in dbt-databricks at the moment or do we need to build this?

CICDamen avatar Oct 02 '23 16:10 CICDamen

Should work already. Let me know if you hit any issues.

benc-db avatar Oct 02 '23 16:10 benc-db

@benc-db did some testing but it doesn't seem to work just yet.

version: 2

models:
  - name: test
    description: "This is a test"
  
    config: 
      tblproperties:
         - "test.type": "geometry"
    
    columns:
      - name: test
        description: "test column"

Results in:

06:12:28            create or replace table `gima_test`.`test`.`test`
06:12:28          
06:12:28          
06:12:28        using delta
06:12:28          
06:12:28          
06:12:28          
06:12:28          
06:12:28          
06:12:28          comment 'This is a test'
06:12:28          
06:12:28          
06:12:28        tblproperties ('{'test.type': 'geometry'}' = '' 
06:12:28    --------------------------^^^
06:12:28        )
06:12:28          as
06:12:28          
06:12:28    select "test" as test

When not using the table.column notation

version: 2

models:

  - name: test
    description: "This is a test"
  
    config: 
      tblproperties:
         - "some_key": "geometry"
    
    columns:
      - name: test
        description: "test column"

Results in:

06:14:30            create or replace table `gima_test`.`test`.`test`
06:14:30          
06:14:30          
06:14:30        using delta
06:14:30          
06:14:30          
06:14:30          
06:14:30          
06:14:30          
06:14:30          comment 'This is a test'
06:14:30          
06:14:30          
06:14:30        tblproperties ('{'some_key': 'geometry'}' = '' 
06:14:30    ------------------------------^^^
06:14:30        )
06:14:30          as
06:14:30          
06:14:30    select "test" as test

CICDamen avatar Oct 03 '23 06:10 CICDamen

hmm, well I know we're using it in successfully in the model files: https://github.com/databricks/dbt-databricks/blob/3609a6ed5375848ea942036b2735fb6efc76101c/tests/integration/set_tblproperties/models/set_tblproperties.sql#L4. Could you try that and let me know if you get the same error?

benc-db avatar Oct 03 '23 15:10 benc-db

Ok, so here's what's going on...it's yaml being weird. if you specify like this:

tblproperties: { "column.meta": "value" }

it should work (as opposed to the yaml expression of a dictionary).

I'll see if I can get to the bottom of this, but in the mean time, can you try that work around?

benc-db avatar Oct 05 '23 17:10 benc-db

- "some_key": "geometry" I think this is the issue...sets the value of tblproperties to be an array with a dictionary as an entry, instead of:

tblproperties:
   "some_key" : "geometry"

my mistake for giving you bad yaml.

benc-db avatar Oct 05 '23 17:10 benc-db

I'm going to update the docs on dbt to include info on how to set tblproperties.

benc-db avatar Oct 05 '23 17:10 benc-db

Tried to use the following as per your example

version: 2

models:

  - name: test
    description: "This is a test"
  
    config: 
      tblproperties:
         "test.test": "geometry"
    
    columns:
      - name: test
        description: "test column"

The model run passes but the metadata for the test column is not populated

image

CICDamen avatar Oct 12 '23 15:10 CICDamen

Hi @benc-db , the solution you proposed does not work. using tblproperties: { "column.meta": "value" } nothing populated in Data Explorer UI

csimplestring avatar Jan 18 '24 12:01 csimplestring