dbt-databricks
syntax for materializing in external locations
I recently switched my project to Unity Catalog, but setting +location_root in dbt_project.yml isn't affecting where or how schemas get created. dbt-databricks created the schema as a 'managed' schema. Views were created OK, but it threw an error when I tried to materialize a table:
No parent external location found for path 'abfss://abc@profilelocationroot/schema/model'
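For reference, the config looks roughly like this (a sketch only; the container, storage account, and path segments are placeholders, not my real values):

dbt_project.yml
models:
  my_project:
    +location_root: "abfss://<container>@<storage_account>.dfs.core.windows.net/<schema>"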
Two questions. First: does the Databricks connector support creating 'external' schemas, or do I need to create them manually prior to dbt run?
Second question: I want objects/models to be created with the owner set to the metastore admin group rather than my username. Is there a keyword for that property that I can configure in a yml file?
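To illustrate both questions, this is roughly the SQL I'd otherwise have to run by hand before dbt run, with the second statement being what I mean by handing ownership to the admin group (a sketch only; the catalog, schema, storage path, and group name are all placeholders, and the path still has to fall under an external location registered in the metastore, which is what the error above complains about):

create schema if not exists my_catalog.my_schema
managed location 'abfss://<container>@<storage_account>.dfs.core.windows.net/<schema>';

alter schema my_catalog.my_schema owner to `metastore-admins`;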
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue.
hey @jakub-auger,
we work around this at the schema level by overriding the spark__create_schema macro, like so:
{% macro spark__create_schema(relation) -%}
  {# expected shape: s3://<bucket>/<schema>/ #}
  {% set s3_path_pattern = '^s3://([a-z0-9-_]+)/([a-z0-9-_]+)/$' %}
  {# build <root>/<schema>/ from the project var; quote(schema=false) renders the schema name unquoted #}
  {%- set s3_schema_location = var("default_schema_location_root") ~ '/' ~ relation.quote(schema=false) ~ '/' -%}
  {# fail fast at compile time if the rendered location is not a valid s3 path #}
  {% if not modules.re.match(s3_path_pattern, s3_schema_location) %}
    {%- do exceptions.raise_compiler_error(
      '[' ~ s3_schema_location ~ '] is not a valid s3 path'
    ) -%}
  {% endif %}
  {%- call statement('create_schema') -%}
    create schema if not exists {{ relation }}
    location '{{ s3_schema_location }}'
  {% endcall %}
{% endmacro %}
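For example, with the default_schema_location_root var (configured below) set to s3://my-datasets and a schema named analytics (a made-up name for illustration), the macro renders roughly:

create schema if not exists analytics
location 's3://my-datasets/analytics/'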
dbt_project.yml
vars:
  default_schema_location_root:
    "{%- if target.name == 'prod' -%} s3://my-datasets
     {%- elif target.name == 'ci' -%} s3://my-dev-datasets
     {%- elif target.name == 'dev' -%} s3://my-dev-datasets
     {%- else -%} {%- do exceptions.raise_compiler_error(target.name ~ ' not a known environment!') -%}
     {%- endif -%}"
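The var value is itself a Jinja string that dbt renders against the active target, so each environment resolves to its own bucket and an unknown target fails the run early.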
This still creates new tables as "managed", just in an external location. With this setup, "managed" means a drop table statement will also purge the underlying files in s3.
"Managed in s3" is fine for dev + CI data, but we wanted real external tables for snapshots, because recovering deleted files from s3 is not straightforward if a drop table ever happens on a snapshot.
To do this, we also override the location_clause macro and add a location_root to the snapshots node in dbt_project.yml.
location_clause macro:
{#
  Renders a location_clause() when a `location_root` is provided in `dbt_project.yml`.
  This macro slightly overrides dbt_spark's default: the override ignores empty strings,
  in other words, omitting the property and providing an empty string behave the same.
  See the original macro [here](https://github.com/dbt-labs/dbt-spark/blob/933a9def6209612aa83c10e2775786ff448b3999/dbt/include/spark/macros/adapters.sql#L28-L34)
#}
{% macro location_clause() %}
  {%- set location_root = config.get('location_root', validator=validation.any[basestring]) -%}
  {%- set identifier = model['alias'] -%}
  {# emit the clause only when location_root is set and non-blank; otherwise the table stays managed #}
  {%- if location_root is not none and location_root | trim != '' %}
    location '{{ location_root }}/{{ identifier }}'
  {%- endif %}
{%- endmacro -%}
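With +location_root set to s3://my-datasets/raw_snapshots and a snapshot aliased orders_snapshot (a made-up alias for illustration), this renders:

location 's3://my-datasets/raw_snapshots/orders_snapshot'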
dbt_project.yml
snapshots:
  my_project:
    # snapshot tables will be "managed" on "dev" targets, and "external" on "ci" and "prod" targets.
    +location_root:
      "{%- if target.name == 'prod' -%} s3://my-datasets/raw_snapshots
       {%- elif target.name == 'ci' -%} s3://my-dev-datasets/{{ target.schema }}_raw_snapshots
       {%- endif -%}"
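Note there is no else branch here: on dev the var renders to an empty string, which the overridden location_clause treats the same as omitting the property, so dev snapshots stay managed.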
@jakub-auger, lastly, are you able to remove the stale label from this issue?
@ktopcuoglu does the workaround you provided work with Unity Catalog? My understanding was that external tables were mutually exclusive with managed...
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue.