
[BUG] dbt-dry-run Ignoring Partitions from dbt-external-tables

betosaulo-cw opened this issue 1 year ago · 2 comments

I've encountered an issue where dbt-dry-run fails to recognize the partitions defined in a YAML configuration for a BigQuery external table. Even though the partitions are correctly specified under the dbt-external-tables setup, the dry run fails with a BadRequest error indicating that the partition column could not be found.

Steps to Reproduce

  1. Define a table in YAML with the following structure:
```yaml
version: 2

sources:
  - name: dataset_name
    tables:
      - name: table_name
        external:
          location: "gs://bucket/path/to/table/table_name/*.json"
          options:
            format: json
            hive_partition_uri_prefix: "gs://bucket/path/to/table/table_name"
            ignore_unknown_values: true
          partitions:
            - name: ts
              data_type: STRING
        columns:
          - name: uuid
            data_type: STRING
          - name: name
            data_type: STRING
          ...
```

  2. Run the dry run and observe the error:

     400 POST https://bigquery.googleapis.com/bigquery/v2/projects/project_id/jobs?prettyPrint=false: Name ts not found inside a at [15:9]

betosaulo-cw avatar Dec 29 '23 13:12 betosaulo-cw

Hey 👋

Thanks for raising an issue and PR! Am I understanding this correctly that the final schema of the table is the union of the columns defined in `partitions` and the columns defined in `columns`? Is it possible for `partitions` to be defined and `columns` to be empty? I know that might not be very useful, but the dry runner should handle that case as well if so.

ccharlesgb avatar Jan 03 '24 09:01 ccharlesgb

Hey, I hope you're doing well!

> Am I understanding this correctly that the final schema of the table is the union of the columns defined in `partitions` and the columns defined in `columns`?

Yes, you're right. The table's columns are a combination of `columns` and `partitions`. This happens because Hive partitioning is enabled, so in storage the files follow a pattern like `gs://bucket/path/to/table/col_partition_1=value/col_partition_2=value/file_without_col_partitions.json`. In that case, if you apply a WHERE condition like `col_partition_1 = value`, BigQuery only scans the files under that folder.
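A minimal sketch of that union (the helper name `merge_external_schema` is illustrative, not dbt-dry-run's actual internals):

```python
# Sketch of the column/partition union described above: the final
# schema is the file columns plus the hive partition columns.
# `merge_external_schema` is an illustrative helper, not real
# dbt-dry-run code.

def merge_external_schema(columns, partitions):
    """Final table schema = file columns + hive partition columns."""
    schema = dict(columns)  # columns read from the JSON files themselves
    for part in partitions:
        # partition values come from the path, e.g. .../ts=2024-01-03/data.json
        schema[part["name"]] = part["data_type"]
    return schema

columns = {"uuid": "STRING", "name": "STRING"}
partitions = [{"name": "ts", "data_type": "STRING"}]
print(merge_external_schema(columns, partitions))
# {'uuid': 'STRING', 'name': 'STRING', 'ts': 'STRING'}
```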

You can see examples for all the databases that dbt-external-tables supports here, and this seems to be the default behavior.

> Is it possible for `partitions` to be defined and `columns` to be empty? I know that might not be very useful but the dry runner should handle that case as well if so.

I've never tried this specific case, but I think the code covers situations where `columns` or `partitions` are empty. Pydantic uses a default factory for these fields, so when they are omitted the model gets an empty list of partitions and an empty dictionary of columns, and the unpacking then works without any issues.
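For reference, Pydantic's `default_factory` behaves like the stdlib dataclass equivalent sketched below (class and field names are illustrative, not dbt-dry-run's real models):

```python
from dataclasses import dataclass, field

# Sketch of the default-factory behaviour: keys omitted from the YAML
# become empty containers, so the later unpacking never sees None.
# Class and field names are illustrative only.

@dataclass
class ExternalConfig:
    partitions: list = field(default_factory=list)
    columns: dict = field(default_factory=dict)

cfg = ExternalConfig()  # source with neither key declared
schema = {**cfg.columns, **{p["name"]: p["data_type"] for p in cfg.partitions}}
print(schema)  # {} -- empty schema, but no crash
```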

betosaulo-cw avatar Jan 03 '24 14:01 betosaulo-cw