
[CT-816] [Bug] Incremental materialization - cast column when changing column type



Describe the Feature

Hello,

DWH type: Redshift

In the incremental materialization, when altering a column's type, no cast is made.

In our model, a column's type changed from varchar to timestamp, resulting in an error because we are trying to insert varchar values into a timestamp column. A cast to the destination column type would solve this issue. In general, I don't see any problem with systematically casting values to the destination column type. What do you think?
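
For illustration (hypothetical table and column names), the failure comes from assigning the old varchar values to the new timestamp column without a cast:

    -- Hypothetical repro: the column changed from varchar to timestamp,
    -- and the backfill runs without a cast
    update analytics.my_model
    set created_at__dbt_alter = created_at;  -- varchar assigned to timestamp
    -- Postgres/Redshift reject this with an error along the lines of:
    --   column "created_at__dbt_alter" is of type timestamp without time zone
    --   but expression is of type character varying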

Describe alternatives you've considered

Modifications:

    update {{ relation }} set {{ adapter.quote(tmp_column) }} = {{ adapter.quote(column_name) }}::{{ new_column_type }};

The :: operator is Postgres-specific, so the modification should be made in the postgres adapter.
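
For context, here's a minimal, untested sketch of what that override could look like in the postgres adapter, mirroring the add/backfill/drop/rename steps of the default implementation:

    {% macro postgres__alter_column_type(relation, column_name, new_column_type) -%}
      {#- untested sketch: same steps as the default implementation,
          but with a :: cast on the backfill -#}
      {%- set tmp_column = column_name + '__dbt_alter' -%}

      {% call statement('alter_column_type') %}
        alter table {{ relation }} add column {{ adapter.quote(tmp_column) }} {{ new_column_type }};
        update {{ relation }} set {{ adapter.quote(tmp_column) }} = {{ adapter.quote(column_name) }}::{{ new_column_type }};
        alter table {{ relation }} drop column {{ adapter.quote(column_name) }} cascade;
        alter table {{ relation }} rename column {{ adapter.quote(tmp_column) }} to {{ adapter.quote(column_name) }}
      {% endcall %}
    {%- endmacro %}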

Who will this benefit?

dbt + postgres users

Are you interested in contributing this feature?

I can try to make the modification if you are OK with my proposal. I have signed the CLA.

Anything else?

No response

Olaktal avatar Jul 06 '22 18:07 Olaktal

@Olaktal Thanks for opening!

Looking into this a bit (but without testing it myself), this looks like a bug to me. It would affect Redshift.

Let's take a quick look at a few adapters to confirm or deny:

  • Snowflake and Spark have direct syntax for altering the data type of a column, so its implementation is very straightforward and doesn't even need a cast.
  • BigQuery requires re-writing the entire table to change the column type, and it uses a cast.
  • The default implementation inherited by Postgres, Redshift, and other database adapters doesn't contain a cast, as you pointed out.

Here would be my suggested implementation (currently untested), which looks a lot like your suggestion but should work across more databases by using the standard SQL cast() syntax:

    update {{ relation }} set {{ adapter.quote(tmp_column) }} = cast({{ adapter.quote(column_name) }} as {{ new_column_type }});

As you noted, the suggested line of code would go right here.
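
In context, the full macro would then look something like this (also untested; following the shape of the current default implementation):

    {% macro default__alter_column_type(relation, column_name, new_column_type) -%}
      {#- untested sketch: add a column under a temp name, backfill it with
          an explicit cast, drop the old column, then rename -#}
      {%- set tmp_column = column_name + '__dbt_alter' -%}

      {% call statement('alter_column_type') %}
        alter table {{ relation }} add column {{ adapter.quote(tmp_column) }} {{ new_column_type }};
        update {{ relation }} set {{ adapter.quote(tmp_column) }} = cast({{ adapter.quote(column_name) }} as {{ new_column_type }});
        alter table {{ relation }} drop column {{ adapter.quote(column_name) }} cascade;
        alter table {{ relation }} rename column {{ adapter.quote(tmp_column) }} to {{ adapter.quote(column_name) }}
      {% endcall %}
    {%- endmacro %}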

Instructions for implementation

  • [ ] Validate the failure case locally using Redshift and/or Postgres (see the sketch after this list)
    1. Create two seed files with identical column names, but different data types (timestamps, then strings)
    2. Create one incremental model that loads from the first seed and run it
    3. Update the incremental model to load from the second seed and run it
  • [ ] Create a functional test that fails for this use case (to prevent a regression in the future)
  • [ ] Implement the change suggested above
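
As a concrete sketch of the repro (all file, seed, and column names here are hypothetical), the incremental model might look like:

    -- models/repro_incremental.sql (hypothetical)
    {{ config(materialized='incremental', on_schema_change='sync_all_columns') }}

    -- First run: point at the seed whose created_at column is a timestamp.
    -- Second run: switch the ref to 'seed_strings' (identical column names,
    -- but created_at is a string) and run again to hit the failing backfill.
    select id, created_at
    from {{ ref('seed_timestamps') }}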

Future ideas

I looked for a cross-database cast() macro, but the closest I could find was safe_cast().

Feature request: Add a default cast() cross-database macro that Postgres and Redshift could optionally override with the :: syntax.

Edit 2023-02-07: added this feature request here:

  • https://github.com/dbt-labs/dbt-adapters/issues/84
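
For illustration, such a macro might be dispatched like other cross-database macros (untested sketch; none of these macros existed at the time of writing):

    {%- macro cast(expression, data_type) -%}
      {{ return(adapter.dispatch('cast', 'dbt')(expression, data_type)) }}
    {%- endmacro -%}

    {%- macro default__cast(expression, data_type) -%}
      cast({{ expression }} as {{ data_type }})
    {%- endmacro -%}

    {#- optional override for adapters that prefer the shorthand -#}
    {%- macro postgres__cast(expression, data_type) -%}
      {{ expression }}::{{ data_type }}
    {%- endmacro -%}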

Notes for adapter maintainers

There are three main ways to implement the alter_column_type macro (sketched in SQL after this list):

  1. If available, use the database's built-in DDL function for changing a column's data type.
  2. Use a SQL query to select the table data, cast the relevant column, and overwrite the table.
  3. Add a new column with the desired type (using a temporary name), backfill values from the old column into the new (casting along the way!), drop the old column, and rename from the temporary column name to the old column name.
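
In rough SQL terms (hypothetical names; exact syntax varies by database), the three methods look like:

    -- Method 1: direct DDL, where the database supports it
    alter table my_table alter column my_col set data type varchar(100);

    -- Method 2: rewrite the whole table, casting the relevant column
    create or replace table my_table as
    select cast(my_col as varchar(100)) as my_col, other_col
    from my_table;

    -- Method 3: add, backfill (casting along the way!), drop, rename
    alter table my_table add column my_col__dbt_alter varchar(100);
    update my_table set my_col__dbt_alter = cast(my_col as varchar(100));
    alter table my_table drop column my_col cascade;
    alter table my_table rename column my_col__dbt_alter to my_col;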

Here's a decision tree for implementing the alter_column_type macro:

  1. Does your database have a direct method to update the data type of a column (similar to Snowflake and Spark)?
    • If so, use it (method 1 above).
  2. Is adding or dropping a column an "expensive" operation that requires a full table scan (similar to BigQuery)?
    • Overwrite the table (method 2 above).
  3. None of the above (like Postgres and Redshift)?
    • Do nothing to use the default implementation (method 3 above).

dbeatty10 avatar Aug 10 '22 15:08 dbeatty10

@dbeatty10 I think you're right on the money! A cast() here should do the trick, and it's fair that a user would expect this to work.

One thing I did want to call out:

Does your database have a direct method to update the data type of a column (similar to Snowflake and Spark)? If so, use it (method 1 above).

Snowflake's built-in alter table ... alter column does not support changing a column's data type to a different type. From the Snowflake docs, this is listed among the unsupported actions:

Change a column data type to a different type (e.g. STRING to NUMBER).

In this case, the "hard refresh" of the column (default approach) would work, but the Snowflake-specific approach would not work.
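
Concretely (hypothetical names; based on my reading of the Snowflake docs):

    -- Supported on Snowflake: e.g. widening a varchar
    alter table my_table alter column my_col set data type varchar(200);

    -- Not supported on Snowflake: changing to a different type entirely
    alter table my_table alter column my_col set data type number;  -- errors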

Maybe dbt should opt into the "hard refresh" version if on_schema_change is set to "sync all columns" (including data type changes)? (I think that's the context for calling the alter_column_type macro in this issue.)

A balanced approach we could consider taking: snowflake__alter_column_type uses the built-in alter table ... alter column if and only if the old and new types are compatible (aliases, or just increasing length/precision), and opts for the "hard refresh" default otherwise.
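
A very rough sketch of that approach (untested; is_compatible_type_change is an imagined helper, not an existing macro):

    {% macro snowflake__alter_column_type(relation, column_name, new_column_type) %}
      {% if is_compatible_type_change(relation, column_name, new_column_type) %}
        {#- compatible change (alias, or increased length/precision): built-in DDL -#}
        {% call statement('alter_column_type') %}
          alter table {{ relation }} alter column {{ adapter.quote(column_name) }} set data type {{ new_column_type }};
        {% endcall %}
      {% else %}
        {#- incompatible change: fall back to the default "hard refresh" -#}
        {{ default__alter_column_type(relation, column_name, new_column_type) }}
      {% endif %}
    {% endmacro %}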

jtcohen6 avatar Aug 10 '22 17:08 jtcohen6

@jtcohen6 Thanks for expanding the context and correcting the key details!

Logically, the balanced approach makes sense (though I'm guessing it would be a bit tricky to implement).

The suggestion to opt into the "hard refresh" version if on_schema_change is set to "sync all columns" sounds a lot more straightforward in terms of implementation.

Ancillary commentary

Regardless of the approach chosen, I'd expect the dbt run/build to fail if the old data type can't be cast() to the new data type. Does that sound right?

This could be "solved" by exposing an option to use safe_cast() instead. If we go that route, we'd want the user to opt in to that configuration, since it would fill in null values whenever the cast didn't work. There are also some databases that don't support safe_cast() at all; on those, the run would just fail whenever the cast doesn't work, no matter what.
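
For illustration, the null-filling behavior on databases that do have a safe-cast variant (Snowflake spells it try_cast; BigQuery spells it safe_cast):

    select try_cast('not-a-timestamp' as timestamp);  -- returns NULL instead of erroring
    select cast('not-a-timestamp' as timestamp);      -- errors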

dbeatty10 avatar Aug 10 '22 17:08 dbeatty10

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] avatar Feb 07 '23 01:02 github-actions[bot]

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] avatar Aug 07 '23 01:08 github-actions[bot]

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

github-actions[bot] avatar Aug 15 '23 01:08 github-actions[bot]