
[ISSUE] Issue with `databricks_sql_table` resource during table schema evolution

Open vsluc opened this issue 1 year ago • 3 comments

When a new column is appended to the end of the column configuration, the `databricks_sql_table` resource recreates the table instead of altering it to add the new column.

The table was first created using the same configuration provided below, except that the `name` column was not present.

Configuration

resource "databricks_sql_table" "test_managed" {
  name               = "test_managed"
  catalog_name       = "my_catalog"
  schema_name        = "my_schema"
  table_type         = "MANAGED"
  data_source_format = "DELTA"

  column {
    name = "id"
    type = "int"
  }
  column {
    name = "sid"
    type = "string"
  }
  column {
    name = "name"
    type = "string"
  }  

  properties = {
    "delta.autoOptimize.autoCompact" = true,
    "delta.autoOptimize.optimizeWrite" = true,    
    "delta.enableChangeDataFeed" = true
  }
}

Expected Behavior

Any new column appended at the end should be treated as an ALTER to the table: the new column is added without destroying and recreating the table. Recreation deletes the data in the table, which is undesirable. Performing the same operation through the SQL interface (on a SQL warehouse) simply ALTERs the table, appends the new column, and retains the data. Terraform should behave the same way.
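For reference, the SQL-side operation described above is a single in-place statement (table and column names taken from the configuration in this report):

```sql
-- Appends the new column in place; existing rows are retained
-- and get NULL for the new column.
ALTER TABLE my_catalog.my_schema.test_managed
  ADD COLUMN name STRING;
```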

Actual Behavior

The table gets destroyed and recreated. The data in the table is lost.

Steps to Reproduce

  1. Create a table with 2 columns `id`, `sid`:
resource "databricks_sql_table" "test_managed" {
  name               = "test_managed"
  catalog_name       = "my_catalog"
  schema_name        = "my_schema"
  table_type         = "MANAGED"
  data_source_format = "DELTA"

  column {
    name = "id"
    type = "int"
  }
  column {
    name = "sid"
    type = "string"
  }

  properties = {
    "delta.autoOptimize.autoCompact" = true,
    "delta.autoOptimize.optimizeWrite" = true,    
    "delta.enableChangeDataFeed" = true
  }
}
  2. Insert sample rows:
insert into table my_catalog.my_schema.test_managed
values
(1,"sid_1")
,(2,"sid_2");
  3. Alter the table by adding a new column `name` at the end:
resource "databricks_sql_table" "test_managed" {
  name               = "test_managed"
  catalog_name       = "my_catalog"
  schema_name        = "my_schema"
  table_type         = "MANAGED"
  data_source_format = "DELTA"

  column {
    name = "id"
    type = "int"
  }
  column {
    name = "sid"
    type = "string"
  }
  column {
    name = "name"
    type = "string"
  }  

  properties = {
    "delta.autoOptimize.autoCompact" = true,
    "delta.autoOptimize.optimizeWrite" = true,    
    "delta.enableChangeDataFeed" = true
  }
}
  4. The plan shows `# forces replacement` due to the column addition:
21:44:55    # databricks_sql_table.test_managed must be replaced
21:44:55  -/+ resource "databricks_sql_table" "test_managed" {
21:44:55        + cluster_id         = (known after apply)
21:44:55        ~ id                 = "my_catalog.my_schema.test_managed" -> (known after apply)
21:44:55          name               = "test_managed"
21:44:55        ~ properties         = {
21:44:55            - "delta.lastCommitTimestamp"        = "1701920126000" -> null
21:44:55            - "delta.lastUpdateVersion"          = "0" -> null
21:44:55            - "delta.minReaderVersion"           = "1" -> null
21:44:55            - "delta.minWriterVersion"           = "4" -> null
21:44:55              # (3 unchanged elements hidden)
21:44:55          }
21:44:55        - storage_location   = "<redacted>" -> null
21:44:55          # (4 unchanged attributes hidden)
21:44:55  
21:44:55        + column { # forces replacement
21:44:55            + name     = "name"
21:44:55            + nullable = true
21:44:55            + type     = "string"
21:44:55          }
21:44:55  
21:44:55          # (2 unchanged blocks hidden)
21:44:55      }
  5. Upon apply, the table is destroyed and a new table is created under a new S3 prefix.
  6. The data is lost as well; the query below returns 0 rows:
select * from my_catalog.my_schema.test_managed;

NOTE: When a similar attempt is made on an external table, the table is also destroyed, but the apply then fails with the error below:

21:48:18  │ Error: cannot create sql table: cannot execute CREATE EXTERNAL TABLE `my_catalog`.`my_schema`.`test_external` (id int, sid string, name string)
21:48:18  │ USING DELTA
21:48:18  │ TBLPROPERTIES ('delta.autoOptimize.autoCompact'='true', 'delta.autoOptimize.optimizeWrite'='true', 'delta.enableChangeDataFeed'='true')
21:48:18  │ LOCATION '<REDACTED>';: The specified schema does not match the existing schema at <REDACTED>.
21:48:18  │ 
21:48:18  │   with databricks_sql_table.test_external,
21:48:18  │   on table-test.tf line 28, in resource "databricks_sql_table" "test_external":
21:48:18  │   28: resource "databricks_sql_table" "test_external" {
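Until the provider handles column changes in place, one defensive guard (a workaround sketch, not a fix) is Terraform's built-in `prevent_destroy` lifecycle rule, which turns any plan that would replace the table into a hard error instead of a silent data-deleting recreation:

```hcl
resource "databricks_sql_table" "test_managed" {
  name               = "test_managed"
  catalog_name       = "my_catalog"
  schema_name        = "my_schema"
  table_type         = "MANAGED"
  data_source_format = "DELTA"

  column {
    name = "id"
    type = "int"
  }
  column {
    name = "sid"
    type = "string"
  }

  # Any plan that would destroy/replace this table now fails at
  # plan time instead of deleting the underlying Delta data.
  lifecycle {
    prevent_destroy = true
  }
}
```

This does not make the column addition work; it only converts the destructive replacement into an explicit error so the table's data cannot be lost by accident.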

Terraform and provider versions

Terraform v1.5.7

provider registry.terraform.io/databricks/databricks v1.31.0
provider registry.terraform.io/hashicorp/aws v4.67.0
provider registry.terraform.io/hashicorp/local v2.4.0

Is it a regression?

Unknown

vsluc avatar Dec 07 '23 04:12 vsluc

@vsluc yes, this is currently a gap in the `databricks_sql_table` resource: any change to columns is handled as a re-creation.

Additionally, when creating an external table, the Databricks Runtime checks that the schema specified in the CTAS statement matches the existing schema, and throws an error otherwise. One way to handle this would be a parameter that skips sending the schema to the runtime.

nkvuong avatar Dec 08 '23 16:12 nkvuong

+1

henryhueske avatar Jan 16 '24 15:01 henryhueske

Having the same issue here.

CarreauClement avatar Feb 05 '24 14:02 CarreauClement

Please fix this

Dedvall avatar Feb 22 '24 11:02 Dedvall

It would be nice to have this fixed

mskalicky-cint avatar Feb 28 '24 17:02 mskalicky-cint

Please fix this issue

SathwikaJ avatar Feb 28 '24 17:02 SathwikaJ

Hello. @nkvuong, any news on this issue? This is very blocking. Thanks

alexandremoyrand avatar Mar 07 '24 10:03 alexandremoyrand

The issue has been fixed with the following PRs: https://github.com/databricks/terraform-provider-databricks/pull/3359 https://github.com/databricks/terraform-provider-databricks/pull/3381

They will go out with the next release.

edwardfeng-db avatar Apr 03 '24 11:04 edwardfeng-db

Thanks @edwardfeng-db. Much anticipated change.

vsluc avatar Apr 04 '24 22:04 vsluc