
Confusing discrepancy between `replace` and `createOrReplace`.

Open madten opened this issue 4 years ago • 6 comments

If I create a table like this:

import io.delta.tables._

DeltaTable
  .create(spark)
  .location("/tmp/foo")
  .addColumn("x", "integer")
  .execute()

I can use `replace` with a different schema:

DeltaTable
  .replace(spark)
  .location("/tmp/foo")
  .addColumn("y", "string")
  .execute()

which executes just fine and replaces the schema.

However, if I try createOrReplace:

DeltaTable
  .createOrReplace(spark)
  .location("/tmp/foo")
  .addColumn("z", "boolean")
  .execute()

it fails with

  org.apache.spark.sql.AnalysisException: The specified schema does not match the existing schema at /tmp/foo.

  == Specified ==
  root
   |-- z: boolean (nullable = true)


  == Existing ==
  root
   |-- y: string (nullable = true)


  == Differences==
  - Specified schema is missing field(s): y
  - Specified schema has additional field(s): z

  If your intention is to keep the existing schema, you can omit the
  schema from the create table command. Otherwise please ensure that
  the schema matches.

Personally, I find this behavior confusing and would expect both to behave the same as long as the table exists.

To make it even more confusing, this behavior changes if `tableName` is used instead of `location`. The following passes just fine:

import io.delta.tables._

DeltaTable
  .create(spark)
  .tableName("foo")
  .addColumn("x", "integer")
  .execute()

DeltaTable
  .replace(spark)
  .tableName("foo")
  .addColumn("y", "string")
  .execute()

DeltaTable
  .createOrReplace(spark)
  .tableName("foo")
  .addColumn("z", "boolean")
  .execute()

The difference seems to be caused by the following piece of code:

https://github.com/delta-io/delta/blob/0f30f6f47761cb01043d84aad182f30cb59b5e31/core/src/main/scala/org/apache/spark/sql/delta/commands/CreateDeltaTableCommand.scala#L194-L199

which treats a path-based Delta table with `createOrReplace` as a special case and applies `createTransactionLogOrVerify()`, which is not used in the catalog-table case.
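
Roughly, the dispatch looks like this (paraphrased from that revision, not verbatim; the comments are my reading of it):

operation match {
  case TableCreationModes.Create =>
    createTransactionLogOrVerify()
  case TableCreationModes.CreateOrReplace if !tableExists =>
    // a path-based createOrReplace lands here, because tableExists is a catalog
    // lookup; the existing _delta_log at the path is then verified against the
    // new schema, and verifyTableMetadata throws the exception above
    createTransactionLogOrVerify()
  case _ =>
    // a catalog-backed replace/createOrReplace takes this branch instead,
    // where the schema is replaced rather than verified
}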

madten avatar Nov 11 '21 14:11 madten

Hi @madten, thanks for making this issue! Would you like to make a PR to implement your suggested fix, as well as test cases to verify it?

scottsand-db avatar Nov 12 '21 17:11 scottsand-db

Hi @scottsand-db. Sorry for the delayed response. I'd be happy to, but this is the first time I'm looking at the Delta source, and I'd like to get some feedback from @brkyvz first.

I checked the original commit (https://github.com/delta-io/delta/commit/5cc383496b35905d3b7911a1f3418777156464c9), but it is not clear from it why this clause was deemed necessary.

madten avatar Dec 04 '21 22:12 madten

This does seem to be a bug. I would expect that `tableExists` returns true, if indeed that's where the exception is thrown.

brkyvz avatar Dec 05 '21 04:12 brkyvz

> if indeed that's where the exception is thrown

The stack trace starts with:

  at org.apache.spark.sql.delta.DeltaErrors$.createTableWithDifferentSchemaException(DeltaErrors.scala:763)
  at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand.verifyTableMetadata(CreateDeltaTableCommand.scala:288)
  at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand.createTransactionLogOrVerify$1(CreateDeltaTableCommand.scala:180)
  at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand.$anonfun$run$2(CreateDeltaTableCommand.scala:194)
  at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand$$Lambda$5173/0000000000000000.apply(Unknown Source)
  at com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:77)
  at com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:67)
  at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand.recordOperation(CreateDeltaTableCommand.scala:48)
  at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:106)
  at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:91)
  at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand.recordDeltaOperation(CreateDeltaTableCommand.scala:48)
  at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand.run(CreateDeltaTableCommand.scala:104)
  at org.apache.spark.sql.delta.catalog.DeltaCatalog.org$apache$spark$sql$delta$catalog$DeltaCatalog$$createDeltaTable(DeltaCatalog.scala:147)

which seems to point to CreateDeltaTableCommand.scala:194.

> I would expect that `tableExists` returns true

This is actually confusing, because `tableExists` is defined by querying the catalog, so it makes sense that it is false if we just have a dataset written in Delta format, not saved as a table, right?
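
For example, with the names from the snippets above:

import io.delta.tables.DeltaTable

// the path-based dataset is valid Delta data: a _delta_log exists at the path
DeltaTable.isDeltaTable(spark, "/tmp/foo")  // true

// but nothing was ever registered in the session catalog for it, so a catalog
// lookup only finds the table created via .tableName("foo"), never the
// path-based one
spark.catalog.tableExists("foo")  // true only for the tableName example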

madten avatar Dec 05 '21 20:12 madten

I'm having exactly the same problem. In addition, using `replace` automatically truncates the table. I would like to avoid having to write ALTER TABLE statements in SQL, but it looks like that's the only alternative for now.
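
For reference, that SQL route would look something like this (a sketch reusing the table names from the examples above):

// ALTER TABLE adds the column in place and keeps the existing rows,
// unlike replace(), which truncates the table
spark.sql("ALTER TABLE foo ADD COLUMNS (y STRING)")              // catalog table
spark.sql("ALTER TABLE delta.`/tmp/foo` ADD COLUMNS (y STRING)") // path-based table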

pgrandjean avatar May 02 '24 14:05 pgrandjean

I found the following solution:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog, TableChange}

// the following configuration setting is not needed in Databricks
val spark = SparkSession.builder
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// grab the session catalog as a TableCatalog so TableChanges can be issued directly
val tableCatalog = spark.sessionState.catalogManager.catalog("spark_catalog").asInstanceOf[TableCatalog]
tableCatalog.alterTable(
  Identifier.of(Array("default"), "foo"), // the table default.foo
  TableChange.updateColumnComment(Array("bar"), "hello") // add/update the comment on column bar
)

`TableChange` has methods to change a table or its columns, such as:

  • setProperty
  • addColumn
  • deleteColumn
  • updateColumn
  • etc

The table history looks like this:

+-------+--------------------+------+--------+-------------+--------------------+----+--------+---------+-----------+--------------+-------------+----------------+------------+--------------------+
|version|           timestamp|userId|userName|    operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics|userMetadata|          engineInfo|
+-------+--------------------+------+--------+-------------+--------------------+----+--------+---------+-----------+--------------+-------------+----------------+------------+--------------------+
|      2|2024-05-03 21:36:...|  null|    null|CHANGE COLUMN|{column -> {"name...|null|    null|     null|          1|  Serializable|         true|              {}|        null|Apache-Spark/3.4....|
|      1|2024-05-03 21:36:...|  null|    null|CHANGE COLUMN|{column -> {"name...|null|    null|     null|          0|  Serializable|         true|              {}|        null|Apache-Spark/3.4....|
|      0|2024-05-03 21:36:...|  null|    null| CREATE TABLE|{isManaged -> fal...|null|    null|     null|       null|  Serializable|         true|              {}|        null|Apache-Spark/3.4....|
+-------+--------------------+------+--------+-------------+--------------------+----+--------+---------+-----------+--------------+-------------+----------------+------------+--------------------+
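
Other `TableChange` operations follow the same pattern, and several can be passed in one `alterTable` call. A minimal sketch (the new column and the property below are purely illustrative):

import org.apache.spark.sql.types.DataTypes

// each alterTable call can carry multiple TableChange arguments
tableCatalog.alterTable(
  Identifier.of(Array("default"), "foo"),
  TableChange.addColumn(Array("z"), DataTypes.BooleanType),     // add a nullable column z
  TableChange.setProperty("delta.enableChangeDataFeed", "true") // set a table property
)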

pgrandjean avatar May 03 '24 21:05 pgrandjean