Confusing discrepancy between `replace` and `createOrReplace`.
If I create a table like this:
```scala
import io.delta.tables._

DeltaTable
  .create(spark)
  .location("/tmp/foo")
  .addColumn("x", "integer")
  .execute()
```
I can use `replace` with a different schema:
```scala
DeltaTable
  .replace(spark)
  .location("/tmp/foo")
  .addColumn("y", "string")
  .execute()
```
which executes just fine and replaces the schema.
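As a quick sanity check (a minimal sketch, assuming the same `/tmp/foo` path as above), reading the table back shows only the new column:

```scala
// Read the path-based table back and inspect the schema after the replace.
spark.read.format("delta").load("/tmp/foo").printSchema()
// root
//  |-- y: string (nullable = true)
```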
However, if I try `createOrReplace`:
```scala
DeltaTable
  .createOrReplace(spark)
  .location("/tmp/foo")
  .addColumn("z", "boolean")
  .execute()
```
it fails with:

```
org.apache.spark.sql.AnalysisException: The specified schema does not match the existing schema at /tmp/foo.

== Specified ==
root
 |-- z: boolean (nullable = true)

== Existing ==
root
 |-- y: string (nullable = true)

== Differences==
- Specified schema is missing field(s): y
- Specified schema has additional field(s): z

If your intention is to keep the existing schema, you can omit the
schema from the create table command. Otherwise please ensure that
the schema matches.
```
Personally, I find this behavior confusing and would expect both to behave the same, as long as the table exists.
To make it even more confusing, the behavior changes if `tableName` is used instead of `location`. The following passes just fine:
```scala
import io.delta.tables._

DeltaTable
  .create(spark)
  .tableName("foo")
  .addColumn("x", "integer")
  .execute()

DeltaTable
  .replace(spark)
  .tableName("foo")
  .addColumn("y", "string")
  .execute()

DeltaTable
  .createOrReplace(spark)
  .tableName("foo")
  .addColumn("z", "boolean")
  .execute()
```
The difference seems to be caused by the following piece:
https://github.com/delta-io/delta/blob/0f30f6f47761cb01043d84aad182f30cb59b5e31/core/src/main/scala/org/apache/spark/sql/delta/commands/CreateDeltaTableCommand.scala#L194-L199
which treats `createOrReplace` on a path-based Delta table as a special case and applies `createTransactionLogOrVerify()`, which is not used for catalog tables.
Hi @madten, thanks for making this issue! Would you like to make a PR to implement your suggested fix, as well as test cases to verify it?
Hi @scottsand-db. Sorry for the delayed response. I'd be happy to, but it is the first time I am looking at the Delta source, and I'd like to get some feedback from @brkyvz first.
I checked the original commit (https://github.com/delta-io/delta/commit/5cc383496b35905d3b7911a1f3418777156464c9), but it is not clear from it why this clause was deemed necessary.
This does seem to be a bug. I would expect that `tableExists` returns `true`, if indeed that's where the exception is thrown.
> if indeed that's where the exception is thrown
The traceback starts with:

```
at org.apache.spark.sql.delta.DeltaErrors$.createTableWithDifferentSchemaException(DeltaErrors.scala:763)
at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand.verifyTableMetadata(CreateDeltaTableCommand.scala:288)
at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand.createTransactionLogOrVerify$1(CreateDeltaTableCommand.scala:180)
at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand.$anonfun$run$2(CreateDeltaTableCommand.scala:194)
at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand$$Lambda$5173/0000000000000000.apply(Unknown Source)
at com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:77)
at com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:67)
at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand.recordOperation(CreateDeltaTableCommand.scala:48)
at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:106)
at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:91)
at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand.recordDeltaOperation(CreateDeltaTableCommand.scala:48)
at org.apache.spark.sql.delta.commands.CreateDeltaTableCommand.run(CreateDeltaTableCommand.scala:104)
at org.apache.spark.sql.delta.catalog.DeltaCatalog.org$apache$spark$sql$delta$catalog$DeltaCatalog$$createDeltaTable(DeltaCatalog.scala:147)
```
which seems to point to CreateDeltaTableCommand.scala:194.
> I would expect that `tableExists` returns `true`
This is actually confusing, because `tableExists` is defined by querying the catalog, so it makes sense that it is `false` if we just have a dataset written in Delta format, not saved as a table, right?
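To illustrate (a minimal sketch, assuming only the path-based table from the first example exists), the catalog lookup and the path-based check disagree for such a table:

```scala
import io.delta.tables.DeltaTable

// No catalog entry exists for a table that was only written to a path...
spark.catalog.tableExists("foo")           // false
// ...even though a valid Delta log exists at that location.
DeltaTable.isDeltaTable(spark, "/tmp/foo") // true
```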
I'm having exactly the same problem. In addition, using `replace` automatically truncates the table. I would like to avoid having to write `ALTER TABLE` statements in SQL, but it looks like that is the only alternative for now.
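For reference, the SQL route looks like this (a sketch assuming a catalog table `foo`; Delta supports `ALTER TABLE ... ADD COLUMNS`, which evolves the schema without touching existing rows):

```scala
// Evolve the schema in place instead of replacing the table,
// which keeps the existing data intact.
spark.sql("ALTER TABLE foo ADD COLUMNS (y STRING)")
```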
I found the following solution:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog, TableChange}

// The following configuration setting is not needed in Databricks.
val spark = SparkSession.builder
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

val tableCatalog = spark.sessionState.catalogManager
  .catalog("spark_catalog")
  .asInstanceOf[TableCatalog]

tableCatalog.alterTable(
  Identifier.of(Array("default"), "foo"),                // changing table default.foo
  TableChange.updateColumnComment(Array("bar"), "hello") // add/update the comment on column bar
)
```
`TableChange` has methods to change a table or its columns (see the sketch below), like:
- `setProperty`
- `addColumn`
- `deleteColumn`
- `updateColumnType` / `updateColumnComment` / `updateColumnNullability`
- etc.
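For instance, adding a column through the same `alterTable` call (a sketch, assuming the `tableCatalog` and table from the snippet above):

```scala
import org.apache.spark.sql.types.DataTypes

// Add a nullable boolean column "z" to default.foo via the catalog API.
tableCatalog.alterTable(
  Identifier.of(Array("default"), "foo"),
  TableChange.addColumn(Array("z"), DataTypes.BooleanType)
)
```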
The table history looks like this:
```
+-------+--------------------+------+--------+-------------+--------------------+----+--------+---------+-----------+--------------+-------------+----------------+------------+--------------------+
|version| timestamp|userId|userName| operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics|userMetadata| engineInfo|
+-------+--------------------+------+--------+-------------+--------------------+----+--------+---------+-----------+--------------+-------------+----------------+------------+--------------------+
| 2|2024-05-03 21:36:...| null| null|CHANGE COLUMN|{column -> {"name...|null| null| null| 1| Serializable| true| {}| null|Apache-Spark/3.4....|
| 1|2024-05-03 21:36:...| null| null|CHANGE COLUMN|{column -> {"name...|null| null| null| 0| Serializable| true| {}| null|Apache-Spark/3.4....|
| 0|2024-05-03 21:36:...| null| null| CREATE TABLE|{isManaged -> fal...|null| null| null| null| Serializable| true| {}| null|Apache-Spark/3.4....|
+-------+--------------------+------+--------+-------------+--------------------+----+--------+---------+-----------+--------------+-------------+----------------+------------+--------------------+
```