[BUG] INSERT INTO with generated columns fails (not enough data columns)
When executing sql("INSERT INTO delta_gencols VALUES 1"), Delta Lake 1.2.1 fails with the following AnalysisException:
org.apache.spark.sql.AnalysisException: Cannot write to 'default.delta_gencols', not enough data columns; target table has 2 column(s) but the inserted data has 1 column(s)
at org.apache.spark.sql.delta.DeltaErrors$.notEnoughColumnsInInsert(DeltaErrors.scala:356)
at org.apache.spark.sql.delta.DeltaAnalysis.org$apache$spark$sql$delta$DeltaAnalysis$$needsSchemaAdjustment(DeltaAnalysis.scala:322)
at org.apache.spark.sql.delta.DeltaAnalysis$$anonfun$apply$1.applyOrElse(DeltaAnalysis.scala:67)
at org.apache.spark.sql.delta.DeltaAnalysis$$anonfun$apply$1.applyOrElse(DeltaAnalysis.scala:64)
The table was created as follows:
import io.delta.tables.DeltaTable
import org.apache.spark.sql.types.DataTypes
val tableName = "delta_gencols"
sql(s"DROP TABLE IF EXISTS $tableName")
DeltaTable.create
  .addColumn("id", DataTypes.LongType, nullable = false)
  .addColumn(
    DeltaTable.columnBuilder("value")
      .dataType(DataTypes.BooleanType)
      .generatedAlwaysAs("true")
      .build)
  .tableName(tableName)
  .execute
I found this conversation with @zsxwing in the delta-users group, but it does not seem to apply (unless I'm mistaken):
The INSERT INTO issue and CREATE TABLE syntax for generated columns are not available in OSS Delta Lake right now because Apache Spark SQL parser doesn't support them.
The Spark DataFrame API works just fine:
spark.range(5).writeTo(tableName).append()
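For side-by-side reference, a minimal sketch of the two paths, assuming the delta_gencols table created above in a spark-shell session:
// The DataFrame API omits the generated column and Delta computes it;
// the SQL INSERT fails analysis with "not enough data columns".
spark.range(5).writeTo(tableName).append() // works
sql(s"INSERT INTO $tableName VALUES 1")    // fails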
Environment information
- Delta Lake version: 1.2.1
- Spark version: 3.2.1
- Scala version: 2.12.15
- Java version: OpenJDK 11.0.14 (64-Bit Server VM)
I'm getting this with a GENERATED IDENTITY column as well.
For the query INSERT INTO delta_gencols VALUES 1 (delta_gencols has the id and value columns as shown in the issue description): we don't support it, as it's not SQL standard. The following two forms are the standard syntax:
- INSERT INTO delta_gencols VALUES (1, DEFAULT). Spark is adding DEFAULT keyword support, and we will work with them to make it work with generated columns as well (https://issues.apache.org/jira/browse/SPARK-38334).
- INSERT INTO delta_gencols (id) VALUES (1). Currently Spark blocks such an INSERT query if the user doesn't provide the entire column list. We are working on this and hope to remove the restriction so that we can make it work with generated columns in Delta.
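For illustration only, a sketch of what those two standard forms would look like from the Scala shell; as noted above, neither is accepted by Spark 3.2.1 / Delta 1.2.1 yet:
// Illustration of the two standard forms described above; neither runs on
// Spark 3.2.1 / Delta 1.2.1 today.
sql("INSERT INTO delta_gencols VALUES (1, DEFAULT)") // pending DEFAULT support (SPARK-38334)
sql("INSERT INTO delta_gencols (id) VALUES (1)")     // pending removal of the full-column-list restriction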
I'm getting this problem with just a regular insertInto in PySpark. Is this the same thing, @zsxwing?
@slyons Yep. That's the same issue.
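For anyone reproducing it, a minimal Scala sketch of the insertInto path (same table as in the issue description; PySpark behaves the same way):
// insertInto resolves columns by position against the full target schema,
// so a single-column DataFrame hits the same "not enough data columns" error.
spark.range(5).write.insertInto(tableName)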
insert into target_table (select <all cols except auto-generated one> from source_table) also fails.
This is the issue I mentioned above: "Currently Spark will block such an INSERT query if the user doesn't provide the entire column list. We are working on this and hope to remove the restriction so that we can make it work with generated columns in Delta."
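Until that restriction is lifted, one possible workaround is to append through the DataFrame API, which lets Delta compute the generated column. A sketch, using the placeholder table and column names from the comment above:
// Workaround sketch: select only the non-generated columns and append via
// the DataFrame API instead of SQL INSERT. Names here are placeholders.
spark.table("source_table")
  .select("id")              // all cols except the auto-generated one
  .writeTo("target_table")
  .append()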
@zsxwing is this a Delta or a Spark issue? If it is a Spark issue, is there already a tracking issue? I did not find any: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20"insert%20not%20enough%20data%20columns"
@keen85 SPARK-41290 is the Spark-side ticket. We will make sure INSERT INTO works with generated columns once the change on the Spark side is finished.