delta
delta copied to clipboard
[Feature Request] SQL syntax for GENERATED columns in OSS
Bug
Using SQL it is not possible to CREATE
or INSERT
into a table with GENERATED
columns with spark 3.2.1 and delta 1.2.0.
Describe the problem
Steps to reproduce
- download latest spark-3.2.1-bin-hadoop3.2-scala2.13
- start with
bin/spark-sql --packages io.delta:delta-core_2.13:1.2.0 --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
- try creating a table with a generated column:
CREATE TABLE `everts` (
event_time TIMESTAMP NOT NULL,
event_month DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
)
USING DELTA
;
- try
INSERT INTO
INSERT INTO TABLE default.events (event_time)
VALUES (timestamp(now()))
;
Observed results
neither the CREATE
nor the INSERT
(for the case the table was created with DeltaTableBuilder
) would succeed.
Expected results
Would be really great if the Delta OSS supports more of the feature announced by Databricks (otherwise it would help to communicate explicitly which features are Databricks exclusive and which are also available in the Delta OSS).
Further details
Could it be that the SQL GENERATED column syntax is supported on databricks only? Or am I missing something?
I cannot find any 'GENERATED ALWAYS ...' related grammar in DeltaSqlBase.g4
either?
Environment information
- Delta Lake version: 1.2.0
- Spark version: 3.2.1
- Scala version: 2.13.5
ok, I found this https://groups.google.com/g/delta-users/c/x2xnHkysIZw - CREATE and INSERT SQL syntax is not supported
Thanks for creating this issue. As mentioned in the email group here, it requires upgrade to Spark 3.4.
Thanks @vkorukanti, I don't see a reference to Spark 3.4 in the above link. Building the current spark master branch (3.4.0-SNAPSHOT) doesn't seem to help either. I guess for the GENERATED column syntax to work, we'll first need to see a change in DeltaSqlParser.scala or SqlBaseParser.sql.
@kpe This is not in Spark yet. We are talking to the Spark community and hope we can get this syntax supported in Spark 3.4. It's a SQL standard so it should not be hard to add the parser support. And you are right. We need to add this to SqlBaseParser.g4.
@kpe This is not in Spark yet. We are talking to the Spark community and hope we can get this syntax supported in Spark 3.4. It's a SQL standard so it should not be hard to add the parser support. And you are right. We need to add this to SqlBaseParser.g4.
Thank you for doing that! Do you have a link to that community discussion? Very eager to get this in!
just as a side note - it is interesting to note, that Apache Iceberg has a similar/related feature "partitioning transform" which uses syntax like:
...
PARTITIONED BY (months(ts), ...)
...
with months()
being a partition transform.
(I guess, very often the GENERATED columns in delta will be used for defining a particular partitioning; in iceberg the syntax for this use case is less explicit, but seems to fit nicely - i.e. if you only need the GENERATED column to set the partitioning, you might not want it to be shown in a SELECT *
; using a PARTITIONED BY (transform(column))
is also quite SQL-idiomatic - this might be worth considering, while discussing the SQL syntax changes with the Spark team).
@kpe Can you clarify something? are you saying that when iceberg uses PARTITIONED BY (transform(column))
, then SELECT *
does not show that generated partition column? Then how do users inspect that column value if something has gone wrong in that generation of its value?
In general, we designed Generated Column as the more general solution to such things that can be used on any column and any expressions, not just for a limited set of expressions and only for partitioning. This approach, along with the GENERATED ALWAYS AS syntax is actually SQL standard compliant (see Mysql, Postgres). APIs that adhere to long-standing SQL standards is usually more SQL-idiomatic than system-specific syntax. It ensures zero-learning curve for any SQL user who are already familiar with such standards when using other SQL engines. This will be our argument for accepting this syntax in the Spark. Finger's crossed that this works!
@tdas Thank you for the explanation! It all make sense, and I'm looking forward to use the GENERATED syntax in delta open source soon.
(without being an iceberg advocate or even a big data practitioner, I guess, iceberg's motivation around their partition handling, is to allow for queries without explicitly specified partitioning information, which could be redundant and error-prone, while allowing for partition schema evolution without breaking existing queries )
@kpe I see! Thank you for giving that additional information! Definitely food for thought for us.
can we expect the SQL syntax for GENERATED columns to become available in Delta 2.0? (now that delta is to be fully open sourced)
Hey @kpe as mentioned earlier in this thread this requires changes in Spark. Making those necessary changes is on our roadmap but we will still have to wait for the next Spark release and upgrade our version.
hi @allisonport-db & @zsxwing any updates concerning including this into Spark 3.4?
Working to get this into 3.4 tracking via SPARK-41290
SPARK-41290 closed 🔥
SPARK-41290 also contains changes required to https://github.com/delta-io/delta/issues/1072?
spark 3.4 supports generated always as
syntax, but delta 2.4 doesn't support it yet, when will it get support?
Any update on this issue, In latest Delta 3.0 also this issue not fixed.