delta icon indicating copy to clipboard operation
delta copied to clipboard

[Feature Request] SQL syntax for GENERATED columns in OSS

Open kpe opened this issue 2 years ago • 11 comments

Bug

Using SQL it is not possible to CREATE or INSERT into a table with GENERATED columns with spark 3.2.1 and delta 1.2.0.

Describe the problem

Steps to reproduce

  1. download latest spark-3.2.1-bin-hadoop3.2-scala2.13
  2. start with
bin/spark-sql --packages io.delta:delta-core_2.13:1.2.0 --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
  1. try creating a table with a generated column:
CREATE TABLE `everts` (
 event_time TIMESTAMP NOT NULL,
 event_month DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
)
USING DELTA
;
  1. try INSERT INTO
INSERT INTO TABLE default.events (event_time) 
VALUES (timestamp(now()))
;

Observed results

neither the CREATE nor the INSERT (for the case the table was created with DeltaTableBuilder) would succeed.

Expected results

Would be really great if the Delta OSS supports more of the feature announced by Databricks (otherwise it would help to communicate explicitly which features are Databricks exclusive and which are also available in the Delta OSS).

Further details

Could it be that the SQL GENERATED column syntax is supported on databricks only? Or am I missing something? I cannot find any 'GENERATED ALWAYS ...' related grammar in DeltaSqlBase.g4 either?

Environment information

  • Delta Lake version: 1.2.0
  • Spark version: 3.2.1
  • Scala version: 2.13.5

kpe avatar Apr 27 '22 01:04 kpe

ok, I found this https://groups.google.com/g/delta-users/c/x2xnHkysIZw - CREATE and INSERT SQL syntax is not supported

kpe avatar Apr 27 '22 18:04 kpe

Thanks for creating this issue. As mentioned in the email group here, it requires upgrade to Spark 3.4.

vkorukanti avatar Apr 28 '22 17:04 vkorukanti

Thanks @vkorukanti, I don't see a reference to Spark 3.4 in the above link. Building the current spark master branch (3.4.0-SNAPSHOT) doesn't seem to help either. I guess for the GENERATED column syntax to work, we'll first need to see a change in DeltaSqlParser.scala or SqlBaseParser.sql.

kpe avatar Apr 29 '22 21:04 kpe

@kpe This is not in Spark yet. We are talking to the Spark community and hope we can get this syntax supported in Spark 3.4. It's a SQL standard so it should not be hard to add the parser support. And you are right. We need to add this to SqlBaseParser.g4.

zsxwing avatar Apr 29 '22 21:04 zsxwing

@kpe This is not in Spark yet. We are talking to the Spark community and hope we can get this syntax supported in Spark 3.4. It's a SQL standard so it should not be hard to add the parser support. And you are right. We need to add this to SqlBaseParser.g4.

Thank you for doing that! Do you have a link to that community discussion? Very eager to get this in!

vegarsti avatar May 11 '22 13:05 vegarsti

just as a side note - it is interesting to note, that Apache Iceberg has a similar/related feature "partitioning transform" which uses syntax like:

...
PARTITIONED BY (months(ts), ...)
...

with months() being a partition transform. (I guess, very often the GENERATED columns in delta will be used for defining a particular partitioning; in iceberg the syntax for this use case is less explicit, but seems to fit nicely - i.e. if you only need the GENERATED column to set the partitioning, you might not want it to be shown in a SELECT *; using a PARTITIONED BY (transform(column)) is also quite SQL-idiomatic - this might be worth considering, while discussing the SQL syntax changes with the Spark team).

kpe avatar May 11 '22 14:05 kpe

@kpe Can you clarify something? are you saying that when iceberg uses PARTITIONED BY (transform(column)), then SELECT * does not show that generated partition column? Then how do users inspect that column value if something has gone wrong in that generation of its value?

In general, we designed Generated Column as the more general solution to such things that can be used on any column and any expressions, not just for a limited set of expressions and only for partitioning. This approach, along with the GENERATED ALWAYS AS syntax is actually SQL standard compliant (see Mysql, Postgres). APIs that adhere to long-standing SQL standards is usually more SQL-idiomatic than system-specific syntax. It ensures zero-learning curve for any SQL user who are already familiar with such standards when using other SQL engines. This will be our argument for accepting this syntax in the Spark. Finger's crossed that this works!

tdas avatar May 11 '22 16:05 tdas

@tdas Thank you for the explanation! It all make sense, and I'm looking forward to use the GENERATED syntax in delta open source soon.

(without being an iceberg advocate or even a big data practitioner, I guess, iceberg's motivation around their partition handling, is to allow for queries without explicitly specified partitioning information, which could be redundant and error-prone, while allowing for partition schema evolution without breaking existing queries )

kpe avatar May 12 '22 07:05 kpe

@kpe I see! Thank you for giving that additional information! Definitely food for thought for us.

tdas avatar May 12 '22 15:05 tdas

can we expect the SQL syntax for GENERATED columns to become available in Delta 2.0? (now that delta is to be fully open sourced)

kpe avatar Jul 03 '22 10:07 kpe

Hey @kpe as mentioned earlier in this thread this requires changes in Spark. Making those necessary changes is on our roadmap but we will still have to wait for the next Spark release and upgrade our version.

allisonport-db avatar Jul 14 '22 00:07 allisonport-db

hi @allisonport-db & @zsxwing any updates concerning including this into Spark 3.4?

keen85 avatar Nov 04 '22 07:11 keen85

Working to get this into 3.4 tracking via SPARK-41290

allisonport-db avatar Nov 28 '22 04:11 allisonport-db

SPARK-41290 closed 🔥

jaceklaskowski avatar Mar 02 '23 14:03 jaceklaskowski

SPARK-41290 also contains changes required to https://github.com/delta-io/delta/issues/1072?

felipepessoto avatar Apr 06 '23 23:04 felipepessoto

SPARK-41290 also contains changes required to #1072?

No. It doesn't add the identity column syntax.

zsxwing avatar Apr 13 '23 18:04 zsxwing

spark 3.4 supports generated always as syntax, but delta 2.4 doesn't support it yet, when will it get support?

kevinclcn avatar May 27 '23 03:05 kevinclcn

Any update on this issue, In latest Delta 3.0 also this issue not fixed.

VeeraswamyGatta avatar Oct 24 '23 17:10 VeeraswamyGatta