gravitino [#3264] feat(spark-connector): Support Iceberg time travel in SQL queries

What changes were proposed in this pull request?

Support Iceberg time travel in SQL queries

Why are the changes needed?

supports time travel in SQL queries using TIMESTAMP AS OF, FOR SYSTEM_TIME AS OF or VERSION AS OF, FOR SYSTEM_VERSION AS OF clauses.

Fix: https://github.com/datastrato/gravitino/issues/3264

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New ITs.

May 04 '24 06:05 caican00

Finally, I still choose to implement Iceberg time travel by overriding newScanBuilder, for the following reasons:

Although SparkIcebergTable extended SparkTable, it still needs to initialize its member variables, such as snapshotId or branch, before it can directly reuse the newScanBuilder implementation of SparkTable.
However, initializing snapshotId or branch is difficult, not as easy as initializing refreshEagerly, because it is difficult to determine snapshotId or branch is initialized in the real sparkTable. Therefore, it is difficult to selectively initialize snapshotId or branch through the super method.

In this case, users specify the version for time travel, but the version may be snapshotId or branch name. https://github.com/apache/iceberg/blob/2058053b0c6e5b1c7e91fa029162f22d109aafb1/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L195-L199

Therefore, overriding newScanBuilder is more convenient and does not introduce too much maintenance burden.

May 15 '24 07:05 caican00

Finally, I still choose to implement Iceberg time travel by overriding newScanBuilder, for the following reasons:

Although SparkIcebergTable extended SparkTable, it still needs to initialize its member variables, such as snapshotId or branch, before it can directly reuse the newScanBuilder implementation of SparkTable.

However, initializing snapshotId or branch is difficult, not as easy as initializing refreshEagerly, because it is difficult to determine snapshotId or branch is initialized in the real sparkTable. Therefore, it is difficult to selectively initialize snapshotId or branch through the super method.

In this case, users specify the version for time travel, but the version may be snapshotId or branch name. https://github.com/apache/iceberg/blob/2058053b0c6e5b1c7e91fa029162f22d109aafb1/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L195-L199

Therefore, overriding newScanBuilder is more convenient and does not introduce too much maintenance burden.

cc @FANNG1

May 15 '24 07:05 caican00

How about Override loadTable(Identifier ident, String version) and loadTable(Identifier ident, String version) for GravitinoIcebergCatalog?

May 16 '24 00:05 FANNG1

How about Override loadTable(Identifier ident, String version) and loadTable(Identifier ident, String version) for GravitinoIcebergCatalog?

in this way, we still have the problem of initializing the snapshotId or branch when invoking super method of SparkTable.

May 16 '24 02:05 caican00

How about Override loadTable(Identifier ident, String version) and loadTable(Identifier ident, String version) for GravitinoIcebergCatalog?

in this way, we still have the problem of initializing the snapshotId or branch when invoking super method of SparkTable.

Let me think a while

May 16 '24 04:05 FANNG1

fixed conflict, and could you please help review again if you are free? Thank you @FANNG1

May 17 '24 10:05 caican00

LGTM, except minor comments

May 20 '24 02:05 FANNG1

LGTM, except minor comments

comments have been addressed, and could you please help review again? @FANNG1

May 20 '24 03:05 caican00

@caican00 , thanks for your contribution!

May 20 '24 05:05 FANNG1

gravitino gravitino copied to clipboard

[#3264] feat(spark-connector): Support Iceberg time travel in SQL queries

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

gravitino
gravitino copied to clipboard