gravitino icon indicating copy to clipboard operation
gravitino copied to clipboard

[#3264] feat(spark-connector): Support Iceberg time travel in SQL queries

Open caican00 opened this issue 9 months ago • 5 comments

What changes were proposed in this pull request?

Support Iceberg time travel in SQL queries

Why are the changes needed?

supports time travel in SQL queries using TIMESTAMP AS OF, FOR SYSTEM_TIME AS OF or VERSION AS OF, FOR SYSTEM_VERSION AS OF clauses.

Fix: https://github.com/datastrato/gravitino/issues/3264

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New ITs.

caican00 avatar May 04 '24 06:05 caican00

Finally, I still choose to implement Iceberg time travel by overriding newScanBuilder, for the following reasons:

  1. Although SparkIcebergTable extended SparkTable, it still needs to initialize its member variables, such as snapshotId or branch, before it can directly reuse the newScanBuilder implementation of SparkTable.

  2. However, initializing snapshotId or branch is difficult, not as easy as initializing refreshEagerly, because it is difficult to determine snapshotId or branch is initialized in the real sparkTable. Therefore, it is difficult to selectively initialize snapshotId or branch through the super method.

In this case, users specify the version for time travel, but the version may be snapshotId or branch name. https://github.com/apache/iceberg/blob/2058053b0c6e5b1c7e91fa029162f22d109aafb1/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L195-L199

  1. Therefore, overriding newScanBuilder is more convenient and does not introduce too much maintenance burden.

caican00 avatar May 15 '24 07:05 caican00

Finally, I still choose to implement Iceberg time travel by overriding newScanBuilder, for the following reasons:

  1. Although SparkIcebergTable extended SparkTable, it still needs to initialize its member variables, such as snapshotId or branch, before it can directly reuse the newScanBuilder implementation of SparkTable.
  2. However, initializing snapshotId or branch is difficult, not as easy as initializing refreshEagerly, because it is difficult to determine snapshotId or branch is initialized in the real sparkTable. Therefore, it is difficult to selectively initialize snapshotId or branch through the super method.

In this case, users specify the version for time travel, but the version may be snapshotId or branch name. https://github.com/apache/iceberg/blob/2058053b0c6e5b1c7e91fa029162f22d109aafb1/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L195-L199

  1. Therefore, overriding newScanBuilder is more convenient and does not introduce too much maintenance burden.

cc @FANNG1

caican00 avatar May 15 '24 07:05 caican00

How about Override loadTable(Identifier ident, String version) and loadTable(Identifier ident, String version) for GravitinoIcebergCatalog?

FANNG1 avatar May 16 '24 00:05 FANNG1

How about Override loadTable(Identifier ident, String version) and loadTable(Identifier ident, String version) for GravitinoIcebergCatalog?

in this way, we still have the problem of initializing the snapshotId or branch when invoking super method of SparkTable.

caican00 avatar May 16 '24 02:05 caican00

How about Override loadTable(Identifier ident, String version) and loadTable(Identifier ident, String version) for GravitinoIcebergCatalog?

in this way, we still have the problem of initializing the snapshotId or branch when invoking super method of SparkTable.

Let me think a while

FANNG1 avatar May 16 '24 04:05 FANNG1

fixed conflict, and could you please help review again if you are free? Thank you @FANNG1

caican00 avatar May 17 '24 10:05 caican00

LGTM, except minor comments

FANNG1 avatar May 20 '24 02:05 FANNG1

LGTM, except minor comments

comments have been addressed, and could you please help review again? @FANNG1

caican00 avatar May 20 '24 03:05 caican00

@caican00 , thanks for your contribution!

FANNG1 avatar May 20 '24 05:05 FANNG1