gravitino
gravitino copied to clipboard
[#3264] feat(spark-connector): Support Iceberg time travel in SQL queries
What changes were proposed in this pull request?
Support Iceberg time travel in SQL queries
Why are the changes needed?
supports time travel in SQL queries using TIMESTAMP AS OF
, FOR SYSTEM_TIME AS OF
or VERSION AS OF
, FOR SYSTEM_VERSION AS OF
clauses.
Fix: https://github.com/datastrato/gravitino/issues/3264
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New ITs.
Finally, I still choose to implement Iceberg time travel
by overriding newScanBuilder
, for the following reasons:
-
Although
SparkIcebergTable
extendedSparkTable
, it still needs to initialize its member variables, such assnapshotId
orbranch
, before it can directly reuse thenewScanBuilder
implementation ofSparkTable
. -
However, initializing
snapshotId
orbranch
is difficult, not as easy as initializingrefreshEagerly
, because it is difficult to determinesnapshotId
orbranch
is initialized in the real sparkTable. Therefore, it is difficult to selectively initializesnapshotId
orbranch
through thesuper
method.
In this case, users specify the version for time travel, but the version may be snapshotId or branch name. https://github.com/apache/iceberg/blob/2058053b0c6e5b1c7e91fa029162f22d109aafb1/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L195-L199
- Therefore, overriding
newScanBuilder
is more convenient and does not introduce too much maintenance burden.
Finally, I still choose to implement Iceberg
time travel
by overridingnewScanBuilder
, for the following reasons:
- Although
SparkIcebergTable
extendedSparkTable
, it still needs to initialize its member variables, such assnapshotId
orbranch
, before it can directly reuse thenewScanBuilder
implementation ofSparkTable
.- However, initializing
snapshotId
orbranch
is difficult, not as easy as initializingrefreshEagerly
, because it is difficult to determinesnapshotId
orbranch
is initialized in the real sparkTable. Therefore, it is difficult to selectively initializesnapshotId
orbranch
through thesuper
method.In this case, users specify the version for time travel, but the version may be snapshotId or branch name. https://github.com/apache/iceberg/blob/2058053b0c6e5b1c7e91fa029162f22d109aafb1/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L195-L199
- Therefore, overriding
newScanBuilder
is more convenient and does not introduce too much maintenance burden.
cc @FANNG1
How about Override loadTable(Identifier ident, String version)
and loadTable(Identifier ident, String version)
for GravitinoIcebergCatalog
?
How about Override
loadTable(Identifier ident, String version)
andloadTable(Identifier ident, String version)
forGravitinoIcebergCatalog
?
in this way, we still have the problem of initializing the snapshotId
or branch
when invoking super
method of SparkTable
.
How about Override
loadTable(Identifier ident, String version)
andloadTable(Identifier ident, String version)
forGravitinoIcebergCatalog
?in this way, we still have the problem of initializing the
snapshotId
orbranch
when invokingsuper
method ofSparkTable
.
Let me think a while
fixed conflict, and could you please help review again if you are free? Thank you @FANNG1
LGTM, except minor comments
LGTM, except minor comments
comments have been addressed, and could you please help review again? @FANNG1
@caican00 , thanks for your contribution!