Spark: Integration to read from a snapshot ref
Issue addressed: https://github.com/apache/iceberg/issues/3899
This PR provides a way to run Spark queries against a snapshot ref.
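A rough sketch of the intended usage, assuming hypothetical `branch` and `tag` read option names (the final option names depend on what lands with this PR):

```java
// Sketch only: the "branch" and "tag" option names are assumptions for illustration.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadFromRefExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("read-from-ref").getOrCreate();

    // Read the table state at the head of a named branch instead of main.
    Dataset<Row> fromBranch = spark.read()
        .format("iceberg")
        .option("branch", "audit-branch")   // assumed option name
        .load("db.table");

    // Read the table state pinned by a tag (a single, immutable snapshot).
    Dataset<Row> fromTag = spark.read()
        .format("iceberg")
        .option("tag", "v1.0")              // assumed option name
        .load("db.table");

    fromBranch.show();
    fromTag.show();
  }
}
```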
@amogh-jahagirdar I pulled in changes from your PR. Let me know if my commit https://github.com/apache/iceberg/pull/5150/commits/8132d205f895208f36344dcc8dcc2794052146ab looks good.
@namrathamyske Thanks for this! High-level comment: I think we should separate the API changes from the Spark integration changes. Also, at the API level, it makes sense to separate useBranch and useTag rather than having a single useSnapshotRef, because branching can be combined with time travel but tagging cannot; that could just be an implementation detail we handle, but semantically it seems cleaner to separate the two.
Let me know what you think. Check out this thread: https://github.com/apache/iceberg/pull/4428#discussion_r838741565
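A minimal sketch of the separation being suggested above; the interface and method names below are hypothetical (mirroring the useBranch/useTag idea), not the committed Iceberg scan API:

```java
// Hypothetical API sketch, not the actual Iceberg scan API.
interface RefAwareScan {
  // Scan from the head of a branch; can still be combined with time travel
  // within that branch's history.
  RefAwareScan useBranch(String branchName);

  // Scan the single snapshot a tag points to; combining this with time travel
  // would be rejected, since a tag is an immutable label for one snapshot.
  RefAwareScan useTag(String tagName);

  // Existing-style time travel; only meaningful together with a branch read here.
  RefAwareScan asOfTime(long timestampMillis);
}
```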
I have this PR for branching + time travel: https://github.com/apache/iceberg/pull/5364. I think we could do a separate one for tagging.
@rdblue @amogh-jahagirdar Thanks for your review! Working on the changes now.
@namrathamyske, it looks like some of these changes went in with #5364. Can you rebase on that?
I came across duplicate implementations of the timestamp option in https://github.com/apache/iceberg/blob/5688d59f6b8ea4784de1f45c9386e13618803673/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java#L185 and https://github.com/apache/iceberg/blob/5688d59f6b8ea4784de1f45c9386e13618803673/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L629, and I am unsure how to proceed for the branch/tag use case. Making the change to read from a branch/tag only in SparkScanBuilder worked for previous Spark versions (3.1, 3.2), but for 3.3 the properties get overridden in SparkCatalog.java before they reach SparkScanBuilder.java. @rdblue @amogh-jahagirdar any thoughts appreciated.
> I am unsure how to proceed for the branch/tag use case. Making the change to read from a branch/tag only in SparkScanBuilder worked for previous Spark versions (3.1, 3.2), but for 3.3 the properties get overridden in SparkCatalog.java before they reach SparkScanBuilder.java.
The current implementation looks fine to me. If those code paths are taken, it indicates that Spark was passed syntax like TIMESTAMP AS OF '...', which should be incompatible with the branch or tag options. This PR already handles that, because the snapshot passed to SparkTable is added to the options.
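A rough sketch of the validation implied here, assuming hypothetical `branch`/`tag` read options next to the existing time-travel options; the option names and the placement of this check are assumptions, not the exact code in this PR:

```java
import java.util.Map;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

class RefOptionValidation {
  // Reject mixing TIMESTAMP AS OF / snapshot-id time travel with a branch or tag read.
  static void validate(Map<String, String> options) {
    boolean timeTravel =
        options.containsKey("as-of-timestamp") || options.containsKey("snapshot-id");
    boolean refRead =
        options.containsKey("branch") || options.containsKey("tag"); // assumed option names

    Preconditions.checkArgument(
        !(timeTravel && refRead),
        "Cannot combine time travel options with a branch or tag read: %s", options);
  }
}
```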
This looks good to me. I'm rerunning CI since the failures don't look related to this.