iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Spark Integration to read from Snapshot ref

Open namrathamyske opened this issue 3 years ago • 3 comments

Issue adressed: https://github.com/apache/iceberg/issues/3899

This PR provides a way to spark query using snapshot ref

namrathamyske avatar Jun 28 '22 19:06 namrathamyske

@amogh-jahagirdar Took a pull from your PR. let me know if my commit https://github.com/apache/iceberg/pull/5150/commits/8132d205f895208f36344dcc8dcc2794052146ab looks good.

namrathamyske avatar Jun 28 '22 19:06 namrathamyske

@namrathamyske Thanks for this! High level comment, I think we should separate the API and the Spark integration changes. Also at the API level, I think it makes sense to separate useBranch and useTag, rather than having one useSnapshotRef , because branching can be combined with time travel, but tagging cannot be; although that could just be an implementation detail we handle. Semantically from an API perspective though it seems cleaner to separate the 2.

Let me know what you think. Checkout this thread https://github.com/apache/iceberg/pull/4428#discussion_r838741565

I have this PR https://github.com/apache/iceberg/pull/5364 for branching + time travel, I think we could do a separate one for tagging.

amogh-jahagirdar avatar Aug 07 '22 17:08 amogh-jahagirdar

@rdblue @amogh-jahagirdar Thanks for your review! working on changes for this!

namrathamyske avatar Aug 07 '22 23:08 namrathamyske

@namrathamyske, it looks like some of these changes went in with #5364. Can you rebase on that?

rdblue avatar Oct 17 '22 03:10 rdblue

I came across duplicate implementations of using timestamp option in https://github.com/apache/iceberg/blob/5688d59f6b8ea4784de1f45c9386e13618803673/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java#L185 and https://github.com/apache/iceberg/blob/5688d59f6b8ea4784de1f45c9386e13618803673/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L629. I am unsure of how to proceed for branches and tags usecase. Just making changes to read from branch/tag in SparkScanBuilder worked before for previous versions of spark - 3.1, 3.2. But for 3.3 , properties get overridden in SparkCatalog.java before reaching SparkScanBuilder.java. @rdblue @amogh-jahagirdar any thoughts appreciated.

namrathamyske avatar Oct 23 '22 04:10 namrathamyske

I am unsure of how to proceed for branches and tags usecase. Just making changes to read from branch/tag in SparkScanBuilder worked before for previous versions of spark - 3.1, 3.2. But for 3.3 , properties get overridden in SparkCatalog.java before reaching SparkScanBuilder.java

The current implementation looks fine to me. If those code paths are taken, it indicates that Spark was passed syntax like TIMESTAMP AS OF '...', which should be incompatible with branch or tag options. This PR already implements that because the snapshot passed to SparkTable is added to options.

rdblue avatar Nov 07 '22 00:11 rdblue

This looks good to me. I'm rerunning CI since the failures don't look related to this.

rdblue avatar Nov 07 '22 00:11 rdblue