iceberg [Parquet] Eagerly fetch row groups when reading parquet

About the changes

This change attempts to prefetch row groups while reading Parquet files. It is the first part of the changes proposed in https://github.com/apache/iceberg/issues/647

cc @jackye1995 @rdblue

Apr 04 '23 19:04 singhpk234

One more idea in this context, at a higher level: while we are reading a task group, can we prefetch the next task, keep it in memory, and expose an iterator around it? Specifically here:

https://github.com/apache/iceberg/blob/49e930877a16bce2df51d6e51b737d2969208644/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/BaseReader.java#L132-L157

That way we don't have to wait for the next task to be read and can ask its iterator for rows directly.
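
To make the idea concrete, here is a rough sketch of what such a wrapper could look like. It is not code from this PR or from Iceberg: `PrefetchingIterator` is a hypothetical name, and each task is modelled as a plain `Callable` that returns the loaded result, which is an assumption made to keep the example self-contained. While the caller consumes task N, a single background thread loads task N+1.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper (not an Iceberg class): loads the next task in the
// background while the caller is still consuming the current one.
final class PrefetchingIterator<T> implements Iterator<T>, AutoCloseable {
  private final Iterator<Callable<T>> loaders;
  private final ExecutorService pool = Executors.newSingleThreadExecutor();
  private Future<T> next; // null once every task has been handed out

  PrefetchingIterator(Iterator<Callable<T>> loaders) {
    this.loaders = loaders;
    this.next = submitNext(); // start loading the first task right away
  }

  private Future<T> submitNext() {
    return loaders.hasNext() ? pool.submit(loaders.next()) : null;
  }

  @Override
  public boolean hasNext() {
    return next != null;
  }

  @Override
  public T next() {
    if (next == null) {
      throw new NoSuchElementException();
    }
    try {
      T value = next.get(); // blocks only if the prefetch has not finished yet
      next = submitNext();  // kick off the following task before returning
      return value;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    } catch (ExecutionException e) {
      throw new RuntimeException(e.getCause());
    }
  }

  @Override
  public void close() {
    pool.shutdownNow(); // drop any prefetch that is still in flight
  }
}
```

At most one extra task result is held in memory at a time, so the memory cost is bounded by the size of a single prefetched task. A real version would also need to close the underlying readers and decide what to do with a prefetched result that is never consumed.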

cc @jackye1995 @rdblue @aokolnychyi please let me know your thoughts on this as well.

Apr 04 '23 19:04 singhpk234

cc @aokolnychyi

Apr 06 '23 17:04 singhpk234

Should we add a parameter to enable or disable this? Prefetching row groups means we need more memory. @singhpk234 are there any benchmark results to show?

Apr 10 '23 09:04 ConeyLiu

> @singhpk234 is there any benchmark result to show?

@ConeyLiu I'm working on the benchmarks.

Apr 13 '23 17:04 singhpk234

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the Iceberg dev mailing list. Thank you for your contributions.

Aug 28 '24 00:08 github-actions[bot]

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

Sep 05 '24 00:09 github-actions[bot]