[Parquet] Eagerly fetch row groups when reading parquet
About the changes
This change attempts to pre-fetch row groups while reading Parquet files. It is the first part of the changes proposed in https://github.com/apache/iceberg/issues/647
cc @jackye1995 @rdblue
One more idea in this context, at a higher level: when reading a task group, could we prefetch the next task, keep it in memory, and expose an iterator over it? I mean specifically here:
https://github.com/apache/iceberg/blob/49e930877a16bce2df51d6e51b737d2969208644/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/BaseReader.java#L132-L157
That way we wouldn't have to wait to open the next task and could ask the iterator for it directly.
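To make the idea concrete, here is a rough sketch of what such an iterator could look like. It is not taken from `BaseReader` or from this PR: `PrefetchingIterator` and its `open` function are hypothetical names, and the sketch assumes that opening a task is an independent, side-effect-free call that can safely run on a background thread.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/**
 * Hypothetical helper (not an Iceberg class): while the caller consumes the
 * result of task N, task N+1 is already being opened on a background thread.
 */
class PrefetchingIterator<T, R> implements Iterator<R>, AutoCloseable {
  private final Iterator<T> tasks;
  private final Function<T, R> open; // assumed: turns a task into its opened result
  private final ExecutorService pool;
  private CompletableFuture<R> next; // null once no tasks remain

  PrefetchingIterator(Iterator<T> tasks, Function<T, R> open) {
    this.tasks = tasks;
    this.open = open;
    // Single daemon thread so a forgotten close() cannot keep the JVM alive.
    this.pool = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "task-prefetch");
      t.setDaemon(true);
      return t;
    });
    this.next = submitNext(); // start opening the first task right away
  }

  private CompletableFuture<R> submitNext() {
    if (!tasks.hasNext()) {
      return null;
    }
    T task = tasks.next();
    return CompletableFuture.supplyAsync(() -> open.apply(task), pool);
  }

  @Override
  public boolean hasNext() {
    return next != null;
  }

  @Override
  public R next() {
    if (next == null) {
      throw new NoSuchElementException();
    }
    CompletableFuture<R> current = next;
    next = submitNext();   // queue the following task before blocking
    return current.join(); // wait only if the prefetch has not finished yet
  }

  @Override
  public void close() {
    pool.shutdownNow();
  }
}
```

With this shape, at most one task beyond the one being consumed is held in memory at a time, which is the extra cost to weigh against the saved wait.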
cc @jackye1995 @rdblue @aokolnychyi please let me know your thoughts on this as well.
cc @aokolnychyi
Should we add a parameter to enable or disable this? Prefetching row groups means more memory is needed. @singhpk234 are there any benchmark results to show?
@ConeyLiu I'm working on the benchmarks.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.