
[Parquet] Eagerly fetch row groups when reading parquet

Open singhpk234 opened this issue 2 years ago • 5 comments

About the changes

This change attempts to prefetch row groups while reading Parquet files. It is the first part of the changes proposed in https://github.com/apache/iceberg/issues/647

cc @jackye1995 @rdblue

singhpk234 avatar Apr 04 '23 19:04 singhpk234

One more idea in this context, at a higher level: when reading a task group, could we prefetch the next task, keep it in memory, and expose an iterator over it? To be specific, I mean here:

https://github.com/apache/iceberg/blob/49e930877a16bce2df51d6e51b737d2969208644/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/BaseReader.java#L132-L157

That way we would not have to wait to read the next task and could ask the iterator for it directly.
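To make the idea concrete, here is a rough sketch of a one-task lookahead iterator. Everything in it is hypothetical and not taken from the Iceberg code base: `PrefetchingTaskIterator` is an invented name, `T` stands in for a task, and the `open` function stands in for whatever opens a reader for that task. It only illustrates the shape of the approach, assuming that opening a task is safe to do on a background thread.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical helper (not an Iceberg class): opens task N+1 in the background while task N is consumed. */
final class PrefetchingTaskIterator<T, R> implements Iterator<R>, AutoCloseable {
  private final Iterator<T> tasks;
  // Assumption: stands in for "open a reader for this task".
  private final Function<T, R> open;
  private final ExecutorService pool = Executors.newSingleThreadExecutor();
  // At most one task is being opened ahead of the consumer.
  private CompletableFuture<R> next;

  PrefetchingTaskIterator(Iterator<T> tasks, Function<T, R> open) {
    this.tasks = tasks;
    this.open = open;
    this.next = submitNext();
  }

  // Called only from the consumer thread, so the task iterator is never shared across threads.
  private CompletableFuture<R> submitNext() {
    if (!tasks.hasNext()) {
      return null;
    }
    T task = tasks.next();
    return CompletableFuture.supplyAsync(() -> open.apply(task), pool);
  }

  @Override
  public boolean hasNext() {
    return next != null;
  }

  @Override
  public R next() {
    if (next == null) {
      throw new NoSuchElementException();
    }
    R ready = next.join();   // blocks only if the prefetch has not finished yet
    next = submitNext();     // start opening the following task right away
    return ready;
  }

  @Override
  public void close() {
    pool.shutdownNow();
  }
}
```

The trade-off is that one extra opened task is held in memory at any time, and a failure while opening surfaces on the next call to `next()` as a `CompletionException` rather than at the point the task was submitted.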

cc @jackye1995 @rdblue @aokolnychyi please let me know your thoughts on this as well.

singhpk234 avatar Apr 04 '23 19:04 singhpk234

cc @aokolnychyi

singhpk234 avatar Apr 06 '23 17:04 singhpk234

Should we add a parameter to enable or disable this? Prefetching row groups means using more memory. @singhpk234 is there any benchmark result to show?
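As a sketch of what such a switch could look like, the snippet below gates prefetching behind a flag and caps how many row groups may be loaded ahead of the reader. The class name `RowGroupPrefetcher`, the `enabled` flag, and the `maxPrefetched` limit are all assumptions made for illustration; none of them exist in Iceberg or parquet-mr, and each row group is modeled as a plain `Supplier` rather than a real Parquet read.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/** Hypothetical sketch (not an Iceberg class): optional, bounded prefetching of row groups. */
final class RowGroupPrefetcher<G> implements Iterator<G>, AutoCloseable {
  // Assumption: one lazy loader per row group; calling get() performs the read.
  private final Iterator<Supplier<G>> loaders;
  // Number of row groups allowed in flight; 0 means prefetching is disabled.
  private final int lookahead;
  private final ExecutorService pool;
  private final Queue<CompletableFuture<G>> inFlight = new ArrayDeque<>();

  RowGroupPrefetcher(Iterator<Supplier<G>> loaders, boolean enabled, int maxPrefetched) {
    this.loaders = loaders;
    this.lookahead = enabled ? Math.max(1, maxPrefetched) : 0;
    this.pool = enabled ? Executors.newSingleThreadExecutor() : null;
    fill();
  }

  // Tops up the in-flight queue, never exceeding the configured lookahead.
  private void fill() {
    while (inFlight.size() < lookahead && loaders.hasNext()) {
      inFlight.add(CompletableFuture.supplyAsync(loaders.next(), pool));
    }
  }

  @Override
  public boolean hasNext() {
    return !inFlight.isEmpty() || loaders.hasNext();
  }

  @Override
  public G next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    if (lookahead == 0) {
      return loaders.next().get();  // disabled: read synchronously, on demand
    }
    G group = inFlight.remove().join();
    fill();
    return group;
  }

  @Override
  public void close() {
    if (pool != null) {
      pool.shutdownNow();
    }
  }
}
```

With prefetching enabled, the reader holds the row group it is consuming plus up to `maxPrefetched` more, so the limit puts a rough upper bound on the extra memory. With the flag off, nothing is read until it is asked for, which matches the behavior before this change.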

ConeyLiu avatar Apr 10 '23 09:04 ConeyLiu

> @singhpk234 is there any benchmark result to show?

@ConeyLiu working on the benchmarks

singhpk234 avatar Apr 13 '23 17:04 singhpk234

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions[bot] avatar Aug 28 '24 00:08 github-actions[bot]

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot] avatar Sep 05 '24 00:09 github-actions[bot]