arrow icon indicating copy to clipboard operation
arrow copied to clipboard

PARQUET-2188: [parquet-cpp] Add SkipRecords API to RecordReader

Open fatemehp opened this issue 3 years ago • 16 comments

The RecordReader is missing an API to skip records. There is a Skip method in the ColumnReader, but that skips based on the number of values/levels and not records. For repeated fields, this SkipRecords API will detect the record boundaries and correctly skip the right number of values for the requested number of records.

fatemehp avatar Sep 15 '22 17:09 fatemehp

@emkornfield could you take a look?

fatemehp avatar Sep 15 '22 17:09 fatemehp

https://issues.apache.org/jira/browse/PARQUET-2188

github-actions[bot] avatar Sep 15 '22 19:09 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Sep 15 '22 19:09 github-actions[bot]

@github-actions autotune

emkornfield avatar Sep 19 '22 17:09 emkornfield

I addressed the comments. @emkornfield could you please take another look.

fatemehp avatar Sep 30 '22 23:09 fatemehp

The failing C++ job looks to be for parquet-reader, I think this is unrelated. @pitrou do you know if this is a known issue?

otherwise LGTM.

emkornfield avatar Oct 03 '22 21:10 emkornfield

Looks like this is an issue with the new test.

emkornfield avatar Oct 03 '22 22:10 emkornfield

I have fixed the issue and now the test passes asan. The rest of warnings seem ok to me. @emkornfield could you take another look?

fatemehp avatar Oct 04 '22 20:10 fatemehp

@pitrou I believe I have addressed all comments. Could you take another look? Thanks!

fatemehp avatar Oct 05 '22 23:10 fatemehp

@pitrou does this pull request look ready to be merged?

fatemehp avatar Oct 10 '22 16:10 fatemehp

@fatemehp I'll take a look when I can, but the 10.0.0 release is nearing and there are other priorities...

pitrou avatar Oct 10 '22 16:10 pitrou

I fixed a performance bug where we kept reading more levels into the buffer with each SkipRecords call. We should first consume what is in the buffer and only read more if we need to. @emkornfield could you take a look at the latest update?

fatemehp avatar Oct 17 '22 16:10 fatemehp

Addressed comments. @emkornfield, @pitrou could you take a look? Thanks!

fatemehp avatar Oct 25 '22 20:10 fatemehp

@pitrou did you want to rereview? Or are you OK if I merge this?

emkornfield avatar Oct 26 '22 20:10 emkornfield

@emkornfield I'll take a look, thank you!

pitrou avatar Oct 26 '22 21:10 pitrou

Thanks for your comments @pitrou. I think I have addressed them all. Please take a look. Thanks!

fatemehp avatar Oct 27 '22 20:10 fatemehp

Fixed lint and merged with latest master. I will wait for CI to run again.

pitrou avatar Oct 31 '22 15:10 pitrou

@pitrou, @emkornfield I think one thing that we did not discuss fully about this PR is that do we want this API to be public or protected? Do we want to allow every client to call SkipRecords at any time? I will be using this API for calling a sequence of Reads and Skips to fully exhaust a column chunk. However, that could be masked behind another public API that allows filtering based on row numbers and does not reveal SkipRecords.

fatemehp avatar Oct 31 '22 20:10 fatemehp

I think this class is already marked as internal which doesn't really provide guarantees, but I think we should probably answer first if we want to move it out of internal namespace?

emkornfield avatar Oct 31 '22 20:10 emkornfield

Benchmark runs are scheduled for baseline = f9ccca6243c2ab10b86f18f5f5caf25ae5ecac6a and contender = 11647857b2860973d4b99cd5d9e7132010089469. 11647857b2860973d4b99cd5d9e7132010089469 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Failed :arrow_down:0.0% :arrow_up:0.0%] test-mac-arm [Finished :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x [Finished :arrow_down:0.54% :arrow_up:0.21%] ursa-thinkcentre-m75q Buildkite builds: [Finished] 11647857 ec2-t3-xlarge-us-east-2 [Failed] 11647857 test-mac-arm [Finished] 11647857 ursa-i9-9960x [Finished] 11647857 ursa-thinkcentre-m75q [Finished] f9ccca62 ec2-t3-xlarge-us-east-2 [Failed] f9ccca62 test-mac-arm [Finished] f9ccca62 ursa-i9-9960x [Finished] f9ccca62 ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Nov 01 '22 02:11 ursabot