ARROW-17287: [C++] Create scan node that doesn't rely on the merged generator

Open westonpace opened this issue 3 years ago • 4 comments

Primary Goal: Create a scanner that "cancels" properly. In other words, when the scan node is marked finished then all scan-related thread tasks will be finished. This is different than the current model where I/O tasks are allowed to keep parts of the scan alive via captures of shared_ptr state.
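To illustrate what "cancels properly" could mean, here is a minimal sketch. It is not the implementation in this change and not an Arrow API: `ScanTaskTracker` and its methods are hypothetical names. The idea is that every scan-related task registers itself before it runs, so the scan can only be marked finished once the in-flight count reaches zero. Nothing stays alive through a captured `shared_ptr`.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical helper (not part of Arrow): counts in-flight scan tasks so that
// "finished" means "no scan-related task is still running".
class ScanTaskTracker {
 public:
  // Register a task. Returns false if the scan was already cancelled, in which
  // case the caller must not start the task.
  bool TryStartTask() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (cancelled_) return false;
    ++running_;
    return true;
  }

  // Must be called exactly once for every successful TryStartTask().
  void FinishTask() {
    std::lock_guard<std::mutex> lock(mutex_);
    assert(running_ > 0);
    --running_;
    if (running_ == 0) cv_.notify_all();
  }

  // Reject new tasks, then block until every registered task has finished.
  void CancelAndWait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cancelled_ = true;
    cv_.wait(lock, [this] { return running_ == 0; });
  }

  std::size_t running() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return running_;
  }

 private:
  mutable std::mutex mutex_;
  std::condition_variable cv_;
  std::size_t running_ = 0;
  bool cancelled_ = false;
};
```

An I/O task would call `TryStartTask()` before doing any work and `FinishTask()` when done. The node would call `CancelAndWait()` before reporting itself finished.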

Secondary Goal: Remove our dependency on the merged generator and make the scanner more accessible. The merged generator is complicated and does not support cancellation, and it is currently understood by only a very small set of people.

Secondary Goal: Add interfaces for schema evolution. This wasn't originally a goal but arose from my attempt to codify and normalize what we are currently doing. These interfaces should eventually allow for things like filling a missing field with a default value or using the parquet column id for field resolution.
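As a rough sketch of the kind of decision an evolution interface has to make, the function below resolves the fields a dataset expects against the fields a fragment actually contains, by name. It is an illustration only: `ResolveFieldsByName` and `kMissingField` are hypothetical names, not the interfaces proposed here. A field resolved to `kMissingField` is one the scanner would have to fill in, for example with nulls or a default value. A resolver keyed on the Parquet column id would have the same shape but compare ids instead of names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sentinel: the fragment has no column for this dataset field.
constexpr int kMissingField = -1;

// Hypothetical illustration (not Arrow's evolution interface). For each field
// the dataset expects, return the index of the matching fragment column, or
// kMissingField if the fragment does not contain it.
std::vector<int> ResolveFieldsByName(
    const std::vector<std::string>& dataset_fields,
    const std::vector<std::string>& fragment_fields) {
  std::vector<int> resolved;
  resolved.reserve(dataset_fields.size());
  for (const std::string& wanted : dataset_fields) {
    int found = kMissingField;
    for (std::size_t i = 0; i < fragment_fields.size(); ++i) {
      if (fragment_fields[i] == wanted) {
        found = static_cast<int>(i);
        break;
      }
    }
    resolved.push_back(found);
  }
  return resolved;
}
```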

Performance isn't a goal for this rework but ideally this should not degrade performance.

westonpace avatar Aug 02 '22 18:08 westonpace

https://issues.apache.org/jira/browse/ARROW-17287

github-actions[bot] avatar Aug 02 '22 18:08 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Aug 02 '22 18:08 github-actions[bot]

This is still very much a draft. However, I have the basic path working. I will need to migrate over the existing scanner tests and test the various failure paths (as well as stress test for race conditions; there is potential for plenty here).

I'm very curious what people think about the new scan options as well as the new evolution interfaces.

westonpace avatar Aug 02 '22 18:08 westonpace

CC @save-buffer @marsupialtail, who might be interested in this as well, since we have spoken on similar topics.

westonpace avatar Aug 02 '22 18:08 westonpace

Benchmark runs are scheduled for baseline = 89c0214fa43f8d1bf2e19e3bae0fc3009df51e15 and contender = ec579df631deaa8f6186208ed2a4ebec00581dfa. ec579df631deaa8f6186208ed2a4ebec00581dfa is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2
[Failed :arrow_down:0.03% :arrow_up:0.0%] test-mac-arm
[Failed :arrow_down:0.27% :arrow_up:0.27%] ursa-i9-9960x
[Finished :arrow_down:1.45% :arrow_up:0.0%] ursa-thinkcentre-m75q

Buildkite builds:
[Finished] ec579df6 ec2-t3-xlarge-us-east-2
[Failed] ec579df6 test-mac-arm
[Failed] ec579df6 ursa-i9-9960x
[Finished] ec579df6 ursa-thinkcentre-m75q
[Finished] 89c0214f ec2-t3-xlarge-us-east-2
[Finished] 89c0214f test-mac-arm
[Failed] 89c0214f ursa-i9-9960x
[Finished] 89c0214f ursa-thinkcentre-m75q

Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Oct 03 '22 14:10 ursabot