arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-17318: [C++][Dataset] Support async streaming interface for getting fragments in Dataset

Open ManManson opened this issue 2 years ago • 4 comments

Add GetFragmentsAsync() and GetFragmentsAsyncImpl() functions to the generic Dataset interface, which allows to produce fragments in a streamed fashion.

This is one of the prerequisites for making FileSystemDataset to support lazy fragment processing, which, in turn, can be used to start scan operations without waiting for the entire dataset to be discovered.

To aid the transition process of moving to async implementation in Dataset/AsyncScanner code, a default implementation for GetFragmentsAsyncImpl() is provided (yielding a VectorGenerator over the fragments vector, which is stored by every implementation of Dataset interface at the moment).

Tests: unit(release)

Signed-off-by: Pavel Solodovnikov [email protected]

ManManson avatar Aug 05 '22 13:08 ManManson

https://issues.apache.org/jira/browse/ARROW-17318

github-actions[bot] avatar Aug 05 '22 13:08 github-actions[bot]

:warning: Ticket has no components in JIRA, make sure you assign one.

github-actions[bot] avatar Aug 05 '22 13:08 github-actions[bot]

Force-pushed the branch, addressed style comments from @westonpace. Diff can be found here: https://github.com/apache/arrow/compare/fae325e7c553cc857ae9d05c757d5ff90e646260..da4235e3c08b46a8ff7a5207a93c03db75ce5515

ManManson avatar Aug 05 '22 16:08 ManManson

Rebased and force-pushed the branch. Unfortunately, I cannot attach a link to the diff because rebasing broke it.

Changes summary:

  • Added utils/async_generator_fwd.h
  • GetFragmentsAsyncImpl() forwards GetFragmentsImpl() to MakeBackgroundGenerator() + MakeTransferredGenerator() instead of .ToVector():ing it.

ManManson avatar Aug 10 '22 19:08 ManManson

@westonpace @pitrou Polite review ping.

ManManson avatar Aug 22 '22 06:08 ManManson

Force-pushed the branch to address review comments from @westonpace. The diff can be found here: https://github.com/apache/arrow/compare/e91396ccf22eec394f23369255a9fd65be60b274..a6c5d5075b8150cf4ecf6874ad59a0e8497af93c

Changelog:

  • Added a non-virtual GetFragmentsAsyncImplBase, which accepts a arrow::internal::Executor* which is used as a destination executor for the background generator.
  • Default implementation of virtual GetFragmentsAsyncImpl now calls GetFragmentsAsyncImplBase with the default executor internal::GetCPUThreadPool().
  • Expanded the comments section for the default impl method.

ManManson avatar Aug 23 '22 13:08 ManManson

Force-pushed the branch to address review comments. The diff can be found there: https://github.com/apache/arrow/compare/a6c5d5075b8150cf4ecf6874ad59a0e8497af93c..b47679f0708ed736e75c0254f5654cae9d9abbe4

Changelog:

  • Fixed include style, reordered includes
  • GetFragmentsAsyncImpl() accepts a executor argument, defaulting to GetCPUThreadPool() in the default implementation of Dataset class
  • Fixed a bug in DatasetFixtureMixin::AssertFragmentEquals() which incorrectly assumed that the fragment scanner would always drain the batch generator
  • Added a test for Dataset::GetFragments by utilizing the AssertDatasetFragmentsEqual() helper, which was unused until now
  • Added a similar test for Dataset::GetFragmentsAsync along with a similar helper DatasetFixtureMixin::AssertDatasetAsyncFragmentsEqual()

ManManson avatar Sep 07 '22 18:09 ManManson

@westonpace @pitrou Polite review ping.

ManManson avatar Sep 20 '22 08:09 ManManson

Benchmark runs are scheduled for baseline = ab71673ce0955798645ae9178018f562a82ed7f2 and contender = 4f31bfc2ffed603089c8bcd3e44ae0950f171126. 4f31bfc2ffed603089c8bcd3e44ae0950f171126 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Failed :arrow_down:0.58% :arrow_up:0.0%] test-mac-arm [Failed :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x [Finished :arrow_down:0.18% :arrow_up:0.07%] ursa-thinkcentre-m75q Buildkite builds: [Finished] 4f31bfc2 ec2-t3-xlarge-us-east-2 [Failed] 4f31bfc2 test-mac-arm [Failed] 4f31bfc2 ursa-i9-9960x [Finished] 4f31bfc2 ursa-thinkcentre-m75q [Finished] ab71673c ec2-t3-xlarge-us-east-2 [Failed] ab71673c test-mac-arm [Failed] ab71673c ursa-i9-9960x [Finished] ab71673c ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Sep 20 '22 14:09 ursabot

['Python', 'R'] benchmarks have high level of regressions. test-mac-arm

ursabot avatar Sep 20 '22 14:09 ursabot