ARROW-17318: [C++][Dataset] Support async streaming interface for getting fragments in Dataset
Add GetFragmentsAsync() and GetFragmentsAsyncImpl() functions to the generic Dataset interface, which allow fragments to be produced in a streamed fashion.
This is one of the prerequisites for making FileSystemDataset support lazy fragment processing, which, in turn, can be used to start scan operations without waiting for the entire dataset to be discovered.
To aid the transition to an async implementation in the Dataset/AsyncScanner code, a default implementation of GetFragmentsAsyncImpl() is provided (yielding a VectorGenerator over the fragments vector, which every implementation of the Dataset interface stores at the moment).
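To illustrate the idea of a default implementation that streams from an already-materialized vector, here is a minimal standalone sketch. It does not use Arrow: `Fragment`, `AsyncGenerator`, `MakeVectorGeneratorSketch` and `CollectAll` are hypothetical names modeled on the description above, and `std::future` stands in for Arrow's own future type.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a dataset fragment; not Arrow's Fragment class.
struct Fragment {
  std::string path;
};

// Simplified model of an async generator: each call returns a future that
// resolves to the next item, or to std::nullopt once the stream is exhausted.
template <typename T>
using AsyncGenerator = std::function<std::future<std::optional<T>>()>;

// Builds a generator over an in-memory vector. Every future is already
// resolved when it is returned, because the data is available up front.
template <typename T>
AsyncGenerator<T> MakeVectorGeneratorSketch(std::vector<T> items) {
  auto state = std::make_shared<std::vector<T>>(std::move(items));
  auto index = std::make_shared<std::size_t>(0);
  return [state, index]() {
    std::promise<std::optional<T>> promise;
    if (*index < state->size()) {
      promise.set_value((*state)[(*index)++]);
    } else {
      promise.set_value(std::nullopt);  // end-of-stream marker
    }
    return promise.get_future();
  };
}

// Drains a generator synchronously and collects everything it produced.
template <typename T>
std::vector<T> CollectAll(AsyncGenerator<T> gen) {
  std::vector<T> out;
  while (auto item = gen().get()) {
    out.push_back(std::move(*item));
  }
  return out;
}
```

A consumer written against such a generator interface does not need to change when an implementation later starts producing fragments lazily, which is the point of providing the default first.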
Tests: unit(release)
Signed-off-by: Pavel Solodovnikov [email protected]
https://issues.apache.org/jira/browse/ARROW-17318
:warning: Ticket has no components in JIRA, make sure you assign one.
Force-pushed the branch, addressed style comments from @westonpace. Diff can be found here: https://github.com/apache/arrow/compare/fae325e7c553cc857ae9d05c757d5ff90e646260..da4235e3c08b46a8ff7a5207a93c03db75ce5515
Rebased and force-pushed the branch. Unfortunately, I cannot attach a link to the diff because rebasing broke it.
Changes summary:
- Added utils/async_generator_fwd.h
- GetFragmentsAsyncImpl() forwards GetFragmentsImpl() to MakeBackgroundGenerator() + MakeTransferredGenerator() instead of .ToVector():ing it.
@westonpace @pitrou Polite review ping.
Force-pushed the branch to address review comments from @westonpace. The diff can be found here: https://github.com/apache/arrow/compare/e91396ccf22eec394f23369255a9fd65be60b274..a6c5d5075b8150cf4ecf6874ad59a0e8497af93c
Changelog:
- Added a non-virtual GetFragmentsAsyncImplBase, which accepts an arrow::internal::Executor* that is used as the destination executor for the background generator.
- Default implementation of virtual GetFragmentsAsyncImpl now calls GetFragmentsAsyncImplBase with the default executor internal::GetCPUThreadPool().
- Expanded the comments section for the default impl method.
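The split between a non-virtual helper that takes an explicit executor and a virtual method that supplies a default one can be sketched roughly as follows. This is a standalone illustration, not the PR's code: `Executor`, `QueueExecutor`, `GetDefaultExecutorSketch`, `DatasetSketch` and `FragmentCallback` are hypothetical, fragments are plain strings, and a callback replaces the async generator to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal executor interface; Arrow's real Executor is richer.
class Executor {
 public:
  virtual ~Executor() = default;
  virtual void Spawn(std::function<void()> task) = 0;
};

// Single-threaded executor that queues tasks until RunAll() is called.
class QueueExecutor : public Executor {
 public:
  void Spawn(std::function<void()> task) override {
    tasks_.push_back(std::move(task));
  }

  // Runs every queued task in FIFO order and returns how many were run.
  std::size_t RunAll() {
    std::size_t count = 0;
    while (!tasks_.empty()) {
      auto task = std::move(tasks_.front());
      tasks_.pop_front();
      task();
      ++count;
    }
    return count;
  }

 private:
  std::deque<std::function<void()>> tasks_;
};

// Stand-in for a process-wide default executor (assumption of this sketch).
inline QueueExecutor* GetDefaultExecutorSketch() {
  static QueueExecutor executor;
  return &executor;
}

using FragmentCallback = std::function<void(const std::string&)>;

class DatasetSketch {
 public:
  explicit DatasetSketch(std::vector<std::string> fragments)
      : fragments_(std::move(fragments)) {}
  virtual ~DatasetSketch() = default;

  // Overridable entry point; the default delegates to the non-virtual helper
  // using the default executor.
  virtual void GetFragmentsAsyncImpl(FragmentCallback on_fragment) {
    GetFragmentsAsyncImplBase(GetDefaultExecutorSketch(), std::move(on_fragment));
  }

  // Non-virtual helper: delivers each fragment on the executor it is given.
  // Kept public here only so the sketch is easy to exercise.
  void GetFragmentsAsyncImplBase(Executor* executor, FragmentCallback on_fragment) {
    assert(executor != nullptr);
    for (const auto& fragment : fragments_) {
      executor->Spawn([fragment, on_fragment]() { on_fragment(fragment); });
    }
  }

 private:
  std::vector<std::string> fragments_;
};
```

With this shape, a subclass that only wants a different destination executor can call the helper directly, while a subclass with a genuinely lazy source can override the virtual method entirely.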
Force-pushed the branch to address review comments. The diff can be found there: https://github.com/apache/arrow/compare/a6c5d5075b8150cf4ecf6874ad59a0e8497af93c..b47679f0708ed736e75c0254f5654cae9d9abbe4
Changelog:
- Fixed include style, reordered includes
- GetFragmentsAsyncImpl() accepts an executor argument, defaulting to GetCPUThreadPool() in the default implementation of the Dataset class
- Fixed a bug in DatasetFixtureMixin::AssertFragmentEquals() which incorrectly assumed that the fragment scanner would always drain the batch generator
- Added a test for Dataset::GetFragments by utilizing the AssertDatasetFragmentsEqual() helper, which was unused until now
- Added a similar test for Dataset::GetFragmentsAsync along with a similar helper DatasetFixtureMixin::AssertDatasetAsyncFragmentsEqual()
@westonpace @pitrou Polite review ping.
Benchmark runs are scheduled for baseline = ab71673ce0955798645ae9178018f562a82ed7f2 and contender = 4f31bfc2ffed603089c8bcd3e44ae0950f171126. 4f31bfc2ffed603089c8bcd3e44ae0950f171126 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2
[Failed :arrow_down:0.58% :arrow_up:0.0%] test-mac-arm
[Failed :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x
[Finished :arrow_down:0.18% :arrow_up:0.07%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 4f31bfc2 ec2-t3-xlarge-us-east-2
[Failed] 4f31bfc2 test-mac-arm
[Failed] 4f31bfc2 ursa-i9-9960x
[Finished] 4f31bfc2 ursa-thinkcentre-m75q
[Finished] ab71673c ec2-t3-xlarge-us-east-2
[Failed] ab71673c test-mac-arm
[Failed] ab71673c ursa-i9-9960x
[Finished] ab71673c ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
['Python', 'R'] benchmarks have high level of regressions. test-mac-arm