arrow
arrow copied to clipboard
ARROW-17306: [C++] Provide an optimized `GetFileInfoGenerator` specialization for `LocalFileSystem`
Introduce a helper class AsyncStatSelector
, which contains
an optimized specialization for GetFileInfoGenerator
in the
LocalFileSystem
class.
There are two variants of async discovery functions suported:
-
DiscoverPartitionFiles
, which parallelizes traversal of individual directories so that each directory results are yielded as a separateFileInfoGenerator
via an underlyingDiscoveryImplIterator
, which delivers items in chunks (default size iskDefaultFileInfoBatchSize == 1K
items). -
DiscoverPartitionsFlattened
, which forwards execution to theDiscoverPartitionFiles
, with the difference that the results from individual sub-directory iterators are merged into the single FileInfoGenerator stream.
This option is disabled by default, so that individual
directories are processed in serial manner via
MakeConcatenatedGenerator
under the hood.
Also, internal batching can also be configured by
LocalFileSystemOptions::file_info_batch_size
, which specifies
how many FileInfo
:s should be batched into a single
FileInfoVector
to yield in a single DirectoryImplIterator::Next()
invocation.
This implementation uses an auxiliary class named
AutoClosingProducer
to make sure that the producer
side of internal PushGenerator
stays open until
all of the DiscoveryImplIterator
:s are exhausted.
It is done by wrapping the producer into a shared state,
which Close()
:s the producer once ref-count for wrapped
producer reaches zero.
Tests: unit(release)
Signed-off-by: Pavel Solodovnikov [email protected] Co-Authored-by: Igor Seliverstov [email protected]
https://issues.apache.org/jira/browse/ARROW-17306
Force-pushed the branch to fix some build issues on macOS and Windows.
Changelog can be found here: https://github.com/apache/arrow/compare/b6714018c4384cd00f0b9a4e92d804671d1d381b..3f2b141dea8ded351d41d687245917821295277a?diff=unified
I believe appveyor failures don't have anything to do with my patches... Also, "Dev / Source Merge Script" jobs seem to fail due to some issues related to 10.0.0 freeze?
Agreed that the test failures are unrelated. The Appveyor issues appears to be addressed already by https://github.com/apache/arrow/pull/13795
Force-pushed the branch to address review comments. Diff can be found here: https://github.com/apache/arrow/compare/3f2b141dea8ded351d41d687245917821295277a..136ae80540f851219c8d9a920470d4863a428d30
Changelog:
- Rename
partitions_readahead
todirectory_readahead
, as well as the methods inAsyncStatSelector
to get rid ofPartitions
terminology. - Moved
directory_readahead
andbatch_size
fromFileSelector
toLocalFileSystemOptions
. - Removed use of
util::optional
from these options by providing sensible defaults (1 fordirectory_readahead
and 1000 forbatch_size
). - Optimize memory allocation for internal chunking.
Thanks for the update, I posted a couple comments.
Also, can you add some localfs-specific tests for this? Ideally you would stress both with and without directory readahead...
Sure, I'll add some stress tests for the feature to cover both batching and processing parallelism options.
Force-pushed the branch to address review comments from @pitrou and @westonpace . The diff can be found here: https://github.com/apache/arrow/compare/136ae80540f851219c8d9a920470d4863a428d30..ac2954d39b5de052b3c05563d290f00c23b51b34, also pushed a fixup to amend commit message.
Changelog:
- Change new option fields types from
uint32_t
toint32_t
- Renamed
batch_size
tofile_info_batch_size
- Incorporated some code simplifications proposed by @pitrou
- Added a benchmark for the new impl of
LocalFileSystem::GetFileInfoGenerator
- Discovered a bug with prematurely closed producer while running the benchmark, somehow it escaped manual testing until now... Introduced
AutoClosingProducer
to solve the problem
Force-pushed the branch to address review comments from @pitrou The diff can be found there: https://github.com/apache/arrow/compare/affcca7fa308c897ff5a358d1668486abed6c5fa..09142855f750ea046fb081c17d23b5fa675f7e9f
Changelog:
- Removed
AutoClosingProducer
in favor of a more simplistic and correctstd::shared_ptr<DiscoveryState>
approach - Sub-generators are scheduled on the same IO executor so that we don't do redundant transitions between CPU and IO threads
- Pass
io_executor
from theLocalFileSystem
down to the discovery algorithm, so that we don't have to use theio::get_default_io_context()
- Fixes to comments
- Minor code fixes (e.g. swapping current chunk contens in
DiscoveryImplIterator
) proposed by @pitrou - Remove unused
boost
dependencies for localfs benchmark - Remove dependency on parquet in the localfs benchmark
- Reduce the number of argument ranges to the benchmark
I'm running this benchmark locally on Ubuntu 20.04, 24-thread CPU, ext4 filesystem on a fast SSD.
Out of curiosity I added different dataset sizes into the mix as well as file_info_batch_size = 10
.
- With
num_files_ = 10000
(total ~1 million files):
-------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------------
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:10/real_time 2668 ms 0.872 ms 1 items_per_second=416.072k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:10/real_time 2301 ms 0.857 ms 1 items_per_second=482.503k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:10/real_time 2344 ms 0.784 ms 1 items_per_second=473.564k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:100/real_time 2312 ms 0.799 ms 1 items_per_second=480.162k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:100/real_time 1555 ms 0.717 ms 1 items_per_second=713.993k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:100/real_time 1509 ms 0.698 ms 1 items_per_second=735.592k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:1000/real_time 2676 ms 0.880 ms 1 items_per_second=414.832k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:1000/real_time 634 ms 0.792 ms 1 items_per_second=1.75228M/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:1000/real_time 285 ms 0.764 ms 3 items_per_second=3.89016M/s
- With
num_files_ = 1000
(total ~100 thousands files):
-------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------------
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:10/real_time 261 ms 0.086 ms 3 items_per_second=425.581k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:10/real_time 186 ms 0.080 ms 4 items_per_second=595.899k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:10/real_time 185 ms 0.088 ms 4 items_per_second=601.398k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:100/real_time 296 ms 0.096 ms 2 items_per_second=375.29k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:100/real_time 66.4 ms 0.081 ms 10 items_per_second=1.6738M/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:100/real_time 29.6 ms 0.067 ms 24 items_per_second=3.74895M/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:1000/real_time 304 ms 0.091 ms 2 items_per_second=365.748k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:1000/real_time 70.2 ms 0.085 ms 10 items_per_second=1.58328M/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:1000/real_time 32.7 ms 0.051 ms 21 items_per_second=3.40023M/s
- With
num_files_ = 10
(total ~1000 files):
-------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------------
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:10/real_time 4.24 ms 0.021 ms 187 items_per_second=287.682k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:10/real_time 1.87 ms 0.025 ms 352 items_per_second=650.697k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:10/real_time 1.59 ms 0.024 ms 455 items_per_second=769.645k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:100/real_time 6.21 ms 0.026 ms 106 items_per_second=196.586k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:100/real_time 1.86 ms 0.024 ms 364 items_per_second=657.369k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:100/real_time 1.60 ms 0.024 ms 435 items_per_second=762.434k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:1000/real_time 6.18 ms 0.022 ms 110 items_per_second=197.411k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:1000/real_time 1.86 ms 0.025 ms 378 items_per_second=657.252k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:1000/real_time 1.58 ms 0.024 ms 463 items_per_second=774.176k/s
- With
num_files_ = 10, num_dirs_ = 1
(total ~30 files):
-------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------------
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:10/real_time 0.105 ms 0.015 ms 6795 items_per_second=304.24k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:10/real_time 0.133 ms 0.021 ms 5239 items_per_second=240.955k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:10/real_time 0.133 ms 0.021 ms 5241 items_per_second=240.891k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:100/real_time 0.157 ms 0.018 ms 4352 items_per_second=203.893k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:100/real_time 0.131 ms 0.021 ms 5290 items_per_second=244.757k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:100/real_time 0.138 ms 0.022 ms 5471 items_per_second=231.11k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:1000/real_time 0.158 ms 0.018 ms 4294 items_per_second=201.957k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:1000/real_time 0.131 ms 0.022 ms 5323 items_per_second=243.422k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:1000/real_time 0.134 ms 0.021 ms 5188 items_per_second=239.693k/s
These numbers seem to support a default readahead of 16 and a default batch size of 1000.
@pitrou Your numbers are in line with what I get on my machine. I agree we can tune the defaults to be kDefaultDirectoryReadahead = 16
and kDefaultFileInfoBatchSize = 1000
.
Force-pushed the branch to address some follow-up comments. The diff can be found here: https://github.com/apache/arrow/compare/09142855f750ea046fb081c17d23b5fa675f7e9f..eb77fad86834a6c10244eb924197e949b7faba3f
Changelog:
- Set
kDefaultDirectoryReadahead
to 16 instead of 1, based on benchmarking results - Fixed
SetItemsProcessed
handling in the localfs benchmark - Reduce
num_files_
from 10k to 1k in the localfs benchmark so that it stays fast enough for a wider range of hw configurations
I'll take a last look here.
Benchmark runs are scheduled for baseline = bc52f9f0e582474501e92e6a281f0110754a8af1 and contender = a1c3d57af514d4a84e753ff51df8e563135ee55e. a1c3d57af514d4a84e753ff51df8e563135ee55e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2
[Failed :arrow_down:2.72% :arrow_up:3.16%] test-mac-arm
[Failed :arrow_down:1.92% :arrow_up:1.37%] ursa-i9-9960x
[Finished :arrow_down:10.22% :arrow_up:8.31%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] a1c3d57a
ec2-t3-xlarge-us-east-2
[Failed] a1c3d57a
test-mac-arm
[Failed] a1c3d57a
ursa-i9-9960x
[Finished] a1c3d57a
ursa-thinkcentre-m75q
[Finished] bc52f9f0
ec2-t3-xlarge-us-east-2
[Finished] bc52f9f0
test-mac-arm
[Failed] bc52f9f0
ursa-i9-9960x
[Finished] bc52f9f0
ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
['Python', 'R'] benchmarks have high level of regressions. ursa-i9-9960x