arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-17306: [C++] Provide an optimized `GetFileInfoGenerator` specialization for `LocalFileSystem`

Open ManManson opened this issue 2 years ago • 7 comments

Introduce a helper class AsyncStatSelector, which contains an optimized specialization for GetFileInfoGenerator in the LocalFileSystem class.

There are two variants of async discovery functions suported:

  1. DiscoverPartitionFiles, which parallelizes traversal of individual directories so that each directory results are yielded as a separate FileInfoGenerator via an underlying DiscoveryImplIterator, which delivers items in chunks (default size is kDefaultFileInfoBatchSize == 1K items).
  2. DiscoverPartitionsFlattened, which forwards execution to the DiscoverPartitionFiles, with the difference that the results from individual sub-directory iterators are merged into the single FileInfoGenerator stream.

This option is disabled by default, so that individual directories are processed in serial manner via MakeConcatenatedGenerator under the hood.

Also, internal batching can also be configured by LocalFileSystemOptions::file_info_batch_size, which specifies how many FileInfo:s should be batched into a single FileInfoVector to yield in a single DirectoryImplIterator::Next() invocation.

This implementation uses an auxiliary class named AutoClosingProducer to make sure that the producer side of internal PushGenerator stays open until all of the DiscoveryImplIterator:s are exhausted.

It is done by wrapping the producer into a shared state, which Close():s the producer once ref-count for wrapped producer reaches zero.

Tests: unit(release)

Signed-off-by: Pavel Solodovnikov [email protected] Co-Authored-by: Igor Seliverstov [email protected]

ManManson avatar Aug 04 '22 10:08 ManManson

https://issues.apache.org/jira/browse/ARROW-17306

github-actions[bot] avatar Aug 04 '22 11:08 github-actions[bot]

Force-pushed the branch to fix some build issues on macOS and Windows.

Changelog can be found here: https://github.com/apache/arrow/compare/b6714018c4384cd00f0b9a4e92d804671d1d381b..3f2b141dea8ded351d41d687245917821295277a?diff=unified

ManManson avatar Aug 04 '22 11:08 ManManson

I believe appveyor failures don't have anything to do with my patches... Also, "Dev / Source Merge Script" jobs seem to fail due to some issues related to 10.0.0 freeze?

ManManson avatar Aug 05 '22 16:08 ManManson

Agreed that the test failures are unrelated. The Appveyor issues appears to be addressed already by https://github.com/apache/arrow/pull/13795

westonpace avatar Aug 05 '22 19:08 westonpace

Force-pushed the branch to address review comments. Diff can be found here: https://github.com/apache/arrow/compare/3f2b141dea8ded351d41d687245917821295277a..136ae80540f851219c8d9a920470d4863a428d30

Changelog:

  • Rename partitions_readahead to directory_readahead, as well as the methods in AsyncStatSelector to get rid of Partitions terminology.
  • Moved directory_readahead and batch_size from FileSelector to LocalFileSystemOptions.
  • Removed use of util::optional from these options by providing sensible defaults (1 for directory_readahead and 1000 for batch_size).
  • Optimize memory allocation for internal chunking.

ManManson avatar Aug 08 '22 07:08 ManManson

Thanks for the update, I posted a couple comments.

Also, can you add some localfs-specific tests for this? Ideally you would stress both with and without directory readahead...

Sure, I'll add some stress tests for the feature to cover both batching and processing parallelism options.

ManManson avatar Aug 08 '22 08:08 ManManson

Force-pushed the branch to address review comments from @pitrou and @westonpace . The diff can be found here: https://github.com/apache/arrow/compare/136ae80540f851219c8d9a920470d4863a428d30..ac2954d39b5de052b3c05563d290f00c23b51b34, also pushed a fixup to amend commit message.

Changelog:

  • Change new option fields types from uint32_t to int32_t
  • Renamed batch_size to file_info_batch_size
  • Incorporated some code simplifications proposed by @pitrou
  • Added a benchmark for the new impl of LocalFileSystem::GetFileInfoGenerator
  • Discovered a bug with prematurely closed producer while running the benchmark, somehow it escaped manual testing until now... Introduced AutoClosingProducer to solve the problem

ManManson avatar Aug 12 '22 08:08 ManManson

Force-pushed the branch to address review comments from @pitrou The diff can be found there: https://github.com/apache/arrow/compare/affcca7fa308c897ff5a358d1668486abed6c5fa..09142855f750ea046fb081c17d23b5fa675f7e9f

Changelog:

  • Removed AutoClosingProducer in favor of a more simplistic and correct std::shared_ptr<DiscoveryState> approach
  • Sub-generators are scheduled on the same IO executor so that we don't do redundant transitions between CPU and IO threads
  • Pass io_executor from the LocalFileSystem down to the discovery algorithm, so that we don't have to use the io::get_default_io_context()
  • Fixes to comments
  • Minor code fixes (e.g. swapping current chunk contens in DiscoveryImplIterator) proposed by @pitrou
  • Remove unused boost dependencies for localfs benchmark
  • Remove dependency on parquet in the localfs benchmark
  • Reduce the number of argument ranges to the benchmark

ManManson avatar Aug 16 '22 20:08 ManManson

I'm running this benchmark locally on Ubuntu 20.04, 24-thread CPU, ext4 filesystem on a fast SSD. Out of curiosity I added different dataset sizes into the mix as well as file_info_batch_size = 10.

  • With num_files_ = 10000 (total ~1 million files):
-------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------------
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:10/real_time          2668 ms        0.872 ms            1 items_per_second=416.072k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:10/real_time          2301 ms        0.857 ms            1 items_per_second=482.503k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:10/real_time         2344 ms        0.784 ms            1 items_per_second=473.564k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:100/real_time         2312 ms        0.799 ms            1 items_per_second=480.162k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:100/real_time         1555 ms        0.717 ms            1 items_per_second=713.993k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:100/real_time        1509 ms        0.698 ms            1 items_per_second=735.592k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:1000/real_time        2676 ms        0.880 ms            1 items_per_second=414.832k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:1000/real_time         634 ms        0.792 ms            1 items_per_second=1.75228M/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:1000/real_time        285 ms        0.764 ms            3 items_per_second=3.89016M/s
  • With num_files_ = 1000 (total ~100 thousands files):
-------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------------
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:10/real_time           261 ms        0.086 ms            3 items_per_second=425.581k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:10/real_time           186 ms        0.080 ms            4 items_per_second=595.899k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:10/real_time          185 ms        0.088 ms            4 items_per_second=601.398k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:100/real_time          296 ms        0.096 ms            2 items_per_second=375.29k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:100/real_time         66.4 ms        0.081 ms           10 items_per_second=1.6738M/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:100/real_time        29.6 ms        0.067 ms           24 items_per_second=3.74895M/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:1000/real_time         304 ms        0.091 ms            2 items_per_second=365.748k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:1000/real_time        70.2 ms        0.085 ms           10 items_per_second=1.58328M/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:1000/real_time       32.7 ms        0.051 ms           21 items_per_second=3.40023M/s
  • With num_files_ = 10 (total ~1000 files):
-------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------------
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:10/real_time          4.24 ms        0.021 ms          187 items_per_second=287.682k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:10/real_time          1.87 ms        0.025 ms          352 items_per_second=650.697k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:10/real_time         1.59 ms        0.024 ms          455 items_per_second=769.645k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:100/real_time         6.21 ms        0.026 ms          106 items_per_second=196.586k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:100/real_time         1.86 ms        0.024 ms          364 items_per_second=657.369k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:100/real_time        1.60 ms        0.024 ms          435 items_per_second=762.434k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:1000/real_time        6.18 ms        0.022 ms          110 items_per_second=197.411k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:1000/real_time        1.86 ms        0.025 ms          378 items_per_second=657.252k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:1000/real_time       1.58 ms        0.024 ms          463 items_per_second=774.176k/s
  • With num_files_ = 10, num_dirs_ = 1 (total ~30 files):
-------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------------
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:10/real_time         0.105 ms        0.015 ms         6795 items_per_second=304.24k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:10/real_time         0.133 ms        0.021 ms         5239 items_per_second=240.955k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:10/real_time        0.133 ms        0.021 ms         5241 items_per_second=240.891k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:100/real_time        0.157 ms        0.018 ms         4352 items_per_second=203.893k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:100/real_time        0.131 ms        0.021 ms         5290 items_per_second=244.757k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:100/real_time       0.138 ms        0.022 ms         5471 items_per_second=231.11k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:1/file_info_batch_size:1000/real_time       0.158 ms        0.018 ms         4294 items_per_second=201.957k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:4/file_info_batch_size:1000/real_time       0.131 ms        0.022 ms         5323 items_per_second=243.422k/s
LocalFSFixture/AsyncFileDiscovery/directory_readahead:16/file_info_batch_size:1000/real_time      0.134 ms        0.021 ms         5188 items_per_second=239.693k/s

These numbers seem to support a default readahead of 16 and a default batch size of 1000.

pitrou avatar Aug 17 '22 09:08 pitrou

@pitrou Your numbers are in line with what I get on my machine. I agree we can tune the defaults to be kDefaultDirectoryReadahead = 16 and kDefaultFileInfoBatchSize = 1000.

ManManson avatar Aug 17 '22 16:08 ManManson

Force-pushed the branch to address some follow-up comments. The diff can be found here: https://github.com/apache/arrow/compare/09142855f750ea046fb081c17d23b5fa675f7e9f..eb77fad86834a6c10244eb924197e949b7faba3f

Changelog:

  • Set kDefaultDirectoryReadahead to 16 instead of 1, based on benchmarking results
  • Fixed SetItemsProcessed handling in the localfs benchmark
  • Reduce num_files_ from 10k to 1k in the localfs benchmark so that it stays fast enough for a wider range of hw configurations

ManManson avatar Aug 18 '22 07:08 ManManson

I'll take a last look here.

pitrou avatar Aug 18 '22 10:08 pitrou

Benchmark runs are scheduled for baseline = bc52f9f0e582474501e92e6a281f0110754a8af1 and contender = a1c3d57af514d4a84e753ff51df8e563135ee55e. a1c3d57af514d4a84e753ff51df8e563135ee55e is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Failed :arrow_down:2.72% :arrow_up:3.16%] test-mac-arm [Failed :arrow_down:1.92% :arrow_up:1.37%] ursa-i9-9960x [Finished :arrow_down:10.22% :arrow_up:8.31%] ursa-thinkcentre-m75q Buildkite builds: [Finished] a1c3d57a ec2-t3-xlarge-us-east-2 [Failed] a1c3d57a test-mac-arm [Failed] a1c3d57a ursa-i9-9960x [Finished] a1c3d57a ursa-thinkcentre-m75q [Finished] bc52f9f0 ec2-t3-xlarge-us-east-2 [Finished] bc52f9f0 test-mac-arm [Failed] bc52f9f0 ursa-i9-9960x [Finished] bc52f9f0 ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Aug 18 '22 21:08 ursabot

['Python', 'R'] benchmarks have high level of regressions. ursa-i9-9960x

ursabot avatar Aug 18 '22 21:08 ursabot