ARROW-17735: [C++][Parquet] Optimize parquet reading for String/Binary type
Improve Parquet reading performance for the String/Binary types by building arrays with direct Buffer operations instead of BinaryArrayBuilder.
https://issues.apache.org/jira/browse/ARROW-17735
:warning: Ticket has not been started in JIRA, please click 'Start Progress'.
Hi @zhixingheyi-tian, sorry this hasn't gotten reviewer attention for a while. Are you still interested in working on this?
Would you mind rebasing and cleaning up any commented code?
In addition, I noticed there aren't any benchmarks for binary or string types in either of these two benchmark files:
- src/parquet/arrow/reader_writer_benchmark.cc
- src/parquet/parquet_encoding_benchmark.cc
We'll want some benchmarks in order to evaluate the changes. If you add those, then you can compare benchmarks locally like so:
```shell
BUILD_DIR=cpp/path/to/build
archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=contender.json
git checkout master  # Or some ref that has benchmarks but not your changes
archery benchmark run $BUILD_DIR \
  --suite-filter=parquet-arrow-reader-writer-benchmark \
  --output=baseline.json
archery benchmark diff contender.json baseline.json
```
@wjones127, I will focus on this patch soon. Reviews welcome! Thanks!
Hi @pitrou,
I have now fixed the 7 remaining failing UTs, and this patch is ready for review.
Thanks!
Hi @jorisvandenbossche @wjones127 @pitrou @iajoiner,
this PR is ready for review, please have a look. Thanks!
The newly failing UT "RecordReaderByteArrayTest.SkipByteArray" comes from the recent commit #14142. That UT uses the GetBuilderChunks() interface: https://github.com/apache/arrow/blame/6cfe24633b9fe3c474137571940eca35d7a475dc/cpp/src/parquet/column_reader_test.cc#L1181-L1185
This performance PR avoids that interface, so the test fails.
Any suggestions on how to fix this UT? Or should the UT itself be changed?
Thanks!
> Hi @zhixingheyi-tian, sorry this hasn't gotten reviewer attention for a while. Are you still interested in working on this? Would you mind rebasing and cleaning up any commented code? In addition, I noticed there aren't any benchmarks for binary or string types in either of these two benchmark files: src/parquet/arrow/reader_writer_benchmark.cc, src/parquet/parquet_encoding_benchmark.cc. We'll want some benchmarks in order to evaluate the changes. If you add those, then you can compare benchmarks locally (see the archery commands above).
Hi @wjones127, good idea to show the performance.
But how do I add a Binary/String benchmark in *benchmark.cc? I noticed the current Arrow benchmark tests don't seem to support generating random binary data. Could you please give some help or advice on adding String/Binary micro-benchmarks? Thanks!
In my local server end-to-end performance testing:

- CPU: Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
- OS: CentOS 7.6
- Data: single Parquet file with dictionary encoding, gzip compression, 100M rows, 10 columns
- Run: one thread

| | Elapsed time | Speedup |
| --- | --- | --- |
| Arrow-upstream | 27.3s | |
| Arrow-optimizer | 12.2s | 2.23X |
> This performance PR avoids that interface, so the test fails.
> Any suggestions on how to fix this UT?

I'm not sure I understand your question. It's your job to make sure that your changes don't break the existing test suite...
> > This performance PR avoids that interface, so the test fails. Any suggestions on how to fix this UT?
>
> I'm not sure I understand your question. It's your job to make sure that your changes don't break the existing test suite...

Hi @pitrou, I have fixed this UT (RecordReaderByteArrayTest.SkipByteArray) today. Thanks!
Hi @cyb70289 @jorisvandenbossche, would you please review this? Thanks!
Would you fix the CI failures?
> Would you fix the CI failures?

Thanks @cyb70289. I am working on fixing the GitHub CI errors. For "TestArrowReadDeltaEncoding.DeltaByteArray", why is it skipped in my local testing but shown as a failure on CI?
```
# ./parquet-arrow-test --gtest_filter=TestArrowReadDeltaEncoding.DeltaByteArray
Running main() from /home/shen/software/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = TestArrowReadDeltaEncoding.DeltaByteArray
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestArrowReadDeltaEncoding
[ RUN      ] TestArrowReadDeltaEncoding.DeltaByteArray
[  SKIPPED ] TestArrowReadDeltaEncoding.DeltaByteArray (0 ms)
[----------] 1 test from TestArrowReadDeltaEncoding (0 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 0 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] TestArrowReadDeltaEncoding.DeltaByteArray
```
> I am working on fixing the GitHub CI errors. For "TestArrowReadDeltaEncoding.DeltaByteArray", why is it skipped in my local testing but shown as a failure on CI?

I don't know the details, but it should be easy to find in the source code.
Instead of defining entirely separate classes for this, why not change `EncodingTraits<ByteArrayType>::Accumulator` to the following:
```cpp
template <>
struct EncodingTraits<ByteArrayType> {
  // ...
  struct Accumulator {
    std::unique_ptr<::arrow::Int32Builder> offsets_builder;
    std::unique_ptr<::arrow::BufferBuilder> data_builder;
    std::vector<std::shared_ptr<::arrow::Array>> chunks;
  };
};
```
> I don't know the details, but it should be easy to find in the source code.

The new commits should fix all UTs. Please give it a review! Thanks!
> Instead of defining entirely separate classes for this, why not change `EncodingTraits<ByteArrayType>::Accumulator` to the struct suggested above?

If we use the `*Builder` classes, they may add an extra data copy when accumulating each element. Compare the fixed-width path, which appends raw values directly: https://github.com/apache/arrow/blob/5ce8d79d7ae4b3864226cc3c5480fa8eba2e571d/cpp/src/parquet/encoding.cc#L1062-L1070
Thanks!
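For illustration, a minimal self-contained sketch of that buffer-based accumulation (`BuildBinaryDirect` and its inputs are made-up names, not code from this PR): value bytes go straight into a `BufferBuilder` with a single memcpy each, and offsets are tracked in a `TypedBufferBuilder<int32_t>`, with no per-element `BinaryBuilder` bookkeeping.

```cpp
// A minimal sketch, assuming decoded Parquet ByteArray values arrive as
// (pointer, length) pairs. Not the PR's actual code.
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

#include "arrow/array.h"
#include "arrow/buffer_builder.h"
#include "arrow/result.h"
#include "arrow/status.h"

arrow::Result<std::shared_ptr<arrow::BinaryArray>> BuildBinaryDirect(
    const std::vector<std::pair<const uint8_t*, int32_t>>& values) {
  arrow::TypedBufferBuilder<int32_t> offsets_builder;
  arrow::BufferBuilder data_builder;

  int32_t offset = 0;
  ARROW_RETURN_NOT_OK(offsets_builder.Append(offset));
  for (const auto& [ptr, len] : values) {
    // One memcpy of the raw value bytes; no per-element builder bookkeeping.
    ARROW_RETURN_NOT_OK(data_builder.Append(ptr, len));
    offset += len;
    ARROW_RETURN_NOT_OK(offsets_builder.Append(offset));
  }

  std::shared_ptr<arrow::Buffer> offsets, data;
  ARROW_RETURN_NOT_OK(offsets_builder.Finish(&offsets));
  ARROW_RETURN_NOT_OK(data_builder.Finish(&data));
  return std::make_shared<arrow::BinaryArray>(
      static_cast<int64_t>(values.size()), std::move(offsets), std::move(data));
}
```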
cc @pitrou @jorisvandenbossche @wjones127 @cyb70289 @iajoiner. Thanks!
cc @raulcd @wesm @lidavidm
Hi @cyb70289, thanks very much for your kind review!
> But how do I add a Binary/String benchmark in *benchmark.cc? I noticed the current Arrow benchmark tests don't seem to support generating random binary data. Could you please give some help or advice on adding String/Binary micro-benchmarks?
There are many examples you can reference: cpp/src/arrow/testing/random.h defines functions to generate different kinds of random test data.
Please note that the new benchmarks can be a separate PR, which can be reviewed and merged before this one.
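For example, a hypothetical sketch of what such a benchmark could look like in reader_writer_benchmark.cc (the benchmark name, sizes, and seed below are made up; `RandomArrayGenerator::String` comes from the random.h header mentioned above): generate a random string column, write it to an in-memory Parquet file once, and time the reads.

```cpp
// Hypothetical String-read benchmark sketch; not code from this PR.
#include <cstdlib>
#include <memory>

#include "benchmark/benchmark.h"

#include "arrow/api.h"
#include "arrow/io/memory.h"
#include "arrow/testing/random.h"
#include "parquet/arrow/reader.h"
#include "parquet/arrow/writer.h"

// Tiny helper so the sketch stays self-contained.
#define ABORT_IF_NOT_OK(expr)         \
  do {                                \
    ::arrow::Status _st = (expr);     \
    if (!_st.ok()) std::abort();      \
  } while (false)

static void BM_ReadStringColumn(benchmark::State& state) {
  constexpr int64_t kNumRows = 1 << 20;
  ::arrow::random::RandomArrayGenerator rag(/*seed=*/42);
  // Random strings of 8-32 bytes with 10% nulls.
  auto array = rag.String(kNumRows, 8, 32, /*null_probability=*/0.1);
  auto table = ::arrow::Table::Make(
      ::arrow::schema({::arrow::field("x", ::arrow::utf8())}), {array});

  // Write the Parquet file into memory once, outside the timed loop.
  auto sink = ::arrow::io::BufferOutputStream::Create().ValueOrDie();
  ABORT_IF_NOT_OK(::parquet::arrow::WriteTable(
      *table, ::arrow::default_memory_pool(), sink, /*chunk_size=*/kNumRows));
  auto buffer = sink->Finish().ValueOrDie();

  for (auto _ : state) {
    auto source = std::make_shared<::arrow::io::BufferReader>(buffer);
    std::unique_ptr<::parquet::arrow::FileReader> reader;
    ABORT_IF_NOT_OK(::parquet::arrow::OpenFile(
        source, ::arrow::default_memory_pool(), &reader));
    std::shared_ptr<::arrow::Table> out;
    ABORT_IF_NOT_OK(reader->ReadTable(&out));
    ::benchmark::DoNotOptimize(out);
  }
  state.SetItemsProcessed(state.iterations() * kNumRows);
}
BENCHMARK(BM_ReadStringColumn);
```

A dictionary-encoded or compressed variant, matching the end-to-end setup reported above, would pass different WriterProperties to WriteTable.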
@cyb70289 @wjones127 During the last two weeks I was suffering from a high fever. Sorry! I will now continue to work on this patch!
Sorry to hear you were sick @zhixingheyi-tian. I'll soon have a benchmark merged (#15100), and that will give us relevant results in Conbench that will help evaluate these changes.

> I'll soon have a benchmark merged (#15100), and that will give us relevant results in Conbench that will help evaluate these changes.

Thanks!
@ursabot please benchmark command=cpp-micro --suite-filter=parquet-arrow-reader-writer-benchmark
Benchmark runs are scheduled for baseline = f82501e763ff48af610077f9525ae83cc3ab2e95 and contender = 19e2ba3b4bf6aca344cfa179f194163d6c3ef089. Results will be available as each benchmark for each run completes.
Conbench compare runs links:

- [Skipped :warning: Only ['lang', 'name'] filters are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
- [Skipped :warning: Only ['lang', 'name'] filters are supported on test-mac-arm] test-mac-arm
- [Skipped :warning: Only ['lang', 'name'] filters are supported on ursa-i9-9960x] ursa-i9-9960x
- [Finished :arrow_down:0.1% :arrow_up:0.0%] ursa-thinkcentre-m75q

Buildkite builds:

- [Finished] 19e2ba3b ursa-thinkcentre-m75q
- [Finished] f82501e7 ec2-t3-xlarge-us-east-2
- [Failed] f82501e7 test-mac-arm
- [Finished] f82501e7 ursa-i9-9960x
- [Finished] f82501e7 ursa-thinkcentre-m75q

Supported benchmarks:

- ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
- test-mac-arm: Supported benchmark langs: C++, Python, R
- ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
- ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
The failures in continuous-integration/appveyor/pr seem relevant. This test is failing:
https://github.com/apache/arrow/blob/85b167c05c2f93a95b23e8ac4fd4da576ea5b899/python/pyarrow/tests/parquet/test_basic.py#L715-L742
Perhaps we need to make sure that when there are null values we don't copy those parts of the buffer?
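A hedged sketch of that idea (made-up names, not the PR's code): for a null slot, repeat the previous offset and copy nothing, so the values buffer never receives bytes from null positions.

```cpp
// Sketch of null-aware offset generation for a buffer-based reader path.
#include <cstdint>
#include <vector>

#include "arrow/util/bit_util.h"

// Appends offsets for num_values entries. `lengths` is assumed to have one
// entry per slot (ignored for nulls). Null slots repeat the previous offset,
// so their bytes are never copied into the data buffer.
void AppendOffsetsWithNulls(const uint8_t* valid_bits, int64_t valid_bits_offset,
                            const int32_t* lengths, int64_t num_values,
                            std::vector<int32_t>* offsets) {
  if (offsets->empty()) offsets->push_back(0);
  int32_t offset = offsets->back();
  for (int64_t i = 0; i < num_values; ++i) {
    if (arrow::bit_util::GetBit(valid_bits, valid_bits_offset + i)) {
      offset += lengths[i];  // valid: advance by the value's byte length
    }
    // null: offset unchanged, producing a zero-length slot
    offsets->push_back(offset);
  }
}
```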
@zhixingheyi-tian Once you've fixed the issues mentioned, could you also rebase on the latest master branch? Once you do that we can run the benchmark.
> @zhixingheyi-tian Once you've fixed the issues mentioned, could you also rebase on the latest master branch? Once you do that we can run the benchmark.

I have rebased. For performance numbers, see https://github.com/apache/arrow/pull/14353#issue-1402114514
> The failures in continuous-integration/appveyor/pr seem relevant. This test is failing: https://github.com/apache/arrow/blob/85b167c05c2f93a95b23e8ac4fd4da576ea5b899/python/pyarrow/tests/parquet/test_basic.py#L715-L742 Perhaps we need to make sure that when there are null values we don't copy those parts of the buffer?

I haven't touched pyarrow before. pyarrow is built on top of Arrow C++, right? How can I reproduce the issue easily in C++? Thanks!
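Yes, pyarrow is a binding over Arrow C++ (the Parquet path goes through parquet::arrow). A rough, hypothetical way to approximate that failing round trip in pure C++ (the column name and data below are illustrative, not the test's exact fixtures): write a string column containing nulls to an in-memory Parquet file, read it back, and compare.

```cpp
// Hypothetical C++ repro sketch for the nulls round-trip issue.
#include <iostream>
#include <memory>

#include "arrow/api.h"
#include "arrow/io/memory.h"
#include "parquet/arrow/reader.h"
#include "parquet/arrow/writer.h"

// Tiny helper so the sketch stays self-contained.
#define RETURN_IF_NOT_OK(expr)                   \
  do {                                           \
    ::arrow::Status _st = (expr);                \
    if (!_st.ok()) {                             \
      std::cerr << _st.ToString() << std::endl;  \
      return 1;                                  \
    }                                            \
  } while (false)

int main() {
  // Build ["foo", null, "bar"].
  arrow::StringBuilder builder;
  RETURN_IF_NOT_OK(builder.Append("foo"));
  RETURN_IF_NOT_OK(builder.AppendNull());
  RETURN_IF_NOT_OK(builder.Append("bar"));
  std::shared_ptr<arrow::Array> array;
  RETURN_IF_NOT_OK(builder.Finish(&array));

  auto table = arrow::Table::Make(
      arrow::schema({arrow::field("col", arrow::utf8())}), {array});

  // Round trip through an in-memory Parquet file.
  auto sink = arrow::io::BufferOutputStream::Create().ValueOrDie();
  RETURN_IF_NOT_OK(parquet::arrow::WriteTable(
      *table, arrow::default_memory_pool(), sink, /*chunk_size=*/1024));
  auto buffer = sink->Finish().ValueOrDie();

  std::unique_ptr<parquet::arrow::FileReader> reader;
  RETURN_IF_NOT_OK(parquet::arrow::OpenFile(
      std::make_shared<arrow::io::BufferReader>(buffer),
      arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> result;
  RETURN_IF_NOT_OK(reader->ReadTable(&result));

  std::cout << (result->Equals(*table) ? "round trip OK" : "MISMATCH")
            << std::endl;
  return 0;
}
```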